Tabular Synthesis
Overview
The Synthesized SDK contains the synthesized.HighDimSynthesizer
object, which is designed to synthesize data from single tables easily and
accurately; see the Tabular Data guide for an introduction. On this page we go
through that same process in more detail.
The HighDimSynthesizer
uses advanced generative modelling techniques to
produce synthetic data that closely resembles the original.
To integrate your structured data into the Synthesized package, it must first
be loaded as a DataFrame using Pandas, the common data science package.
Pandas allows for powerful manipulation of structured data and is an essential
part of most data analysis in Python.
In this example we call the utility function
synthesized.util.get_example_data(),
which produces a small DataFrame
ready to be synthesized. The full flow from the original DataFrame to synthesis
is detailed below:
import synthesized # import synthesized package
df = synthesized.util.get_example_data() (1)
df_meta = synthesized.MetaExtractor.extract(df, annotations=...) (2)
synthesizer = synthesized.HighDimSynthesizer(df_meta) (3)
synthesizer.learn(df) (4)
df_synth = synthesizer.synthesize(num_rows=len(df)) (5)
1 | Fetch example dataset. |
2 | Extract metadata information. |
3 | Create Synthesizer model. |
4 | Learn the original data. |
5 | Synthesize new data. |
Below, we go through each of these steps in detail.
Metadata
df_meta = synthesized.MetaExtractor.extract(df, annotations=...) # extract metadata information
Before creating a Synthesizer object we first need to extract meta information about the data. This step looks at the DataFrame and tries to deduce things like:
-
Is this column categorical or continuous?
-
What is the domain of this data?
-
Is it a special type (date, address, etc.)?
This information can then be used to inform the Synthesizer how to model the
data. The main way to extract it is with the
synthesized.MetaExtractor
class, in particular its .extract(df)
method.
At this stage, annotations can be passed to the
MetaExtractor. This is essential for generating realistic fake PII such as
customer names and addresses.
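As an illustrative sketch, an annotation might look like the following. Note that the import path, the Person and PersonLabels names, and their keyword arguments are assumptions about the annotation API and may differ in your version of the SDK:
# Assumed import path and class names -- verify against your SDK version
from synthesized.metadata.value import Person, PersonLabels

# Hypothetical example: mark two columns as the name parts of a single person
person = Person(
    name="person",
    labels=PersonLabels(
        firstname_label="first_name",  # illustrative column name
        lastname_label="last_name",    # illustrative column name
    ),
)
df_meta = synthesized.MetaExtractor.extract(df, annotations=[person])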
HighDimSynthesizer
synthesizer = synthesized.HighDimSynthesizer(df_meta) # create synthesizer object
The next stage builds a blank generative model of the data ready for the
learning process. The HighDimSynthesizer
is one of the main generator objects
in the Python SDK. It uses cutting-edge techniques in generative modelling to
learn the data.
Saving and Loading Models
To save a model as a binary file, use the HighDimSynthesizer.export_model
method:
with open("example.synth", "wb") as file_handle:
synthesizer.export_model(file_handle)
To import this model into a new HighDimSynthesizer instance, use the static
method HighDimSynthesizer.import_model:
with open("example.synth", "rb") as file_handle:
synthesizer2 = synthesized.HighDimSynthesizer.import_model(file_handle)
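As a quick sanity check, the re-imported instance behaves exactly like the original, so it can synthesize data straight away:
# The loaded model is already trained; no further .learn call is needed
df_check = synthesizer2.synthesize(num_rows=100)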
Training
Now the data can be learnt:
synthesizer.learn(df) (1)
1 | fit(…) can also be used as an alias of the learn(…) method. |
Depending on the size of the dataset this process could take a few minutes to
complete. Here the HighDimSynthesizer
will learn patterns present in the
data so that it can reproduce them during synthesis.
The num_iterations
argument in synthesizer.learn
can be set to a
specific value in order to constrain the number of learning steps of the
Synthesizer. This can be particularly useful for testing any pipelines
containing the HighDimSynthesizer
before synthesizing data
properly.
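For example, a quick smoke test of a pipeline might cap training at a small number of steps:
# Short training run to exercise the pipeline end-to-end; output quality
# will be poor, but every stage runs quickly.
synthesizer.learn(df, num_iterations=100)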
If a large value is provided to num_iterations,
the Synthesizer may decide
to end training early regardless, so training time cannot be increased
in this way. It is possible to force the Synthesizer to train for longer by
calling .learn
additional times, as sketched below. The Synthesizer has been designed to learn
the dataset in a single call, so this should not be necessary in most cases.
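A minimal sketch of this pattern:
# Force additional training passes; rarely needed in practice
for _ in range(3):
    synthesizer.learn(df)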
Training on large amounts of data
The HighDimSynthesizer
is designed to learn from data that is fully loaded into
a system’s memory, which may be infeasible for a large (possibly distributed)
data source. For many datasets, we recommend taking a sub-sample of the dataset
and training on that. Whilst this may miss some data points, around a million
rows will usually fit comfortably in memory and give the Synthesizer a good
sample to train on.
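A minimal sketch of this approach using Pandas, where df_large and the sample size are illustrative and should be adjusted to your data and memory budget:
# Sub-sample the data before extraction and training
df_sample = df_large.sample(n=1_000_000, random_state=42)
df_meta = synthesized.MetaExtractor.extract(df_sample)
synthesizer = synthesized.HighDimSynthesizer(df_meta)
synthesizer.learn(df_sample)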
If you’re finding that sub-sampling data isn’t achieving the quality of data you’re interested in, please get in touch and we can discuss other solutions or potential improvements to the Synthesized package to help with your use-case.
Synthesis
Finally, the Synthesizer can be used to generate data:
df_synth = synthesizer.synthesize(num_rows=1000) (1)
1 | sample(…) can also be used as an alias of the synthesize(…) method. |
This will generate a DataFrame with the required number of rows. The process
should be very quick in comparison to the time spent training the
HighDimSynthesizer.
By default, the Synthesizer will generate missing values in a pattern
consistent with missing values in the original data.
Additional rules or constraints on how the data is generated can also be
specified with the ConditionalSampler
class, as detailed in the Data
Rebalancing guide.