Single Table Synthesis
The Synthesized SDK contains the HighDimSynthesizer,
which is designed to synthesize data from single tables easily and accurately;
see the Quickstart for an introduction. On this page we go
through that same process in more detail.
HighDimSynthesizer uses advanced generative modelling techniques to
produce synthetic data that closely resembles the original.
To use your structured data with the
Synthesized package, first load it into a
DataFrame using the common data science package
Pandas.
Pandas allows for powerful manipulation of structured data and is an essential
part of most data analysis in Python.
In this example we call the utility function
synthesized.util.get_example_data(), which produces a small
DataFrame ready to be synthesized. The full flow from the original DataFrame to synthesis
is detailed below:
import synthesized  # import synthesized package

df = synthesized.util.get_example_data()  # (1)
df_meta = synthesized.MetaExtractor.extract(df, annotations=...)  # (2)
synthesizer = synthesized.HighDimSynthesizer(df_meta)  # (3)
synthesizer.learn(df)  # (4)
synthesizer.synthesize(num_rows=len(df))  # (5)

(1) Fetch example dataset.
(2) Extract metadata information.
(3) Create Synthesizer model.
(4) Learn the original data.
(5) Synthesize new data.
Below, we go through each of these steps in detail.
df_meta = synthesized.MetaExtractor.extract(df, annotations=...) # extract metadata information
Before creating a Synthesizer object we first need to extract meta information about the data. This step looks at the DataFrame and tries to deduce things like:
Is this column categorical or continuous?
What is the domain of this data?
Is it a special type (date, address, etc.)?
This can then be used to inform the Synthesizer how to model the data. The main
way of extracting this information is the
synthesized.MetaExtractor class, in particular its extract method.
At this stage annotations can be passed to the
MetaExtractor; this is essential to generate realistic fake PII such as
customer names and addresses.
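To make the questions above concrete, here is a minimal, library-independent sketch of the kind of per-column deduction involved. It is not MetaExtractor's actual logic: the function name guess_column_kind and the max_categories threshold are hypothetical and chosen purely for illustration.

```python
from datetime import datetime


def guess_column_kind(values, max_categories=10):
    """Return a rough guess: 'date', 'continuous' or 'categorical'.

    Illustrative heuristic only; this is NOT how MetaExtractor works.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"

    def is_iso_date(value):
        try:
            datetime.strptime(str(value), "%Y-%m-%d")
            return True
        except ValueError:
            return False

    # Special type: every non-missing value parses as a YYYY-MM-DD date.
    if all(is_iso_date(v) for v in present):
        return "date"

    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    # Numeric columns with many distinct values are treated as continuous.
    if numeric and len(set(present)) > max_categories:
        return "continuous"
    return "categorical"


columns = {
    "age": [23, 35, 41, 52, 29, 60, 18, 47, 33, 38, 55, 44],
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M", "F", "M", "F", "M"],
    "joined": ["2020-01-31", "2021-06-15", None, "2019-11-02"],
}
kinds = {name: guess_column_kind(vals) for name, vals in columns.items()}
```

A heuristic like this cannot recognise PII such as names and addresses from values alone, which is one reason explicit annotations matter at this stage.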
synthesizer = synthesized.HighDimSynthesizer(df_meta) # create synthesizer object
The next stage builds a blank generative model of the data ready for the
learning process. The
HighDimSynthesizer is one of the main generator objects
in the Python SDK. It uses cutting-edge techniques in generative modelling to
learn the data.
Saving and Loading Models
To save a model as a binary file, use the export_model method:
with open("example.synth", "wb") as file_handle:
    synthesizer.export_model(file_handle)
To import this model into a new HighDimSynthesizer instance, use the static import_model method:
with open("example.synth", "rb") as file_handle:
    synthesizer2 = synthesized.HighDimSynthesizer.import_model(file_handle)
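If models are saved and loaded in several places, the two snippets above can be wrapped in small helpers. The following is a sketch only: save_synthesizer and load_synthesizer are hypothetical names, not part of the SDK, and they assume nothing beyond the export_model and import_model calls shown above.

```python
def save_synthesizer(synthesizer, path):
    """Write a synthesizer to `path` via its export_model method."""
    with open(path, "wb") as file_handle:
        synthesizer.export_model(file_handle)


def load_synthesizer(synthesizer_cls, path):
    """Rebuild a synthesizer from `path` via the class's import_model method."""
    with open(path, "rb") as file_handle:
        return synthesizer_cls.import_model(file_handle)


# Usage, assuming `synthesizer` is a HighDimSynthesizer instance:
# save_synthesizer(synthesizer, "example.synth")
# synthesizer2 = load_synthesizer(synthesized.HighDimSynthesizer, "example.synth")
```

Passing the class explicitly keeps the helpers independent of any particular import of the package.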
Now the data can be learnt with synthesizer.learn(df).
Depending on the size of the dataset, this process could take a few minutes to
complete. Here the
HighDimSynthesizer will learn patterns present in the
data so that it can generate them later.
The num_iterations argument in
synthesizer.learn can be set to a
specific value in order to constrain the number of learning steps of the
Synthesizer. This can be particularly useful for testing any pipelines containing the
HighDimSynthesizer before trying to synthesize data.
If a large value is provided to
num_iterations, the Synthesizer may still decide
to end training early, so training time cannot be increased
in this way. It is possible to force the Synthesizer to train for longer by
calling .learn additional times. The Synthesizer has been designed to learn
the dataset in a single call, so this should not be necessary in most cases.
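As a sketch of the pipeline-testing use of num_iterations described above, the helper below trains for a small number of steps and generates a few rows to confirm that everything runs end to end. The name smoke_test_pipeline and the default values 50 and 10 are illustrative assumptions; only learn(..., num_iterations=...) and synthesize(num_rows=...) come from this page.

```python
def smoke_test_pipeline(synthesizer, df, num_iterations=50, num_rows=10):
    """Train briefly and generate a few rows to check the pipeline runs.

    The output is only useful for testing; a model trained for so few
    steps should not be expected to produce realistic data.
    """
    synthesizer.learn(df, num_iterations=num_iterations)
    sample = synthesizer.synthesize(num_rows=num_rows)
    if len(sample) != num_rows:
        raise RuntimeError(f"expected {num_rows} rows, got {len(sample)}")
    return sample


# Usage, assuming `synthesizer` and `df` were created as in the steps above:
# smoke_test_pipeline(synthesizer, df)
```

Once the smoke test passes, call learn without num_iterations for the real training run.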
Training on large amounts of data
HighDimSynthesizer is designed to learn from data that is fully loaded into
a system’s memory, which may be infeasible for a large (possibly distributed)
data source. For many datasets, we recommend taking a sub-sample of the dataset
and training on that. Whilst this may miss some data points, in many
cases about a million rows of data fit easily in memory, which
should give a good sample for the Synthesizer to train on.
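When the source is too large to load before sampling, one option is to draw the sub-sample while streaming through it. The sketch below uses reservoir sampling (Algorithm R) over a CSV file with only the standard library. It is one possible approach rather than a feature of the SDK, and the function names are hypothetical.

```python
import csv
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = row
    return reservoir


def sample_csv(path, k, seed=None):
    """Sample k data rows from a CSV file without loading it all into memory."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        return header, reservoir_sample(reader, k, seed=seed)
```

The sampled rows can then be loaded with pandas.DataFrame(rows, columns=header); the csv module yields strings, so column types may need converting afterwards. If the data already fits in a DataFrame, DataFrame.sample achieves the same result directly.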
If you’re finding that sub-sampling data isn’t achieving the quality of data you’re interested in, please get in touch and we can discuss other solutions or potential improvements to the Synthesized package to help with your use-case.
Finally, the Synthesizer can be used to generate data:
df_synth = synthesizer.synthesize(num_rows=1000)
This will generate a DataFrame with the required number of rows. The process
should be very quick in comparison to the time spent training the HighDimSynthesizer.
By default, the Synthesizer will generate missing values in a pattern
consistent with missing values in the original data. Altering the
Additional rules or constraints on how the data is generated can also be
specified with the
ConditionalSampler class as detailed in the Data