Single Table Synthesis#

Overview#

The Synthesized SDK contains the synthesized.HighDimSynthesizer object, which is designed to synthesize data from single tables easily and accurately; see the quickstart for an introduction. On this page we go through that same process in more detail.

The HighDimSynthesizer uses advanced generative modelling techniques to produce synthetic data that closely resembles the original.

Important

Whilst the HighDimSynthesizer works across a variety of use cases, it assumes that there are no temporal or conditional dependencies between rows; each row in the table is assumed to be independent and identically distributed. This means that time-series data, or data where rows depend on one another, will not be synthesized correctly and those dependencies can be lost.

To integrate your structured data into the Synthesized package, it must be loaded as a DataFrame using the common data science package pandas. Pandas allows for powerful manipulation of structured data and is an essential part of most data analysis in Python.
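For example, if your data lives in a CSV file it can be loaded with pandas along these lines (the file name here is purely illustrative):

import pandas as pd

# illustrative file name; replace with the path to your own data
df = pd.read_csv("customers.csv")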

In this example we call the utility function synthesized.util.get_example_data(), which produces a small DataFrame ready to be synthesized. The full flow from the original DataFrame to synthesis is detailed below:

In [1]: import synthesized # import synthesized package

In [2]: df = synthesized.util.get_example_data() # grab example data frame

In [3]: df_meta = synthesized.MetaExtractor.extract(df, annotations=...) # extract metadata information

In [4]: synthesizer = synthesized.HighDimSynthesizer(df_meta) # create synthesizer object

In [5]: synthesizer.learn(df) # learn the original data

In [6]: synthesizer.synthesize(num_rows=1000) # synthesize new data

Below, we go through each of these steps in detail.

Metadata#

In [7]: df_meta = synthesized.MetaExtractor.extract(df, annotations=...) # extract metadata information

Before creating a Synthesizer object, we first need to extract meta information about the data. This step looks at the DataFrame and tries to deduce things like:

  • Is this column categorical or continuous?

  • What is the domain of this data?

  • Is it a special type (date, address, etc.)?

This information can then be used to inform the Synthesizer how to model the data. The main way of extracting it is the .extract(df) method of the synthesized.MetaExtractor class. Learn more about this in the documentation.

At this stage annotations can be passed to the MetaExtractor; this is essential for generating realistic fake PII such as customer names and addresses.
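As a rough sketch only, annotating name columns might look something like the following. The Person and PersonLabels names and their fields are assumptions here and should be checked against the annotations documentation:

# sketch only: Person/PersonLabels and their field names are assumptions;
# consult the annotations documentation for the exact classes to import
from synthesized.metadata.value import Person, PersonLabels

person = Person(
    name="person",
    labels=PersonLabels(firstname="first_name", lastname="last_name"),
)
df_meta = synthesized.MetaExtractor.extract(df, annotations=[person])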

HighDimSynthesizer#

In [8]: synthesizer = synthesized.HighDimSynthesizer(df_meta) # create synthesizer object

The next stage builds a blank generative model of the data, ready for the learning process. The HighDimSynthesizer is the workhorse of the Synthesized package; it uses cutting-edge techniques in generative modelling to learn the data.

To learn more about how to interact with this object, see the API documentation.

Saving and Loading Models#

To save a model as a binary file use the HighDimSynthesizer.export_model method.

In [9]: with open("example.synth", "wb") as file_handle:
   ...:     synthesizer.export_model(file_handle)
   ...: 

To import this model into a new HighDimSynthesizer instance, use the static method HighDimSynthesizer.import_model:

In [10]: with open("example.synth", "rb") as file_handle:
   ....:     synthesizer2 = synthesized.HighDimSynthesizer.import_model(file_handle)
   ....: 
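Assuming the exported model had already been trained, the imported instance can be used in exactly the same way as the original, for example to generate data:

# the imported synthesizer behaves like the original instance
df_check = synthesizer2.synthesize(num_rows=10)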

Training#

Now the data can be learnt:

In [11]: synthesizer.learn(df)

Note

fit(...) can also be used as an alias of the learn(...) method.

Depending on the size of the dataset, this process could take a few minutes to complete. Here the HighDimSynthesizer learns the patterns present in the data so that it can reproduce them when generating.

The num_iterations argument of synthesizer.learn can be set to a specific value in order to constrain the number of learning steps the Synthesizer performs. This can be particularly useful for testing any pipelines containing the HighDimSynthesizer before synthesizing data properly.
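For example, a quick smoke test of a pipeline might limit training to a handful of steps (the value here is illustrative, not a recommendation):

# run only a few learning steps to check the pipeline end-to-end
synthesizer.learn(df, num_iterations=10)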

If a large value is provided to num_iterations, the Synthesizer may decide to end training early regardless, so training time cannot be increased this way. It is possible to force the Synthesizer to train for longer by calling .learn additional times, although the Synthesizer has been designed to learn the dataset in a single call, so this should not be necessary in most cases.

Note

Whilst the HighDimSynthesizer can use a GPU to improve training time, we mostly encourage CPU training for now. As the dataset is loaded into memory as a pandas DataFrame, the system's memory usage may need to be monitored to ensure it is not exhausted, causing the operating system to start swapping data out to disk.

Training on large amounts of data#

The HighDimSynthesizer is designed to learn from data that is fully loaded into a system's memory, which may be infeasible for a large (possibly distributed) data source. For many datasets, we recommend taking a sub-sample of the data and training on that. Whilst this may miss some data points, in most cases around a million rows fit comfortably in memory and provide a good sample for the Synthesizer to train on, as sketched below.
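A minimal sketch of this approach, assuming the full dataset is already available as a DataFrame df (the sample size and seed are illustrative):

# draw a random sub-sample that comfortably fits in memory
df_sample = df.sample(n=1_000_000, random_state=42)

df_meta = synthesized.MetaExtractor.extract(df_sample)
synthesizer = synthesized.HighDimSynthesizer(df_meta)
synthesizer.learn(df_sample)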

Note

If you’re finding that sub-sampling data isn’t achieving the quality of data you’re interested in, please get in touch and we can discuss other solutions or potential improvements to the Synthesized package to help with your use-case.

Synthesis#

Finally, the Synthesizer can be used to generate data:

In [12]: df_synth = synthesizer.synthesize(num_rows=1000)

Note

sample(...) can also be used as an alias of the synthesize(...) method.

This will generate a DataFrame with the required number of rows. The process should be very quick in comparison to the time spent training the HighDimSynthesizer.

By default, the Synthesizer will not generate missing values, even if they exist in the original data. Setting the produce_nans argument to True will force it to generate missing values in a pattern consistent with the input dataset:

In [13]: df_synth = synthesizer.synthesize(num_rows=1000, produce_nans=True)
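A simple way to sanity-check this behaviour is to compare the per-column missing-value counts of the original and synthetic DataFrames using standard pandas calls:

# compare the number of NaNs per column in the original and synthetic data
print(df.isna().sum())
print(df_synth.isna().sum())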

Additional rules or constraints on how the data is generated can also be specified with the ConditionalSampler class as detailed in the Data Rebalancing guide.