Tabular Synthesis

Overview

The Synthesized SDK contains the synthesized.HighDimSynthesizer object, which is designed to synthesize data from single tables easily and accurately; see Tabular Data for an introduction. On this page we go through that same process in more detail.

The HighDimSynthesizer uses advanced generative modelling techniques to produce synthetic data that closely resembles the original.

Whilst the HighDimSynthesizer works across a variety of use cases, it assumes that there are no temporal or conditional dependencies between rows; each row in the table is assumed to be independent and identically distributed. This means that time-series data, or data where rows depend on one another, will not be synthesized correctly, and those dependencies may be lost.

To integrate your structured data into the Synthesized package, it must first be loaded into a DataFrame using the common data science package pandas. pandas allows for powerful manipulation of structured data and is an essential part of most data analysis in Python.
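For example, a CSV file can be read into a DataFrame before synthesis (the file name here is purely illustrative):

import pandas as pd

df = pd.read_csv("example.csv")  # load structured data into a pandas DataFrame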

Example 1. Synthesizing a Table

In this example we call the utility function synthesized.util.get_example_data(), which produces a small DataFrame ready to be synthesized. The full flow from the original DataFrame to synthesis is detailed below:

import synthesized # import synthesized package

df = synthesized.util.get_example_data()  (1)
df_meta = synthesized.MetaExtractor.extract(df)  (2)
synthesizer = synthesized.HighDimSynthesizer(df_meta)  (3)
synthesizer.learn(df)  (4)
df_synth = synthesizer.synthesize(num_rows=len(df))  (5)
1 Fetch example dataset.
2 Extract metadata information.
3 Create Synthesizer model.
4 Learn the original data.
5 Synthesize new data.

Below, we go through each of these steps in detail.

Metadata

df_meta = synthesized.MetaExtractor.extract(df) # extract metadata information

Before creating a Synthesizer object we first need to extract meta information about the data. This step inspects the DataFrame and tries to deduce things like:

  • Is this column categorical or continuous?

  • What is the domain of this data?

  • Is it a special type (date, address, etc.)?

This information can then be used to inform the Synthesizer how to model the data. The main way to extract it is the .extract(df) method of the synthesized.MetaExtractor class.

At this stage, annotations can be passed to the MetaExtractor; this is essential for generating realistic fake PII such as customer names and addresses.
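The exact annotation classes depend on your SDK version; the following is a minimal sketch assuming a Person annotation type and PersonLabels configuration are available, with hypothetical "first_name" and "last_name" columns:

import synthesized
from synthesized.config import PersonLabels        # assumed import path
from synthesized.metadata.value import Person      # assumed import path

# mark hypothetical name columns so realistic fake PII is generated for them
person = Person(
    name="person",
    labels=PersonLabels(firstname_label="first_name", lastname_label="last_name"),
)
df_meta = synthesized.MetaExtractor.extract(df, annotations=[person])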

HighDimSynthesizer

synthesizer = synthesized.HighDimSynthesizer(df_meta) # create synthesizer object

The next stage builds a blank generative model of the data, ready for the learning process. The HighDimSynthesizer is one of the main generator objects in the Python SDK. It uses cutting-edge techniques in generative modelling to learn the data.

Saving and Loading Models

To save a model as a binary file, use the HighDimSynthesizer.export_model method:

with open("example.synth", "wb") as file_handle:
    synthesizer.export_model(file_handle)

To import this model into a new HighDimSynthesizer instance, use the static method HighDimSynthesizer.import_model:

with open("example.synth", "rb") as file_handle:
    synthesizer2 = synthesized.HighDimSynthesizer.import_model(file_handle)

Training

Now the data can be learnt:

synthesizer.learn(df)  (1)
1 fit(…) can also be used as an alias of the learn(…) method.

Depending on the size of the dataset, this process could take a few minutes to complete. Here the HighDimSynthesizer learns the patterns present in the data so that it can reproduce them during generation.

The num_iterations argument of synthesizer.learn can be set to a specific value in order to constrain the number of learning steps of the Synthesizer. This can be particularly useful for testing any pipelines containing the HighDimSynthesizer before synthesizing data properly.
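For example, to run a short training loop when testing a pipeline (the value here is illustrative):

synthesizer.learn(df, num_iterations=100)  # constrain training to 100 steps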

If a large value is provided to num_iterations, the Synthesizer may still decide to end training early, so training time cannot be increased this way. It is possible to force the Synthesizer to train for longer by calling .learn additional times, as shown below, but the Synthesizer has been designed to learn the dataset in a single call, so this should not be necessary in most cases.
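Each additional call continues training the same model from its current state:

synthesizer.learn(df)  # initial training run
synthesizer.learn(df)  # optional: train for further iterations on the same data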

Whilst the HighDimSynthesizer can use a GPU to improve training time, we mostly encourage CPU training for now. As the dataset is loaded into memory as a pandas DataFrame, system memory usage may need to be monitored to ensure it is not exhausted, which would cause the operating system to start swapping data out to disk.

Training on large amounts of data

The HighDimSynthesizer is designed to learn from data that is fully loaded into a system’s memory, which may be infeasible for a large (possibly distributed) data source. For many datasets, we recommend taking a sub-sample of the dataset and training on that, as shown below. Whilst this may miss some data points, around a million rows can usually be held in memory with ease, giving the Synthesizer a good sample to train on.
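For instance, a uniform random sub-sample can be drawn with pandas before training (the sample size and seed are illustrative):

df_sample = df.sample(n=min(len(df), 1_000_000), random_state=42)  # reproducible sub-sample
synthesizer.learn(df_sample)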

If you find that sub-sampling isn’t achieving the quality of data you’re interested in, please get in touch and we can discuss other solutions or potential improvements to the Synthesized package to help with your use case.

Synthesis

Finally, the Synthesizer can be used to generate data:

df_synth = synthesizer.synthesize(num_rows=1000)  (1)
1 sample(…) can also be used as an alias of the synthesize(…) method.

This will generate a DataFrame with the required number of rows. The process should be very quick in comparison to the time spent training the HighDimSynthesizer.

By default, the Synthesizer will generate missing values in a pattern consistent with the missing values in the original data. Setting the produce_nans argument to False forces it to intelligently generate data with no missing values:

df_synth = synthesizer.synthesize(num_rows=1000, produce_nans=False)

Additional rules or constraints on how the data is generated can also be specified with the ConditionalSampler class as detailed in the Data Rebalancing guide.