Synthesized can be used to quickly generate realistic, high-dimensional tabular data. This is achieved by learning a high-dimensional model of the original data that preserves its statistical properties, distributions, and correlations between fields.


Synthesized interfaces with the pandas package, so the data to be synthesized must first be loaded into a pandas.DataFrame object.
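If your own data lives in a CSV file, a minimal sketch of this loading step might look like the following. The inline CSV text and its column names are purely illustrative assumptions; in practice you would pass a file path to pandas.read_csv.

```python
import io

import pandas as pd

# Illustrative CSV content (hypothetical columns); replace with a real file path.
csv_text = "age,income,city\n34,52000,London\n45,61000,Leeds\n29,,London\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred column types before handing the frame to a synthesizer.
print(df.dtypes)
```

The empty income field in the last row is read as NaN, which is worth knowing given how missing values are handled later in this guide.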

For this guide, we load some example data from the package:

In [1]: import synthesized

In [2]: df = synthesized.util.get_example_data()

This data is returned as a pandas.DataFrame. Generating new data is then relatively straightforward, as the code snippet below shows.

In [3]: df_meta = synthesized.MetaExtractor.extract(df) # extract metadata

In [4]: synth = synthesized.HighDimSynthesizer(df_meta) # instantiate model

In [5]: synth.learn(df, num_iterations=None) # train model

In [6]: df_synth = synth.synthesize(num_rows=1000) # synthesize new data

These four actions (extract metadata, instantiate model, train model, generate data) are the main flow for generating data with the Synthesized engine. Learn more about this in the Single Table Synthesis guide. An API reference for metadata and synthesis can be found here:

This flow can then be extended with annotations, conditional sampling, and business logic.
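Once df_synth has been generated, a quick sanity check is to compare basic statistics of the original and synthetic data. The sketch below uses only pandas; the helper name compare_numeric_summary and the toy frames standing in for df and df_synth are assumptions for illustration, not part of the Synthesized API.

```python
import pandas as pd


def compare_numeric_summary(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return mean and standard deviation side by side for shared numeric columns."""
    # Numeric columns of the original that also exist in the synthetic frame.
    cols = original.select_dtypes("number").columns.intersection(synthetic.columns)
    return pd.DataFrame({
        "orig_mean": original[cols].mean(),
        "synth_mean": synthetic[cols].mean(),
        "orig_std": original[cols].std(),
        "synth_std": synthetic[cols].std(),
    })


# Toy stand-ins for df and df_synth (hypothetical values).
original = pd.DataFrame({"age": [30, 40, 50], "income": [40.0, 50.0, 60.0], "city": ["a", "b", "c"]})
synthetic = pd.DataFrame({"age": [32, 41, 47], "income": [42.0, 49.0, 59.0], "city": ["a", "c", "b"]})

summary = compare_numeric_summary(original, synthetic)
print(summary)
```

In a real session you would call compare_numeric_summary(df, df_synth). Comparing df.corr(numeric_only=True) with df_synth.corr(numeric_only=True) gives a similar rough view of how well correlations between fields are preserved.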

By default, Synthesized imputes any missing values. To generate NaN values instead, pass produce_nans=True to synthesize:

In [7]: df_synth = synth.synthesize(num_rows=1000, produce_nans=True)
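To see how many missing values a DataFrame actually contains, you can compute the fraction of NaN entries per column with plain pandas. This is a minimal sketch; the helper name missing_fraction and the toy frame are illustrative assumptions, and in practice you would pass df or df_synth.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame: pd.DataFrame) -> pd.Series:
    """Fraction of missing (NaN/None) entries in each column."""
    # isna() yields a boolean frame; the column-wise mean is the share of True values.
    return frame.isna().mean()


# Toy stand-in for df_synth (hypothetical values).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "y", "z"],
})

fractions = missing_fraction(toy)
print(fractions)
```

Running the same check on the original data and on the output generated with produce_nans=True shows whether the proportion of missing values is roughly comparable.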