Tabular Data

Synthesized can quickly generate realistic, high-dimensional tabular data. It does this by learning a high-dimensional model of the original data that preserves its statistical properties, including the distributions of individual fields and the correlations between them.

Synthesized interfaces with the pandas package, so the data to be synthesized must first be loaded into a pandas.DataFrame object.
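If your own data lives in a CSV file, a minimal sketch of loading it with pandas might look like the following. The inline CSV text, the column names, and the variable name df_own are invented for illustration; in practice you would pass a file path to pandas.read_csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """age,income,city
34,52000,London
45,,Paris
29,48000,Berlin
"""

# pandas.read_csv accepts a file path or any file-like object.
df_own = pd.read_csv(io.StringIO(csv_text))

print(df_own.dtypes)        # inferred type of each column
print(df_own.isna().sum())  # number of missing values per column
```

Inspecting the inferred types and missing-value counts before training is a cheap way to catch columns that were parsed differently from what you expected.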

First, load in some example data from the package:

import synthesized
df = synthesized.util.get_example_data()

The data is returned as a pandas.DataFrame. Generating new data is then relatively straightforward, as the code snippet below shows.

df_meta = synthesized.MetaExtractor.extract(df) # extract metadata
synth = synthesized.HighDimSynthesizer(df_meta) # instantiate model
synth.learn(df) # train model
df_synth = synth.synthesize(num_rows=1000) # synthesize new data

These four actions (extract metadata, instantiate model, train model, generate data) are the main flow for generating data with the Synthesized engine. Learn more about this in Single Table Synthesis.
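After generating data, it can be useful to check informally that basic statistics were preserved. The sketch below is one possible way to do this with plain pandas and is not part of the Synthesized API. The helper compare_numeric_summary and the two small stand-in frames are hypothetical; with real data you would pass df and df_synth instead.

```python
import pandas as pd


def compare_numeric_summary(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Place the mean and standard deviation of shared numeric columns side by side."""
    # Keep numeric columns of the original that also appear in the synthetic frame.
    numeric_cols = original.select_dtypes(include="number").columns.intersection(synthetic.columns)
    return pd.DataFrame({
        "orig_mean": original[numeric_cols].mean(),
        "synth_mean": synthetic[numeric_cols].mean(),
        "orig_std": original[numeric_cols].std(),
        "synth_std": synthetic[numeric_cols].std(),
    })


# Hand-written stand-ins for the original and synthetic data (illustrative values only).
df_demo = pd.DataFrame({"age": [30, 40, 50], "income": [40000.0, 50000.0, 60000.0]})
df_synth_demo = pd.DataFrame({"age": [32, 41, 47], "income": [42000.0, 49000.0, 59000.0]})

summary = compare_numeric_summary(df_demo, df_synth_demo)
print(summary)
```

This only compares per-column means and spreads. Comparing original.corr(numeric_only=True) with its synthetic counterpart would extend the same idea to correlations between fields.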

The synthetic data generation workflow can then be extended with Entity Annotation, Data Rebalancing, and Rules.

By default, Synthesized will produce NaNs in the synthetic dataset where they exist in the original data. Synthesized can also impute missing values with the generator: set produce_nans=False when you synthesize data.

df_synth = synth.synthesize(num_rows=1000, produce_nans=False)
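To confirm the effect of produce_nans, one option is to measure the fraction of missing values per column before and after. The following is a minimal sketch using plain pandas. The helper missing_fraction and both demo frames are hypothetical stand-ins, not output from Synthesized; with real data you would apply the helper to df and df_synth.

```python
import pandas as pd


def missing_fraction(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values in each column."""
    return frame.isna().mean()


# Stand-in for data that contains missing values.
df_with_nans = pd.DataFrame({
    "age": [30.0, float("nan"), 50.0, 41.0],
    "city": ["London", "Paris", None, "Rome"],
})

# Stand-in for data generated with imputation (no missing values expected).
df_imputed = pd.DataFrame({
    "age": [30.0, 38.0, 50.0, 41.0],
    "city": ["London", "Paris", "Berlin", "Rome"],
})

print(missing_fraction(df_with_nans))
print(missing_fraction(df_imputed))

# True when no cell in the frame is missing.
has_no_nans = not df_imputed.isna().any().any()
print(has_no_nans)
```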