Synthesized can quickly generate realistic, high-dimensional tabular data. It does this by learning a high-dimensional model of the original data that preserves its statistical properties, distributions, and correlations between fields.
First, load in some example data from the package:
import synthesized
df = synthesized.util.get_example_data()
The data is returned as a pandas.DataFrame. Generating new data from it is then relatively straightforward, as the code snippet below shows.
df_meta = synthesized.MetaExtractor.extract(df)  # extract metadata
synth = synthesized.HighDimSynthesizer(df_meta)  # instantiate model
synth.learn(df)                                  # train model
df_synth = synth.synthesize(num_rows=1000)       # synthesize new data
These four actions (extract metadata, instantiate model, train model, generate data) are the main flow for generating data with the Synthesized engine. Learn more about this in the Tabular.
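One informal way to check that correlations between fields carry over is to compare the correlation matrices of the original and generated frames. The sketch below uses only pandas and NumPy. The helper name max_correlation_gap is hypothetical and not part of the Synthesized package, and the demo frames are stand-ins (a bootstrap resample plays the role of the synthetic data) so the example runs without the SDK.

```python
import numpy as np
import pandas as pd


def max_correlation_gap(df_orig: pd.DataFrame, df_synth: pd.DataFrame) -> float:
    """Largest absolute difference between the two Pearson correlation matrices.

    Hypothetical helper for illustration; only numeric columns present in
    both frames are compared.
    """
    cols = df_orig.select_dtypes("number").columns.intersection(df_synth.columns)
    gap = (df_orig[cols].corr() - df_synth[cols].corr()).abs()
    return float(np.nanmax(gap.to_numpy()))


# Stand-in data: two correlated numeric columns.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
df_orig_demo = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(size=500)})

# A bootstrap resample stands in for a synthetic frame here (assumption).
df_synth_demo = df_orig_demo.sample(
    frac=1.0, replace=True, random_state=0
).reset_index(drop=True)

correlation_gap = max_correlation_gap(df_orig_demo, df_synth_demo)
print(f"largest correlation difference: {correlation_gap:.3f}")
```

In practice the same helper could be called with df and df_synth from the snippet above. A small gap suggests pairwise linear relationships were preserved, though it says nothing about non-linear structure or categorical fields.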
By default, Synthesized will produce NaNs in the synthetic dataset where they exist in the original data.
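To see how missing values compare between the two datasets, the share of NaNs per column can be tabulated with plain pandas. This is a minimal sketch: missing_rates is a hypothetical helper rather than part of the Synthesized package, and the two small frames are hand-made stand-ins for the original and synthetic data.

```python
import numpy as np
import pandas as pd


def missing_rates(df_orig: pd.DataFrame, df_synth: pd.DataFrame) -> pd.DataFrame:
    """Fraction of NaN values per column, original and synthetic side by side."""
    return pd.DataFrame(
        {
            "original": df_orig.isna().mean(),
            "synthetic": df_synth.isna().mean(),
        }
    )


# Hand-made stand-in frames (assumption: real use would pass df and df_synth).
orig_demo = pd.DataFrame(
    {
        "age": [31.0, np.nan, 45.0, np.nan],
        "income": [52000.0, 61000.0, np.nan, 48000.0],
    }
)
synth_demo = pd.DataFrame(
    {
        "age": [29.0, 40.0, np.nan, np.nan],
        "income": [50000.0, np.nan, 58000.0, 47000.0],
    }
)

rates = missing_rates(orig_demo, synth_demo)
print(rates)
```

Similar rates in both columns would be consistent with the default behaviour described above.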
Synthesized has the capability to impute missing values with the generator, which can easily be done by