Synthesized can be used to quickly generate realistic, high-dimensional tabular data. It does so by learning a high-dimensional model of the original data that preserves its statistical properties, including the distributions of individual fields and the correlations between them.
First, load some example data from the package:
In : import synthesized
In : df = synthesized.util.get_example_data()
The data is returned as a pandas.DataFrame. Generating new data from it is then relatively straightforward, as the code snippet below shows.
In : df_meta = synthesized.MetaExtractor.extract(df)  # extract metadata
In : synth = synthesized.HighDimSynthesizer(df_meta)  # instantiate model
In : synth.learn(df, num_iterations=None)  # train model
In : df_synth = synth.synthesize(num_rows=1000)  # synthesize new data
These four actions (extract metadata, instantiate model, train model, generate data) are the main flow for generating data with the Synthesized engine. Learn more about this in the Single Table Synthesis guide. An API reference for metadata and synthesis is also available.
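The four steps can also be wrapped in a single helper. The following is a minimal sketch, not part of the Synthesized API: the function name `synthesize_like` and its parameters are hypothetical, and it relies only on the calls shown in this guide. The import is deferred so that the function can be defined even where the package is not installed.

```python
def synthesize_like(df, num_rows=1000, **synthesize_kwargs):
    """Hypothetical helper: extract metadata, build the model, train, generate.

    Not part of the Synthesized API; it only chains the calls shown in this guide.
    Extra keyword arguments are forwarded unchanged to synth.synthesize().
    """
    # Deferred import: the package is only required when the helper is called.
    import synthesized

    df_meta = synthesized.MetaExtractor.extract(df)   # 1. extract metadata
    synth = synthesized.HighDimSynthesizer(df_meta)   # 2. instantiate model
    synth.learn(df, num_iterations=None)              # 3. train model
    # 4. generate data with the trained model
    return synth.synthesize(num_rows=num_rows, **synthesize_kwargs)
```

With the package installed, `df_synth = synthesize_like(df, num_rows=1000)` would then be equivalent to running the four steps one by one.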
By default, Synthesized imputes any missing values. To generate missing values (NaNs) in the synthetic data as well, pass produce_nans=True:
In : df_synth = synth.synthesize(num_rows=1000, produce_nans=True)
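One way to check whether missing values appear in the output is to compare the per-column NaN fraction of the original and synthetic frames. The sketch below uses only pandas. `nan_fractions` is a hypothetical helper, and the two small frames are hand-built stand-ins for illustration, not Synthesized output.

```python
import pandas as pd


def nan_fractions(original, synthetic):
    """Hypothetical helper: per-column fraction of missing values, side by side."""
    return pd.DataFrame({
        "original": original.isna().mean(),
        "synthetic": synthetic.isna().mean(),
    })


nan = float("nan")

# Stand-in frames for illustration only; in practice pass df and df_synth.
df_demo = pd.DataFrame({"age": [25, nan, 40, 31], "income": [1.0, 2.0, nan, nan]})
df_synth_demo = pd.DataFrame({"age": [30, 28, nan, 35], "income": [nan, 1.5, 2.5, nan]})

report = nan_fractions(df_demo, df_synth_demo)
print(report)
```

A synthetic column whose fraction is zero while the original is not suggests that the values were imputed rather than generated as NaNs.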