synthesized.complex.HighDimSynthesizer

class HighDimSynthesizer(df_meta, config=None, type_overrides=None, summarizer_dir=None, summarizer_name=None)

Synthesizer that can learn to generate data from a single tabular dataset.

The data must be in a tabular format (i.e. a set of columns and rows), where each row is independent and there is no temporal or conditional relation between rows. The Synthesizer will learn the underlying distribution of the original data, and is capable of generating new synthetic rows of data that maintain the correlations and associations across columns.

Parameters
  • df_meta (DataFrameMeta) – A DataFrameMeta instance that has been extracted for the desired dataset.

  • config (HighDimConfig, optional) – The configuration to use for this Synthesizer. Defaults to None, in which case the default options of HighDimConfig are used.

  • type_overrides (List[Union[ContinuousModel, DiscreteModel]], optional) – Custom type specifications for each column that will override the defaults inferred from the data. These must be instantiated Model classes, e.g. Histogram or KernelDensityEstimate. Defaults to None, in which case the types are automatically inferred.

  • summarizer_dir (str, optional) – Path to a directory where TensorBoard summaries of the training logs will be stored. Defaults to None.

  • summarizer_name (str, optional) – A prefix for the subdirectory where training logs for this Synthesizer are stored. If set, logs will be stored in summarizer_dir/summarizer_name_%Y%m%d-%H%M%S, where the timestamp is set at the time of instantiation. Defaults to None.

Examples

Load dataset into a pandas DataFrame:

>>> df = pd.read_csv('dataset.csv')

Extract the DataFrameMeta:

>>> df_meta = MetaExtractor.extract(df)

Initialise a HighDimSynthesizer with the default configuration:

>>> synthesizer = HighDimSynthesizer(df_meta=df_meta)

Learn a model of the original data by training for 100 iterations:

>>> synthesizer.learn(df_train=df, num_iterations=100)

Generate 1000 rows of new data:

>>> df_synthetic = synthesizer.synthesize(num_rows=1000)

Set a column to be categorical instead of continuous:

>>> column_meta = df_meta['column_name']
>>> column_model = synthesized.model.models.Histogram(meta=column_meta)
>>> synthesizer = HighDimSynthesizer(df_meta=df_meta, type_overrides=[column_model])

Methods

__init__(df_meta[, config, type_overrides, …])

Initialize self.

export_model(fp[, title, description, author])

Save HighDimSynthesizer to file.

import_model(fp)

Load HighDimSynthesizer from file.

learn(df_train[, num_iterations, callback, …])

Learn the underlying distribution of the original data by training the synthesizer.

synthesize(num_rows[, produce_nans, …])

Generate the given number of new data rows.

synthesize_from_rules(num_rows[, …])

Generate a given number of data rows according to specified rules.

export_model(fp, title='HighDimSynthesizer', description=None, author=None)

Save HighDimSynthesizer to file.

Parameters
  • fp (BinaryIO) – File object able to write bytes-like objects.

  • title (str, optional) – Identifier for this synthesizer. Defaults to 'HighDimSynthesizer'.

  • description (str, optional) – Metadata. Defaults to None.

  • author (str, optional) – Author metadata. Defaults to None.

Examples

Open binary file and save HighDimSynthesizer:

>>> with open('synthesizer.bin', 'wb') as f:
...     synthesizer.export_model(f)

static import_model(fp)

Load HighDimSynthesizer from file.

Parameters

fp (BinaryIO) – File object able to read bytes-like objects.

Examples

Open binary file and load HighDimSynthesizer:

>>> with open('synthesizer.bin', 'rb') as f:
...     synthesizer = HighDimSynthesizer.import_model(f)

learn(df_train, num_iterations=None, callback=None, callback_freq=0)

Learn the underlying distribution of the original data by training the synthesizer.

This method can be called multiple times to continue the training process.

Parameters
  • df_train (pd.DataFrame) – Training data that matches schema of the DataFrameMeta used by this synthesizer.

  • num_iterations (int, optional) – The number of training iterations (not epochs) to perform. Defaults to None, in which case the learning process is intelligently stopped as the synthesizer converges.

  • callback (Callable, optional) – A callback function that is passed the synthesizer instance, the iteration number, and a dictionary of the loss function values as arguments. Aborts training if the return value is True. Defaults to None.

  • callback_freq (int, optional) – The number of training iterations to perform before the callback is called. Defaults to 0, in which case the callback is never called.

Return type

None
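The callback contract above can be sketched with a simple early-stopping helper. This is a minimal illustration that assumes only what is documented here (the callback receives the synthesizer instance, the iteration number, and a dictionary of loss values, and returning True aborts training); make_early_stopping_callback is a hypothetical helper, not part of the library.

```python
def make_early_stopping_callback(patience=5, min_delta=1e-4):
    """Build a callback for learn() that aborts training once the total
    loss has failed to improve by min_delta for `patience` invocations."""
    state = {"best": float("inf"), "bad_rounds": 0}

    def callback(synthesizer, iteration, losses):
        total = sum(losses.values())
        if total < state["best"] - min_delta:
            state["best"] = total
            state["bad_rounds"] = 0
        else:
            state["bad_rounds"] += 1
        # Returning True aborts training, per the callback contract.
        return state["bad_rounds"] >= patience

    return callback
```

Such a callback would be passed as learn(df_train=df, callback=make_early_stopping_callback(), callback_freq=100), so it is invoked every 100 iterations.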

synthesize(num_rows, produce_nans=False, progress_callback=None, association_rules=None)

Generate the given number of new data rows.

Parameters
  • num_rows (int) – Number of rows to generate.

  • produce_nans (bool, optional) – Generate NaN values. Defaults to False.

  • progress_callback (Callable, optional) – Progress bar callback. Defaults to None.

  • association_rules (List[Association], optional) – Association rules to apply. Defaults to None.

Return type

DataFrame

Returns

The generated data.

synthesize_from_rules(num_rows, produce_nans=False, generic_rules=None, association_rules=None, expression_rules=None, max_iter=20, progress_callback=None)

Generate a given number of data rows according to specified rules.

Conditional sampling is used to generate a dataset that conforms to the given generic_rules. As a result, in some cases it may not be possible to generate num_rows of synthetic data if the original data contains a small number of samples where the rule is valid. Increasing max_iter may help in this situation.

Parameters
  • num_rows (int) – The number of rows to generate.

  • produce_nans (bool, optional) – Whether to produce NaNs. Defaults to False.

  • generic_rules (List[GenericRule], optional) – list of GenericRule rules the output must conform to. Defaults to None.

  • association_rules (List[Association], optional) – list of Association rules to constrain the output data. Defaults to None.

  • expression_rules (List[Expression], optional) – list of Expression rules to add to the output of the synthesizer. Defaults to None.

  • max_iter (int, optional) – maximum number of iterations to try to apply generic rules before raising an error. Defaults to 20.

  • progress_callback (Callable, optional) – Progress bar callback. Defaults to None.

Return type

DataFrame

Returns

The generated data.

Raises

RuntimeError – if num_rows of data can’t be generated within max_iter iterations.
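Because synthesize_from_rules raises RuntimeError when num_rows cannot be generated within max_iter iterations, callers with restrictive rules may want to retry with a larger budget. A minimal sketch, assuming only the behaviour documented above; synthesize_with_backoff is a hypothetical wrapper, not a library function.

```python
def synthesize_with_backoff(synthesizer, num_rows, generic_rules=None,
                            max_iter=20, retries=3):
    """Retry synthesize_from_rules, doubling max_iter on each failed attempt."""
    for attempt in range(retries):
        try:
            return synthesizer.synthesize_from_rules(
                num_rows=num_rows,
                generic_rules=generic_rules,
                max_iter=max_iter * (2 ** attempt),  # e.g. 20, 40, 80, ...
            )
        except RuntimeError:
            continue  # rules too restrictive for this budget; try again
    raise RuntimeError(
        f"could not generate {num_rows} rows after {retries} attempts"
    )
```

Doubling max_iter trades generation time for a higher chance of satisfying rare rules; if the original data simply contains too few valid samples, relaxing the rules is the only remedy.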