ConditionalSampler

synthesized.ConditionalSampler

class ConditionalSampler(synthesizer, min_sampled_ratio=0.001, synthesis_batch_size=65536)

Generate conditional data that corresponds to user-defined marginal distributions of specific columns.

Allows reshaping of the original data distribution by defining a custom marginal distribution for each column. This can be used, for instance, to upsample outlier data points or to correct class imbalance.

Parameters
  • synthesizer (Synthesizer) – Sample from this trained Synthesizer instance.

  • min_sampled_ratio (float, optional) – Stop synthesis if the ratio of successfully sampled records is less than this value. Defaults to 0.001.

  • synthesis_batch_size (int, optional) – Generate data in batches of this size. Defaults to 65536. Larger values may speed up the generation process.
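One way to picture how min_sampled_ratio and synthesis_batch_size could interact is a batched rejection loop: draw a batch, keep the rows that satisfy the requested condition, and give up if too few rows survive. The sketch below is only an illustration under that assumption, not the library's implementation; sample_with_rejection, draw_batch and accept are hypothetical names.

```python
import random

def sample_with_rejection(draw_batch, accept, num_rows,
                          batch_size=65536, min_sampled_ratio=0.001):
    """Collect num_rows accepted rows, drawing candidates in batches.

    Hypothetical helper: raises RuntimeError when the share of accepted
    rows in a batch drops below min_sampled_ratio.
    """
    accepted = []
    while len(accepted) < num_rows:
        batch = draw_batch(batch_size)
        kept = [row for row in batch if accept(row)]
        # Stop instead of looping forever when the condition is too rare.
        if len(kept) / len(batch) < min_sampled_ratio:
            raise RuntimeError("acceptance ratio fell below min_sampled_ratio")
        accepted.extend(kept)
    return accepted[:num_rows]

# Stand-in generator: uniform numbers in [0, 1) instead of synthetic records.
rng = random.Random(0)

def draw(n):
    return [rng.random() for _ in range(n)]

rows = sample_with_rejection(draw, lambda x: x < 0.25,
                             num_rows=100, batch_size=1000)
```

Under this reading, a larger batch size means fewer loop iterations, and a condition that is almost never met ends the loop early rather than running indefinitely.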

Example

Initialize from a HighDimSynthesizer instance:

>>> cond = ConditionalSampler(synthesizer)

Define new marginal distributions for the age and SeriousDlqin2yrs columns:

>>> marginals = {'SeriousDlqin2yrs': [('0', 0.3), ('1', 0.7)],
...              'age': [('[0.0, 50.0)', 0.5), ('[50.0, 100.0)', 0.5)]}

Synthesize 100 new rows that follow the defined marginal distributions:

>>> cond.synthesize(num_rows=100, explicit_marginals=marginals)
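Before passing a marginals dictionary to the sampler, it can help to sanity-check it. The sketch below assumes that the probabilities of each column are meant to be non-negative and to sum to 1, which the documentation does not state explicitly; validate_marginals is a hypothetical helper, not part of the library.

```python
import math

def validate_marginals(marginals, tol=1e-9):
    """Check each column's (value, probability) pairs.

    Assumption: probabilities are non-negative and sum to 1 per column.
    """
    for column, pairs in marginals.items():
        probs = [prob for _, prob in pairs]
        if any(prob < 0 for prob in probs):
            raise ValueError(f"negative probability in column {column!r}")
        if not math.isclose(sum(probs), 1.0, abs_tol=tol):
            raise ValueError(
                f"probabilities for column {column!r} sum to {sum(probs)}, not 1"
            )
    return True

marginals = {
    'SeriousDlqin2yrs': [('0', 0.3), ('1', 0.7)],
    'age': [('[0.0, 50.0)', 0.5), ('[50.0, 100.0)', 0.5)],
}
validate_marginals(marginals)
```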

Methods

alter_distributions(df, num_rows[, ...])

Given a DataFrame, drop and/or generate new samples so that the column distributions match user-specified marginal distributions.

fit(df_train, num_iterations[, callback, ...])

Train the generative model for the given iterations.

learn(df_train, num_iterations[, callback, ...])

Train the generative model for the given iterations.

sample(num_rows[, produce_nans, ...])

Generate the given number of new data rows according to user-defined marginal distributions.

synthesize(num_rows[, produce_nans, ...])

Generate the given number of new data rows according to user-defined marginal distributions.

alter_distributions(df, num_rows, produce_nans=False, explicit_marginals=None, association_rules=None, expression_rules=None, generic_rules=None, progress_callback=None)

Given a DataFrame, drop and/or generate new samples so that the column distributions match user-specified marginal distributions. Unlike the ConditionalSampler.synthesize() method, this keeps some of the original data, so the output is not purely synthetic.

Parameters
  • df (pd.DataFrame) – DataFrame of original data to modify.

  • num_rows (int) – The number of rows to generate.

  • produce_nans (bool) – Whether to produce NaNs. Defaults to False.

  • explicit_marginals (List[Dict[str, Dict[Union[str, int, float], float]]]) – Desired marginal distributions per column, defined as probability density per category or bin.

  • association_rules (List[Association]) – A list of association rules to apply to the data.

  • expression_rules (List[Expression]) – A list of expression rules to apply to the data.

  • generic_rules (List[GenericRule]) – A list of generic rules to apply to the data.

  • progress_callback (Callable, optional) – Progress bar callback. Defaults to None.

Return type

DataFrame

Returns

The generated data.

See also

ConditionalSampler.synthesize() : Generate synthetic data from user-specified marginal distributions.
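The "drop and/or generate" idea behind alter_distributions can be illustrated by comparing the current category counts with the counts implied by the target marginal. This is a rough sketch of the bookkeeping only, not the library's algorithm; plan_row_changes is a hypothetical helper and a plain list stands in for a DataFrame column.

```python
from collections import Counter

def plan_row_changes(values, target, num_rows):
    """For each category, work out how many original rows could be kept,
    how many would be dropped, and how many new rows would be needed.

    target is a list of (category, probability) pairs.
    """
    current = Counter(values)
    plan = {}
    for category, prob in target:
        wanted = round(prob * num_rows)
        have = current.get(category, 0)
        plan[category] = {
            'keep': min(have, wanted),
            'drop': max(have - wanted, 0),
            'generate': max(wanted - have, 0),
        }
    return plan

# An imbalanced column: 90 rows of class '0', 10 rows of class '1'.
original = ['0'] * 90 + ['1'] * 10
plan = plan_row_changes(original, [('0', 0.5), ('1', 0.5)], num_rows=100)
```

In this example the majority class loses 40 rows and the minority class gains 40 new ones, so part of the result is original data and part is generated.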

fit(df_train, num_iterations, callback=None, callback_freq=0)

Train the generative model for the given iterations.

Repeated calls continue training the model, possibly on different data.

Parameters
  • df_train (DataFrame) – The training data.

  • num_iterations (Optional[int]) – The number of training iterations (not epochs).

  • callback (Optional[Callable[[object, int, dict], bool]]) – A callback function, e.g. for logging purposes. Takes the synthesizer instance, the iteration number, and a dictionary of values (usually the losses) as arguments. Aborts training if the return value is True.

  • callback_freq (int) – Callback frequency.

Return type

None
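A callback matching the documented signature (synthesizer instance, iteration number, dictionary of values, returning True to abort) might look like the sketch below. The loss key 'total-loss', the reading of callback_freq as "call every N iterations", and the fake_training_loop stand-in are all assumptions made for illustration; the real training loop is inside the library.

```python
def make_early_stop_callback(loss_key, threshold, history):
    """Build a callback that records values and requests a stop once the
    chosen loss falls below threshold."""
    def callback(synthesizer, iteration, values):
        loss = values.get(loss_key)
        history.append((iteration, loss))
        # Returning True asks the training loop to abort.
        return loss is not None and loss < threshold
    return callback

def fake_training_loop(callback, callback_freq, losses):
    """Stand-in for training: invokes the callback every callback_freq
    iterations (never, if callback_freq is 0) and stops when it returns True."""
    for iteration, loss in enumerate(losses, start=1):
        if callback_freq and iteration % callback_freq == 0:
            if callback(None, iteration, {'total-loss': loss}):
                return iteration
    return len(losses)

history = []
cb = make_early_stop_callback('total-loss', 0.5, history)
stopped_at = fake_training_loop(
    cb, callback_freq=2, losses=[2.0, 1.5, 1.0, 0.8, 0.6, 0.4, 0.3]
)
```

With a real synthesizer, a function like cb would be passed as the callback argument of fit together with a non-zero callback_freq.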

learn(df_train, num_iterations, callback=None, callback_freq=0)

Train the generative model for the given iterations.

Repeated calls continue training the model, possibly on different data.

Parameters
  • df_train (DataFrame) – The training data.

  • num_iterations (Optional[int]) – The number of training iterations (not epochs).

  • callback (Optional[Callable[[object, int, dict], bool]]) – A callback function, e.g. for logging purposes. Takes the synthesizer instance, the iteration number, and a dictionary of values (usually the losses) as arguments. Aborts training if the return value is True.

  • callback_freq (int) – Callback frequency.

Return type

None

sample(num_rows, produce_nans=False, progress_callback=None, explicit_marginals=None, association_rules=None, expression_rules=None, generic_rules=None)

Generate the given number of new data rows according to user-defined marginal distributions.

Custom distributions for each column can be specified using a dictionary structure, e.g. {'binary_column_name': [(0, 0.5), (1, 0.5)]} would produce a dataset where the binary_column_name feature has a uniform distribution across the two categories 0 and 1.

For continuous features, the marginal distribution must be specified in terms of non-overlapping bins, e.g. {'continuous_column_name': [('[0.0, 50.0)', 0.5), ('[50.0, 100.0)', 0.5)]}. Each bin is defined by a string representation of its left and right edges. See the examples for details.
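The bin labels can be checked for the non-overlap requirement before sampling. The sketch below assumes the bracket notation denotes half-open intervals (left edge included, right edge excluded); parse_bin and bins_overlap are hypothetical helpers, not library functions.

```python
def parse_bin(label):
    """Parse a label such as '[0.0, 50.0)' into a (left, right) float pair."""
    text = label.strip()
    if not (text.startswith('[') and text.endswith(')')):
        raise ValueError(f"expected a '[left, right)' label, got {label!r}")
    left_text, right_text = text[1:-1].split(',')
    left, right = float(left_text), float(right_text)
    if left >= right:
        raise ValueError(f"left edge must be below right edge in {label!r}")
    return left, right

def bins_overlap(labels):
    """Return True if any two bins share more than a boundary point."""
    edges = sorted(parse_bin(label) for label in labels)
    return any(
        prev_right > next_left
        for (_, prev_right), (next_left, _) in zip(edges, edges[1:])
    )

ok = bins_overlap(['[0.0, 50.0)', '[50.0, 100.0)'])
bad = bins_overlap(['[0.0, 60.0)', '[50.0, 100.0)'])
```

Adjacent bins that meet at a shared edge, as in the first call, do not count as overlapping under the half-open interpretation.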

Parameters
  • num_rows (int) – The number of rows to generate.

  • produce_nans (bool) – Whether to produce NaNs. Defaults to False.

  • progress_callback (Callable, optional) – Progress bar callback.

  • explicit_marginals (Dict[str, Dict[str, float]], optional) – Desired marginal distributions per column, defined as probability density per category or bin. Defaults to None.

  • association_rules (List[Association]) – A list of association rules to apply to the data.

  • expression_rules (List[Expression]) – List of expression rules to apply to the data.

  • generic_rules (List[GenericRule]) – List of generic rules to apply to the data.

Return type

DataFrame

Returns

The generated data.

Examples

Correct the class balance of a column with a severe class imbalance by defining a marginal distribution with a uniform distribution over the categories:

>>> marginals = {'category': [('0', 0.5), ('1', 0.5)]}
>>> cond = ConditionalSampler(synthesizer)

Generate 100 rows of data in which there will be approximately a uniform distribution in the category column:

>>> cond.synthesize(num_rows=100, explicit_marginals=marginals)
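Because the result is only approximately uniform, it can be useful to measure how far the generated column is from the requested marginal. The sketch below uses a plain list as a stand-in for the generated column; empirical_marginal and max_deviation are hypothetical helpers, and the values list is assumed to be non-empty.

```python
from collections import Counter

def empirical_marginal(values):
    """Return the observed share of each category in values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

def max_deviation(values, target):
    """Largest absolute gap between observed and requested probabilities.

    target is a list of (category, probability) pairs.
    """
    observed = empirical_marginal(values)
    return max(
        abs(observed.get(category, 0.0) - prob) for category, prob in target
    )

# Stand-in for the 'category' column of a generated DataFrame.
generated = ['0'] * 48 + ['1'] * 52
deviation = max_deviation(generated, [('0', 0.5), ('1', 0.5)])
```

With a real DataFrame, the column values could be passed in as a list, for example via df['category'].tolist().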

See also

ConditionalSampler.alter_distributions() : Adjust distributions of the original data with user-specified marginal distributions.

synthesize(num_rows, produce_nans=False, progress_callback=None, explicit_marginals=None, association_rules=None, expression_rules=None, generic_rules=None)

Generate the given number of new data rows according to user-defined marginal distributions.

Custom distributions for each column can be specified using a dictionary structure, e.g. {'binary_column_name': [(0, 0.5), (1, 0.5)]} would produce a dataset where the binary_column_name feature has a uniform distribution across the two categories 0 and 1.

For continuous features, the marginal distribution must be specified in terms of non-overlapping bins, e.g. {'continuous_column_name': [('[0.0, 50.0)', 0.5), ('[50.0, 100.0)', 0.5)]}. Each bin is defined by a string representation of its left and right edges. See the examples for details.

Parameters
  • num_rows (int) – The number of rows to generate.

  • produce_nans (bool) – Whether to produce NaNs. Defaults to False.

  • progress_callback (Callable, optional) – Progress bar callback.

  • explicit_marginals (Dict[str, Dict[str, float]], optional) – Desired marginal distributions per column, defined as probability density per category or bin. Defaults to None.

  • association_rules (List[Association]) – A list of association rules to apply to the data.

  • expression_rules (List[Expression]) – List of expression rules to apply to the data.

  • generic_rules (List[GenericRule]) – List of generic rules to apply to the data.

Return type

DataFrame

Returns

The generated data.

Examples

Correct the class balance of a column with a severe class imbalance by defining a marginal distribution with a uniform distribution over the categories:

>>> marginals = {'category': [('0', 0.5), ('1', 0.5)]}
>>> cond = ConditionalSampler(synthesizer)

Generate 100 rows of data in which there will be approximately a uniform distribution in the category column:

>>> cond.synthesize(num_rows=100, explicit_marginals=marginals)

See also

ConditionalSampler.alter_distributions() : Adjust distributions of the original data with user-specified marginal distributions.