ConditionalSampler.synthesize

ConditionalSampler.synthesize(num_rows, produce_nans=False, progress_callback=None, explicit_marginals=None, association_rules=None, expression_rules=None, generic_rules=None)

Generate the given number of new data rows according to user-defined marginal distributions.

Custom distributions for each column can be specified using a dictionary structure, e.g. {'binary_column_name': [(0, 0.5), (1, 0.5)]} produces a dataset in which the binary_column_name feature is uniformly distributed across the two categories 0 and 1.
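The dictionary structure above can also be built programmatically. The following is a minimal sketch, not part of the library: uniform_marginal is a hypothetical helper name, and it only assumes the (category, probability) list format shown above.

```python
# Hypothetical helper (not part of the library): build a uniform marginal
# for one categorical column in the {column: [(category, probability), ...]}
# format described above.
def uniform_marginal(column, categories):
    categories = list(categories)
    if not categories:
        raise ValueError("at least one category is required")
    # Every category receives the same share of the probability mass.
    weight = 1.0 / len(categories)
    return {column: [(category, weight) for category in categories]}


marginals = uniform_marginal('binary_column_name', [0, 1])
```

The resulting dictionary has the same shape as the literal in the paragraph above and could be passed as explicit_marginals.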

For continuous features, the marginal distribution must be specified in terms of non-overlapping bins, e.g. {'continuous_column_name': [('[0.0, 50.0)', 0.5), ('[50.0, 100.0)', 0.5)]}. Each bin is defined by a string representation of its left and right edges. See the examples for details.
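Because bins are given as strings, it can help to check them before sampling. The sketch below is illustrative only and not part of the library: parse_bin and check_continuous_marginal are hypothetical names, the '[left, right)' label format is taken from the example above, and the requirement that probabilities sum to 1 is an assumption rather than documented behaviour.

```python
import math
import re

# Hypothetical helpers (not part of the library). The label format
# '[left, right)' is inferred from the example in the text above.
_BIN_PATTERN = re.compile(r"^\[\s*([^,\s]+)\s*,\s*([^)\s]+)\s*\)$")


def parse_bin(label):
    """Parse a label such as '[0.0, 50.0)' into numeric (left, right) edges."""
    match = _BIN_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a '[left, right)' bin label: {label!r}")
    left, right = float(match.group(1)), float(match.group(2))
    if not left < right:
        raise ValueError(f"left edge must be below right edge: {label!r}")
    return left, right


def check_continuous_marginal(bins):
    """Check that bins do not overlap; return their edges sorted by left edge."""
    edges = sorted(parse_bin(label) for label, _ in bins)
    # Half-open bins may touch (50.0 closes one and opens the next),
    # but a bin must not start before the previous one ends.
    for (_, prev_right), (next_left, _) in zip(edges, edges[1:]):
        if next_left < prev_right:
            raise ValueError("bins overlap")
    # Assumption: the probabilities are expected to sum to 1.
    total = sum(probability for _, probability in bins)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return edges


edges = check_continuous_marginal([('[0.0, 50.0)', 0.5), ('[50.0, 100.0)', 0.5)])
```

Running a check like this first turns a malformed bin label into an immediate, readable error rather than a failure during synthesis.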

Parameters
  • num_rows (int) – The number of rows to generate.

  • produce_nans (bool) – Whether to produce NaNs. Defaults to False.

  • progress_callback (Callable, optional) – Progress bar callback.

  • explicit_marginals (Dict[str, Dict[str, float]], optional) – Desired marginal distributions per column, defined as probability density per category or bin. Defaults to None.

  • association_rules (List[Association], optional) – List of association rules to apply to the data. Defaults to None.

  • expression_rules (List[Expression], optional) – List of expression rules to apply to the data. Defaults to None.

  • generic_rules (List[GenericRule], optional) – List of generic rules to apply to the data. Defaults to None.

Return type

DataFrame

Returns

The generated data.

Examples

Correct a severe class imbalance in a column by defining a uniform marginal distribution over its categories:

>>> marginals = {'category': [('0', 0.5), ('1', 0.5)]}
>>> cond = ConditionalSampler(synthesizer)

Generate 100 rows in which the category column is approximately uniformly distributed:

>>> cond.synthesize(num_rows=100, explicit_marginals=marginals)

See also

ConditionalSampler.alter_distributions():

Adjust distributions of the original data with user-specified marginal distributions.