Synthesized is capable of generating new data points that match a specific condition, e.g (transaction_flag == fraud), in addition to generating data that follows a completely custom distribution. This conditional sampling can be used, for example, to generate data that follows a custom scenario, or upsample rare events to improve class imbalance or study outliers.

Alternatively, rules can be defined in Synthesized to generate data that corresponds to a custom scenario: See the rules user guide.

Conditional Sampling

In [1]: from synthesized import ConditionalSampler

Conditional sampling is achieved using the ConditionalSampler class. This requires a HighDimSynthesizer instance that has already been learned on the desired dataset.

In [2]: sampler = ConditionalSampler(synthesizer) # synthesizer is an HighDimSynthesizer instance

The desired conditions to generate are specified by a normalized marginal distribution for the desired columns. For example, consider a transaction dataset with a categorical transaction_flag field that has two categories: fraud, not-fraud. To generate a new dataset with 25% fraud and 75% not fraud, the desired marginal distribution can be specified as a dict with specification {category[str]: normalised_probablity[float]}

In [3]: fraud_marginal = {"fraud": 0.25, "not-fraud": 0.75}

Alternatively, to generate only fraud transactions the marginal specification would be

In [4]: fraud_marginal = {"fraud": 1.0, "not-fraud": 0.0}

To specify conditions for continuous fields, the categories must be given as bin edges of the form e.g "[low, high)" , e.g

In [5]: age_marginal = {'[0.0, 50.0)': 0.5, '[50.0, 100.0)': 0.5}

With the marginal distributions defined, they can be passed to the explicit_marginals parameter of synthesize() to generate the desired data.

In [6]: sampler.synthesize(
   ...:     num_rows=100,
   ...:     explicit_marginals={'transaction_flag': transaction_marginal, 'age': age_marginal}
   ...:     )