Data Rebalancing

The basic use of HighDimSynthesizer is to generate a synthetic version of a dataset that preserves its statistical properties, as described in the single table synthesis guide. With ConditionalSampler, the user can instead generate a new dataset with user-defined marginal distributions, while keeping the remaining statistical properties as close as possible to those of the original dataset.

This conditional sampling technique can be used in many situations, for example, to upsample rare events and improve a model’s predictive performance in highly imbalanced datasets, or to generate custom scenarios to validate proper system behaviour.

Note

A more extended analysis on the different applications of data rebalancing and augmentation using Synthesized can be obtained from our web page: Data Science Applications of the Synthesized Platform

Similarly, rules can be defined in Synthesized to generate data that corresponds to a custom scenario; see the rules user guide.

Conditional Sampling

The ConditionalSampler class allows its user to specify marginal distributions for certain columns. The ConditionalSampler then guarantees that the data that is generated obeys these distributions. This requires a HighDimSynthesizer instance that has already been trained on the desired dataset.

In [1]: from synthesized import ConditionalSampler

In [2]: sampler = ConditionalSampler(synthesizer)  # synthesizer is a trained HighDimSynthesizer instance

The conditions to generate are specified as a marginal distribution for each chosen column. A marginal takes the form of a list of tuples, each specifying a category and a probability. The probabilities in these tuples must sum to 1 to define a proper distribution. For example, consider a transaction dataset with a categorical transaction_flag field that has two categories, fraud and not-fraud, in which only 5% of the transactions are fraud. A machine learning model trained on a dataset like this could lead to unexpected results if the target imbalance is not treated carefully.

This problem can be addressed by upsampling the minority class to obtain a new dataset with 50% fraud and 50% not-fraud samples. To do so with ConditionalSampler, the desired marginal distribution is specified as a list with elements of the form (category[str], probability[float]):

In [3]: fraud_marginal = [("fraud", 0.5), ("not-fraud", 0.5)]

And then generate the new dataset with the previously initialized ConditionalSampler, specifying the column name transaction_flag:

In [4]: sampler.synthesize(
   ...:     num_rows=100,
   ...:     explicit_marginals={'transaction_flag': fraud_marginal}
   ...: )
   ...: 

Equivalently, the new dataset can be generated with the sample() method, which takes the same arguments:

In [5]: sampler.sample(
   ...:     num_rows=100,
   ...:     explicit_marginals={'transaction_flag': fraud_marginal}
   ...: )
   ...: 
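As a sanity check, the observed share of each category in the generated column can be compared with the requested marginal. The following is a minimal sketch, not part of the Synthesized API: empirical_marginal is a hypothetical helper, and the list generated_flags stands in for the generated transaction_flag column (assuming the result is a table from which that column can be read as a sequence of values).

```python
from collections import Counter


def empirical_marginal(values):
    """Return the observed share of each category in an iterable of values."""
    values = list(values)
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Stand-in data (hypothetical): in practice this would be the generated
# transaction_flag column converted to a list.
generated_flags = ["fraud"] * 48 + ["not-fraud"] * 52

requested = dict([("fraud", 0.5), ("not-fraud", 0.5)])
observed = empirical_marginal(generated_flags)

# Largest absolute gap between requested and observed shares.
max_gap = max(abs(observed.get(k, 0.0) - p) for k, p in requested.items())
print(observed, max_gap)
```

A small gap in a finite sample is expected from rounding alone; a large one would suggest that the marginal was not applied to the intended column.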

Note

It is also possible to generate only fraud transactions. The marginal specification would then be fraud_marginal = [("fraud", 1.0), ("not-fraud", 0.0)].

To specify conditions for continuous fields, the categories must be given as bin edges in the form "[low, high)", for example:

In [6]: age_marginal = [('[0.0, 50.0)', 0.3), ('[50.0, 100.0)', 0.7)]
   ...: sampler.synthesize(
   ...:     num_rows=100,
   ...:     explicit_marginals={'age': age_marginal}
   ...: )
   ...: 
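When there are many bins, writing the "[low, high)" labels by hand is error-prone. The sketch below builds such a marginal from a list of numeric edges. interval_marginal is a hypothetical helper, not a Synthesized function, and it simply mirrors the label format used in the example above; whether the library accepts other number formats is not covered here.

```python
def interval_marginal(edges, probabilities):
    """Pair consecutive bin edges with probabilities as "[low, high)" labels."""
    if len(edges) != len(probabilities) + 1:
        raise ValueError("need exactly one more edge than probabilities")
    labels = [
        f"[{float(low)}, {float(high)})"
        for low, high in zip(edges, edges[1:])
    ]
    return list(zip(labels, probabilities))


# Reproduces the marginal from the example above.
age_marginal = interval_marginal([0, 50, 100], [0.3, 0.7])
print(age_marginal)
```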

Once the marginal distributions are defined, they can be passed together to the explicit_marginals parameter of synthesize() (or sample()) to generate the desired data.

In [7]: sampler.synthesize(
   ...:     num_rows=100,
   ...:     explicit_marginals={
   ...:         'transaction_flag': fraud_marginal,
   ...:         'age': age_marginal
   ...:     }
   ...: )
   ...: 

Warning

It’s important to correctly define the explicit_marginals argument, otherwise the ConditionalSampler will raise a ValueError. This argument must be a dictionary of type Dict[column_name, marginal], where marginal has the format List[Tuple[value, float]]. Here value can be a str/float/int denoting the category/interval, and the float is the probability of that category/interval.

Additionally, all probabilities in a marginal must add up to 1.
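The format rules above can be checked before calling the sampler, so that a malformed specification fails early with a descriptive message. This is a minimal sketch under stated assumptions: check_explicit_marginals and its tolerance tol are hypothetical, and the checks only restate the rules described in this warning rather than reproduce the library's own validation.

```python
import math


def check_explicit_marginals(explicit_marginals, tol=1e-9):
    """Raise ValueError if the dictionary does not follow the documented format."""
    if not isinstance(explicit_marginals, dict):
        raise ValueError("explicit_marginals must be a dict of column -> marginal")
    for column, marginal in explicit_marginals.items():
        for entry in marginal:
            # Each entry must be a (value, probability) pair.
            if not (isinstance(entry, tuple) and len(entry) == 2):
                raise ValueError(f"{column}: expected (value, probability), got {entry!r}")
            _, prob = entry
            if isinstance(prob, bool) or not isinstance(prob, (int, float)):
                raise ValueError(f"{column}: probability must be a number, got {prob!r}")
            if prob < 0:
                raise ValueError(f"{column}: probability must be non-negative")
        total = sum(prob for _, prob in marginal)
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"{column}: probabilities sum to {total}, expected 1")


marginals = {
    "transaction_flag": [("fraud", 0.5), ("not-fraud", 0.5)],
    "age": [("[0.0, 50.0)", 0.3), ("[50.0, 100.0)", 0.7)],
}
check_explicit_marginals(marginals)  # passes silently
```

Comparing the sum with a tolerance rather than with == avoids spurious failures from floating-point rounding.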

Alter Distributions

With conditional sampling, the output dataset is fully synthetic and doesn’t contain any sample from the original dataset. It is also possible to alter the distributions of a given dataset instead, obtaining a new dataset that has a specific size, follows the desired marginal distributions, and contains a mix of synthesized and original data.

This is achieved with the ConditionalSampler.alter_distributions() method:

In [8]: sampler.alter_distributions(
   ...:     df=df_original,
   ...:     num_rows=1000,
   ...:     explicit_marginals={
   ...:         'transaction_flag': fraud_marginal,
   ...:         'age': age_marginal
   ...:     }
   ...: )
   ...:
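To reason about what such a call requests, it can help to translate a marginal and a row count into target counts per category and compare them with the counts in the original data. The sketch below does only this arithmetic. target_counts is a hypothetical helper, the original counts are made-up illustrative numbers, and nothing here describes how alter_distributions() decides which original rows to keep or how many synthetic rows it adds.

```python
def target_counts(marginal, num_rows):
    """Rows per category implied by a marginal, using largest-remainder rounding."""
    raw = [(value, prob * num_rows) for value, prob in marginal]
    counts = {value: int(x) for value, x in raw}
    # Hand out rows lost to flooring, largest fractional part first.
    shortfall = num_rows - sum(counts.values())
    by_remainder = sorted(raw, key=lambda item: item[1] - int(item[1]), reverse=True)
    for value, _ in by_remainder[:shortfall]:
        counts[value] += 1
    return counts


# Illustrative original dataset: 1000 rows, 5% fraud.
original = {"fraud": 50, "not-fraud": 950}
target = target_counts([("fraud", 0.5), ("not-fraud", 0.5)], num_rows=1000)

# Positive: rows of that category that would have to be added;
# negative: rows that would have to be left out.
difference = {k: target[k] - original.get(k, 0) for k in target}
print(target, difference)
```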