Data Rebalancing
The basic use of HighDimSynthesizer is to generate a synthetic version of a
dataset that preserves its statistical properties, as described in the
Single Table Synthesis Guide. With ConditionalSampler, the user can generate a
new dataset with user-defined marginal distributions, while still keeping the
rest of the statistical properties as close as possible to those in the
original dataset.
This conditional sampling technique can be used in many situations, for example, to upsample rare events and improve a model’s predictive performance in highly imbalanced datasets, or to generate custom scenarios to validate proper system behaviour.
A more extended analysis of the different applications of data rebalancing and augmentation using Synthesized is available on our web page: Data Science Applications of the Synthesized Platform.
Similarly, rules can be defined in Synthesized to generate data that corresponds to a custom scenario; see our Rules Guide.
Conditional Sampling
The ConditionalSampler class allows the user to specify marginal
distributions for certain columns. The ConditionalSampler then guarantees
that the generated data obeys these distributions. This requires a
HighDimSynthesizer instance that has already been trained on the desired
dataset.
from synthesized import ConditionalSampler
sampler = ConditionalSampler(synthesizer)  # (1)
(1) synthesizer is a HighDimSynthesizer instance
The desired conditions are specified as a marginal distribution for each
chosen column. This takes the form of a list of tuples, each specifying a
category and a probability. The probabilities in these tuples must sum to 1
to define a proper distribution. For example, consider a transaction dataset
with a categorical transaction_flag field that has two categories, fraud and
not-fraud, and in which only 5% of the transactions are fraud. A machine
learning model trained on a dataset like this could lead to unexpected
results if the target imbalance is not treated carefully.
This problem can be addressed by upsampling the minority class to obtain a
new dataset with 50% fraud and 50% not-fraud samples. To do so with
ConditionalSampler, the desired marginal distribution can be specified as a
list with elements of the form (category[str], probability[float]).
fraud_marginal = [("fraud", 0.5), ("not-fraud", 0.5)]
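Because the probabilities must sum to 1, it can help to check a marginal before handing it to the sampler. The following is a minimal sketch in plain Python; validate_marginal is a hypothetical helper written for this illustration and is not part of the Synthesized SDK.

```python
import math


def validate_marginal(marginal, tol=1e-9):
    """Check that a list of (category, probability) tuples is a proper distribution.

    Hypothetical helper for illustration; not part of the Synthesized SDK.
    """
    categories = [category for category, _ in marginal]
    probabilities = [probability for _, probability in marginal]

    # Each category should appear only once.
    if len(set(categories)) != len(categories):
        raise ValueError("duplicate categories in marginal")
    # Every probability must lie in [0, 1].
    if any(p < 0 or p > 1 for p in probabilities):
        raise ValueError("probabilities must be between 0 and 1")
    # The probabilities must sum to 1, up to floating-point tolerance.
    if not math.isclose(sum(probabilities), 1.0, abs_tol=tol):
        raise ValueError(f"probabilities sum to {sum(probabilities)}, expected 1")
    return True


fraud_marginal = [("fraud", 0.5), ("not-fraud", 0.5)]
validate_marginal(fraud_marginal)
```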
Fraud columns are often simply booleans with values of 1 and 0 to represent |
The new dataset is then generated with the previously initialized
ConditionalSampler, specifying the column name transaction_flag:
sampler.synthesize(
    num_rows=100,
    explicit_marginals={'transaction_flag': fraud_marginal}
)
The new dataset can also be generated with ConditionalSampler.sample:
sampler.sample(
    num_rows=100,
    explicit_marginals={'transaction_flag': fraud_marginal}
)
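After generating data, the resulting proportions can be compared with the requested marginal. The sketch below uses only the standard library and assumes the generated column is available as a plain list of values (for example, list(df['transaction_flag']) if the output is a pandas DataFrame, which is an assumption here). The helper names and the stand-in list generated_flags are hypothetical.

```python
from collections import Counter


def empirical_marginal(values):
    """Return the observed proportion of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def max_deviation(values, marginal):
    """Largest absolute gap between observed and requested probabilities.

    Only categories listed in the marginal are compared.
    """
    observed = empirical_marginal(values)
    return max(abs(observed.get(category, 0.0) - probability)
               for category, probability in marginal)


fraud_marginal = [("fraud", 0.5), ("not-fraud", 0.5)]

# Stand-in for a generated transaction_flag column (hypothetical data).
generated_flags = ["fraud"] * 50 + ["not-fraud"] * 50

deviation = max_deviation(generated_flags, fraud_marginal)
print(deviation)
```

A deviation close to zero indicates that the generated column follows the requested marginal.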
It is also possible to generate only |
To specify conditions for continuous fields, the categories must be given as
bin edges of the form "[low, high)", e.g.,
age_marginal = [("[0, 19)", 0.0), ("[19, 26)", 0.2), ("[26, 36)", 0.2), ("[36, 46)", 0.6)]
sampler.synthesize(
    num_rows=100,
    explicit_marginals={'age': age_marginal}
)
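Writing the "[low, high)" labels by hand is error-prone, so they can be built from a list of numeric edges. This is a minimal sketch; bin_labels is a hypothetical helper, and the exact number formatting the library expects inside the label (for example integers versus floats) is an assumption that should be checked against your own data.

```python
def bin_labels(edges):
    """Turn strictly increasing numeric edges into '[low, high)' labels.

    Hypothetical helper for illustration; not part of the Synthesized SDK.
    """
    pairs = list(zip(edges, edges[1:]))
    if any(low >= high for low, high in pairs):
        raise ValueError("edges must be strictly increasing")
    return [f"[{low}, {high})" for low, high in pairs]


# Integer ages 0 to 45 split into four half-open bins.
edges = [0, 19, 26, 36, 46]
probabilities = [0.0, 0.2, 0.2, 0.6]

# Pair each bin label with its probability to form the marginal.
age_marginal = list(zip(bin_labels(edges), probabilities))
print(age_marginal)
```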
Once defined, the marginal distributions can be passed to the
explicit_marginals parameter of ConditionalSampler.synthesize (or
ConditionalSampler.sample) to generate the desired data. Several columns can
be conditioned in the same call:
sampler.synthesize(
    num_rows=100,
    explicit_marginals={
        'transaction_flag': fraud_marginal,
        'age': age_marginal
    }
)
It’s important to correctly define the Additionally, all values in a |
Alter Distributions
With conditional sampling, the output dataset is fully synthetic and doesn’t contain any sample from the original dataset. It is also possible to alter the distributions of an existing dataset, obtaining a new dataset of a specified size that follows the desired marginal distributions and contains a mix of synthetic and original data.
This is achieved with the ConditionalSampler.alter_distributions() method:
sampler.alter_distributions(
    df=df_original,
    num_rows=1000,
    explicit_marginals={
        'transaction_flag': fraud_marginal,
        'age': age_marginal
    }
)
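To get a feel for what a target size and marginal imply, the arithmetic can be worked out by hand. The sketch below is only an illustration of the row counts involved and makes no claim about how alter_distributions works internally; the helper names and the stand-in data are hypothetical.

```python
from collections import Counter


def target_counts(marginal, num_rows):
    """Rows per category implied by a marginal and a target size.

    Uses simple rounding, so the counts may not sum exactly to num_rows
    for probabilities that do not divide the size evenly.
    """
    return {category: round(probability * num_rows)
            for category, probability in marginal}


def row_gap(original_values, marginal, num_rows):
    """Target count minus original count for each category.

    Positive values mean rows are missing from the original data,
    negative values mean the original data has more rows than needed.
    """
    have = Counter(original_values)
    want = target_counts(marginal, num_rows)
    return {category: want[category] - have.get(category, 0) for category in want}


fraud_marginal = [("fraud", 0.5), ("not-fraud", 0.5)]

# Stand-in for an original column with 5% fraud (hypothetical data).
original_flags = ["fraud"] * 50 + ["not-fraud"] * 950

gap = row_gap(original_flags, fraud_marginal, num_rows=1000)
print(gap)  # {'fraud': 450, 'not-fraud': -450}
```

In this example, reaching a 50/50 split at 1000 rows needs 450 more fraud rows than the original data contains, while 450 of the original not-fraud rows are surplus.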
For a full end-to-end example of how the ConditionalSampler can be used when
dealing with minority datasets, see the Rebalancing tutorial.