Rebalancing
The full source code for this example is available for download here.
In this tutorial we will demonstrate how to alter the distributions of a highly imbalanced dataset, using the technique of data rebalancing, in order to improve the performance of a classification model on an underrepresented group.
For more information on the techniques used in this tutorial, and an in-depth discussion on reducing data bias using these techniques, see our blog post or the documentation.
Credit Dataset
In this tutorial we will use a public credit scoring dataset from Kaggle, also available with the synthesized_datasets package:
import synthesized_datasets
import pandas as pd
df_orig = synthesized_datasets.CREDIT.credit.load()
df_orig
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age ... NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 0.766127 45 ... 0 2.0
1 0 0.957151 40 ... 0 1.0
2 0 0.658180 38 ... 0 0.0
3 0 0.233810 30 ... 0 0.0
4 0 0.907239 49 ... 0 0.0
... ... ... ... ... ... ...
149995 0 0.040674 74 ... 0 0.0
149996 0 0.299745 44 ... 0 2.0
149997 0 0.246044 58 ... 0 0.0
149998 0 0.000000 30 ... 0 0.0
149999 0 0.850283 64 ... 0 0.0
[150000 rows x 11 columns]
The binary classification column "SeriousDlqin2yrs", denoting whether someone has defaulted on a loan within the last 2 years, will be the target variable while the remaining columns will be explanatory variables that will be used to train a classification model.
y_label = "SeriousDlqin2yrs"
x_labels = [col for col in df_orig.columns if col != y_label]
The target column is highly skewed, resulting in a highly imbalanced dataset: as the counts below show, only about 6.7% of the rows belong to the positive class.
value_counts = df_orig[y_label].value_counts()
value_counts
>>> 0 139974
>>> 1 10026
>>> Name: SeriousDlqin2yrs, dtype: int64
pd.cut(df_orig[y_label],
       bins=[-0.5, 0.75, 1],
       labels=['0', '1'])\
    .value_counts(sort=False).plot.bar()
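To put a number on the imbalance, the short sketch below recomputes the class shares and the majority-to-minority ratio from the counts printed above. It uses only the standard library and hard-codes those two counts, so it can be run without loading the dataset.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 139974, 1: 10026}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Number of majority-class rows for every minority-class row.
imbalance_ratio = class_counts[0] / class_counts[1]

for label, share in shares.items():
    print(f"class {label}: {share:.2%}")
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

Roughly one row in fifteen belongs to the positive class, which is the class the model is meant to detect.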

Training a Linear Classification Model
In the following, a linear RidgeClassifier model will be used to try to predict the value of the target variable (SeriousDlqin2yrs, i.e. whether a person will default) using the remaining columns as explanatory variables.
A train-test split will be used to evaluate the performance of the model on that task.
The preprocess() convenience function prepares the data used to train the RidgeClassifier model, using a fitted instance of the ModellingPreprocessor class available in the Synthesized SDK. The preprocessing label or one-hot encodes the categorical columns and transforms the continuous columns using a StandardScaler.
from synthesized.insight.modelling import ModellingPreprocessor
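def preprocess(
    df: pd.DataFrame,
    preprocessor: ModellingPreprocessor
):
    # Apply the fitted preprocessor (encoding and scaling) to the data
    df_processed = preprocessor.transform(df)
    # Separate the target column from the explanatory variables
    y = df_processed.pop(preprocessor.target).to_numpy()
    x = df_processed.to_numpy()
    return x, y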
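To make the two transformations concrete, here is a minimal standard-library sketch of standard scaling and one-hot encoding. It is not the ModellingPreprocessor implementation, only an illustration of the kind of output such preprocessing produces. The helper names and the `housing` column are hypothetical, and the credit dataset does not contain that column.

```python
from statistics import mean, pstdev


def standardise(values):
    """Scale values to zero mean and unit variance (population std, as StandardScaler uses)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return one 0/1 indicator per distinct category, with categories in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# First five values of the age column shown in the table above.
ages = [45, 40, 38, 30, 49]
scaled_ages = standardise(ages)

# Hypothetical categorical column, purely for illustration.
housing = ["rent", "own", "rent", "mortgage", "own"]
encoded = one_hot(housing)
```

After scaling, the values have mean 0 and standard deviation 1, which keeps features with large numeric ranges from dominating the penalty term of a regularised linear model such as a ridge classifier.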
In the first instance, the RidgeClassifier will be trained on the original data:
from sklearn.model_selection import train_test_split
test_size = 0.2
df_train, df_test = train_test_split(
    df_orig,
    test_size=test_size,
    stratify=df_orig[y_label],
    random_state=42,
)
preprocessor = ModellingPreprocessor(target=y_label)
preprocessor.fit(df_orig)
x_train, y_train = preprocess(df_train, preprocessor)
x_test, y_test = preprocess(df_test, preprocessor)
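The stratify argument matters here because the positive class is so rare: a purely random split could leave the test set with noticeably more or fewer defaults than the training set. The sketch below illustrates the idea with a hand-written stratified split on toy labels. It is a simplified stand-in, not scikit-learn's implementation, and the function name is hypothetical.

```python
import random


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels with the same 14:1 imbalance as the credit data (illustrative only).
labels = [0] * 1400 + [1] * 100
train_idx, test_idx = stratified_split(labels, test_size=0.2)

train_positive_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both subsets end up with the same share of positive rows, so the test set reflects the class balance the model will face in practice.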
The RidgeClassifier model will be fitted using the train subset of the data. The ability of the model to classify unseen data will then be evaluated using the test subset:
from sklearn.linear_model import RidgeClassifier
orig_classifier = RidgeClassifier()
orig_classifier.fit(x_train, y_train)
y_predict = orig_classifier.predict(x_test)
The area under the ROC-curve (AUC-ROC) is used as a means to evaluate the quality of the model in classifying unseen test data. The value of the AUC-ROC varies between 0 and 1, with 1 implying perfect separability between the two classes and 0 implying the exact opposite, i.e. the model is predicting 0’s as 1’s and 1’s as 0’s in our case. A value of 0.5 means that the model hasn’t learnt any difference between the classes at all and has no predictive capacity.
from sklearn.metrics import roc_auc_score
orig_roc_auc = roc_auc_score(y_test, y_predict)
orig_roc_auc
>>> 0.5703505568994107
The value of ~0.57 implies that the model has some predictive capacity, but not much better than a random guess.
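One detail worth keeping in mind when reading this number: the code above passes the output of predict(), i.e. hard 0/1 labels, to roc_auc_score. With hard labels the AUC-ROC reduces to the mean of the true positive rate and the true negative rate (the balanced accuracy). The sketch below demonstrates this on toy data with a small pure-Python AUC function. The data and the function are illustrative assumptions, not part of the tutorial's pipeline. RidgeClassifier also exposes decision_function(), whose continuous scores could be passed instead, in which case the value would generally differ from the ones reported here.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the sample size, fine for small examples.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
scores = [-1.2, -0.4, 0.3, -0.9, -0.5, 0.8]       # continuous decision scores
hard_labels = [1 if s > 0 else 0 for s in scores]  # thresholded predictions

auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, hard_labels)

# With hard labels, the AUC equals the mean of sensitivity and specificity.
tpr = sum(1 for t, l in zip(y_true, hard_labels) if t == 1 and l == 1) / y_true.count(1)
tnr = sum(1 for t, l in zip(y_true, hard_labels) if t == 0 and l == 0) / y_true.count(0)
balanced_accuracy = (tpr + tnr) / 2
```

Because the same label-based metric is used for every model in this tutorial, the comparisons between models remain like for like.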
Using Rebalanced Synthetic Data
The performance of the linear RidgeClassifier model in the previous section was, on the whole, pretty poor. The substandard results can be traced back to the data used to train the model in the first place - the highly imbalanced nature of the dataset means that there is only a very faint signal from the target variable.
A naive solution to this problem is to create a new dataset by oversampling the minority class, undersampling the majority class, or both, in order to achieve the desired distribution of classes. However, there are clear drawbacks to this approach: undersampling discards rows, so the new dataset may be significantly smaller than the original, while oversampling only duplicates existing rows and adds no new information. With undersampling, the issue is transformed from not having enough quality data to potentially not having enough data at all.
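The trade-off can be seen in a few lines of standard-library Python. The sketch below is a hypothetical helper operating on toy rows, not part of the Synthesized SDK. It forces every class to a chosen size by random undersampling or by oversampling with replacement.

```python
import random


def naive_rebalance(rows, label_of, n_per_class, seed=0):
    """Undersample or oversample every class to exactly n_per_class rows."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    balanced = []
    for members in by_class.values():
        if len(members) >= n_per_class:
            # Undersampling: keep a random subset and discard the rest.
            balanced.extend(rng.sample(members, n_per_class))
        else:
            # Oversampling: draw with replacement, so rows are duplicated.
            balanced.extend(rng.choices(members, k=n_per_class))
    rng.shuffle(balanced)
    return balanced


# Toy rows of (row_id, label) with a 14:1 imbalance, for illustration only.
rows = [(i, 0) for i in range(1400)] + [(i, 1) for i in range(1400, 1500)]

undersampled = naive_rebalance(rows, label_of=lambda r: r[1], n_per_class=100)
oversampled = naive_rebalance(rows, label_of=lambda r: r[1], n_per_class=1400)

unique_minority_rows = len({r for r in oversampled if r[1] == 1})
```

Undersampling to the minority size leaves 200 of the 1,500 toy rows, while oversampling to the majority size yields 2,800 rows that still contain at most 100 distinct minority examples.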
More advanced methods, such as SMOTE, create entirely new data points for the minority class to augment the original dataset. A downside of SMOTE is that it interpolates between neighbouring points without modelling the statistics of the original data, meaning that correlations between variables can be lost, degrading model performance.
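For intuition, the following is a simplified SMOTE-style sketch in pure Python: each new point is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours. It is an illustration under simplifying assumptions (numeric features only, brute-force neighbour search, a hypothetical function name). Library implementations such as the one in imbalanced-learn differ in detail.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature minority class, for illustration only.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1)]
new_points = smote_like(minority, n_new=10)
```

Every generated point lies between two existing minority points, so the method relies on local geometry rather than on a learned model of how the columns relate to one another.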
Alternatively, the deep generative models utilised in the HighDimSynthesizer can be used to learn the statistical properties and correlations present in the original data and synthesize a dataset containing columns adhering to user-defined distributions.
To generate the synthetic data we first create a HighDimSynthesizer instance using the metadata extracted from the original dataset:
from synthesized import HighDimSynthesizer, MetaExtractor
df_meta = MetaExtractor.extract(df_orig)
synth = HighDimSynthesizer(df_meta)
The HighDimSynthesizer instance is then trained using the train subset of the original that was defined above - the test subset is held back to prevent any possible data leaks.
synth.learn(df_train)
Rather than simply calling the HighDimSynthesizer.synthesize() method, the ConditionalSampler class can be used to generate completely new, synthetic data where the proportions of the two classes in the target variable have been rebalanced to occur in equal proportions.
An instance of the ConditionalSampler can be created by passing in a trained instance of the HighDimSynthesizer.
To generate rebalanced data, the desired distributions of the classes in the target column are specified using the explicit_marginals argument of the sample() method:
from synthesized import ConditionalSampler
sampler = ConditionalSampler(synth)
explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}
df_synth = sampler.sample(
    num_rows=len(df_train),
    explicit_marginals=explicit_marginals,
    max_trials=40
)
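As a rough guide to what this request asks for, the sketch below turns a marginal specification of the form used above into target row counts per class. The helper is hypothetical and not part of the Synthesized SDK. It assumes that each entry is a (value, probability) pair and that the probabilities are meant to sum to 1.

```python
import math


def expected_class_counts(marginals, num_rows):
    """Hypothetical helper: validate a marginal spec and return target rows per class."""
    probabilities = [p for _, p in marginals]
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return {value: round(p * num_rows) for value, p in marginals}


y_label = "SeriousDlqin2yrs"
explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}

# 80% of the 150,000 original rows, matching test_size = 0.2 above.
num_rows = 120_000
targets = expected_class_counts(explicit_marginals[y_label], num_rows)
```

With a 50/50 specification and 120,000 rows, the target is 60,000 rows per class, compared with roughly 8,000 positive rows in the original training subset.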
As a quick sanity check, and to verify that the distribution of the target column has indeed been rebalanced, the Assessor class can be used to visually inspect the distributions of the continuous and categorical variables in the synthetic data compared to the original.
from synthesized.testing import Assessor
assessor = Assessor(df_meta)
assessor.show_distributions(df_train, df_synth);

As demonstrated in the distributions, the target variable has been rebalanced such that the classes now appear in a 50/50 split.
Due to the synthetic dataset having a greater proportion of the minority class than the original, some distributions have been changed. For example, the peak of the age distribution has been shifted to the left, implying that younger individuals may be more likely to default.
The training of a RidgeClassifier model can now be conducted using the rebalanced synthetic dataset as the training data. For fairness of comparison, the same test dataset will be used to evaluate the performance of this model as was used when evaluating the performance of the model trained with original data. It is important to note that synthetic data should never be used as test data and should only be used when training the model of interest.
preprocessor = ModellingPreprocessor(target=y_label)
preprocessor.fit(pd.concat([df_orig, df_synth]))
x_synth, y_synth = preprocess(df_synth, preprocessor)
synth_classifier = RidgeClassifier()
synth_classifier.fit(x_synth, y_synth)
y_predict = synth_classifier.predict(x_test)
Using the AUC-ROC as a metric of model performance, the model trained with rebalanced data demonstrates a substantial improvement in distinguishing the two classes over the model trained with original data:
synth_roc_auc = roc_auc_score(y_test, y_predict)
synth_roc_auc
>>> 0.7306853423683158
synth_roc_auc / orig_roc_auc
>>> 1.2811162074435958
In general, if a new set of synthetic data were generated using the same ConditionalSampler instance, and a linear model trained with the resulting dataset, we would observe that the AUC-ROC fluctuates around a mean value. This non-deterministic behaviour is entirely deliberate and is due to the careful injection of noise at various stages of the HighDimSynthesizer training in order to ensure that there is no one-to-one mapping between any row in the synthetic data and any row in the original. Data anonymization is a key benefit of synthetic data over traditional techniques. For more information on synthetic data privacy, see the documentation.
While we would expect the precise value of synth_roc_auc to fluctuate around a mean, in this specific example the linear classifier model trained using synthetic data demonstrates a 25-30% improvement over the same model trained with original data (a ratio of about 1.28 in the run shown above).
Bootstrapping Original Data
In the above, we used rebalanced, purely synthetic data to train a linear classification model. However, the ConditionalSampler offers an alternative means to generate rebalanced data by augmenting the original data with synthetic data composed of the minority class using the alter_distributions() method.
The same ConditionalSampler instance created above will be utilised, but now to generate a DataFrame containing a mixture of real and synthetic data in the correct proportions such that we achieve the desired distribution of values.
When creating such a dataset, care must be taken when performing the train/test split. As mentioned above, it is important to ensure that no synthetic data is in the test portion (as it is desired to evaluate the performance of the linear classifier rather than the HighDimSynthesizer), so only the distributions of the train dataset should be altered.
explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}
df_altered = sampler.alter_distributions(
    df_train,
    num_rows=len(df_train),
    explicit_marginals=explicit_marginals,
    max_trials=40
)
x_altered, y_altered = preprocess(df_altered, preprocessor)
altered_classifier = RidgeClassifier()
altered_classifier.fit(x_altered, y_altered)
y_predict = altered_classifier.predict(x_test)
altered_roc_auc = roc_auc_score(y_test, y_predict)
altered_roc_auc
>>> 0.7356642328809161
Training on bootstrapped data yields an improvement in the performance of the linear classification model very similar to that obtained with purely synthetic data.
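To compare the three runs side by side, the sketch below recomputes the relative gain over the baseline from the AUC-ROC values printed in this tutorial. The numbers are copied from the outputs above, and a fresh run would produce somewhat different values because the synthetic data is generated non-deterministically.

```python
# AUC-ROC values copied from the outputs reported in this tutorial.
results = {
    "original": 0.5703505568994107,
    "synthetic (rebalanced)": 0.7306853423683158,
    "bootstrapped (altered)": 0.7356642328809161,
}

baseline = results["original"]
relative_gain = {
    name: (auc - baseline) / baseline for name, auc in results.items()
}

for name, auc in results.items():
    print(f"{name:<24} AUC-ROC = {auc:.4f}  gain = {relative_gain[name]:+.1%}")
```

In this run, the two rebalanced training sets improve on the baseline by roughly 28% and 29% respectively.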