Rebalancing

The full source code for this example is available for download here.

Prerequisites

This tutorial assumes that you have already installed the Synthesized package and have an understanding of how to use the tabular synthesizer. If you are new to Synthesized, we recommend starting with the quickstart guide and/or the single table synthesis tutorial before jumping into this one.

Introduction

In this tutorial we will demonstrate how the SDK can be used to alter the distributions of a dataset. This is useful when data needs to be reshaped for a specific purpose. For example, a dataset may have an extremely imbalanced target column: a classification model trained to predict the value of that column may perform poorly precisely because of the imbalance. Using the technique of data rebalancing, the dataset can be reshaped to improve the performance of the classification model.

In this tutorial we will walk through a concrete example where data rebalancing is used to improve the performance of a classification model trained on an initially highly imbalanced dataset.

For more information on the techniques used in this tutorial, and an in-depth discussion on reducing data bias using these techniques, see our blog post or the documentation.

Credit Dataset

In this tutorial we will use a public credit scoring dataset from Kaggle, also available with the synthesized_datasets package:

import synthesized_datasets
import pandas as pd

# load the credit scoring dataset bundled with the synthesized_datasets package
df_orig = synthesized_datasets.CREDIT.credit.load()
df_orig
        SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  ...  NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents
0                      1                              0.766127   45  ...                                     0                 2.0
1                      0                              0.957151   40  ...                                     0                 1.0
2                      0                              0.658180   38  ...                                     0                 0.0
3                      0                              0.233810   30  ...                                     0                 0.0
4                      0                              0.907239   49  ...                                     0                 0.0
...                  ...                                   ...  ...  ...                                   ...                 ...
149995                 0                              0.040674   74  ...                                     0                 0.0
149996                 0                              0.299745   44  ...                                     0                 2.0
149997                 0                              0.246044   58  ...                                     0                 0.0
149998                 0                              0.000000   30  ...                                     0                 0.0
149999                 0                              0.850283   64  ...                                     0                 0.0

[150000 rows x 11 columns]

The binary classification column "SeriousDlqin2yrs", denoting whether someone has defaulted on a loan within the last 2 years, will be the target variable. The remaining columns will be used as explanatory variables when training a classification model.

y_label = "SeriousDlqin2yrs"
x_labels = [col for col in df_orig.columns if col != y_label]

The target column is highly imbalanced, as can be seen from its value counts:

value_counts = df_orig[y_label].value_counts()
value_counts
0    139974
1     10026
Name: SeriousDlqin2yrs, dtype: int64

Plotting the value counts of the target column shows the imbalance more clearly:

pd.cut(df_orig[y_label],
       bins=[-0.5, 0.75, 1],
       labels=['0', '1'])\
  .value_counts(sort=False).plot.bar()

Figure: Skewed credit dataset

We see that approximately 93% of the rows in the dataset have a value of 0 for the target column, and only 7% have a value of 1.
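These proportions can be confirmed directly with standard pandas (this snippet is not part of the original walkthrough):

df_orig[y_label].value_counts(normalize=True)
0    0.93316
1    0.06684
Name: SeriousDlqin2yrs, dtype: float64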

Training a Linear Classification Model

In the following, a linear RidgeClassifier model will be used to predict the value of the target variable SeriousDlqin2yrs, with the remaining columns used as explanatory variables. A train-test split will be used to evaluate the performance of the model on this task.

Before fitting the RidgeClassifier model, some preprocessing needs to be applied to the dataset. The SDK offers the ModellingPreprocessor utility, which is wrapped here in a small preprocess() convenience function. The preprocessing will label-encode or one-hot encode the categorical columns, and transform the continuous columns using a StandardScaler.

from synthesized.insight.modelling import ModellingPreprocessor

def preprocess(
    df: pd.DataFrame,
    preprocessor: ModellingPreprocessor
):
    """Encode a DataFrame with a fitted preprocessor and split it into feature and target arrays."""
    df_processed = preprocessor.transform(df)
    # pop the target column so that the remaining columns form the feature matrix
    y = df_processed.pop(preprocessor.target).to_numpy()
    x = df_processed.to_numpy()
    return x, y

In the first instance, the RidgeClassifier will be trained on the original data:

from sklearn.model_selection import train_test_split

test_size = 0.2

df_train, df_test = train_test_split(
    df_orig,
    test_size=test_size,
    stratify=df_orig[y_label],
    random_state=42,
)

preprocessor = ModellingPreprocessor(target=y_label)
preprocessor.fit(df_orig)

x_train, y_train = preprocess(df_train, preprocessor)
x_test, y_test = preprocess(df_test, preprocessor)

The RidgeClassifier model will be fitted using the train subset of the data:

from sklearn.linear_model import RidgeClassifier
orig_classifier = RidgeClassifier()
orig_classifier.fit(x_train, y_train)

The ability of the model to classify unseen data will then be evaluated using the test subset:

y_predict = orig_classifier.predict(x_test)

The area under the ROC curve (ROC-AUC) is used to evaluate the performance of the model in classifying the unseen test data. The ROC-AUC varies between 0 and 1, with 1 implying perfect separability between the two classes, and 0 implying the model predicts exactly the wrong class for every row. A value of 0.5 means that the model hasn't learnt any difference between the classes, and has no predictive capacity.

from sklearn.metrics import roc_auc_score
orig_roc_auc = roc_auc_score(y_test, y_predict)
orig_roc_auc
0.5703505568994107

The value of ~0.57 implies that the model has some predictive capacity, but not much better than a random guess.
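As an aside, predict() returns hard class labels. RidgeClassifier also exposes a continuous decision_function() score, and passing that to roc_auc_score gives a threshold-free AUC; this is a minor variation on the evaluation above, not a step from the original walkthrough:

# use continuous decision scores rather than hard 0/1 predictions
y_scores = orig_classifier.decision_function(x_test)
roc_auc_score(y_test, y_scores)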

Using Rebalanced Synthetic Data

The performance of the linear RidgeClassifier model in the previous section was, on the whole, pretty poor. The reason can be traced back to the data used to train the model: the severe imbalance in the target column means that class 1 provides only a very faint signal.

A naive solution to this problem is to create a new dataset by oversampling data from the minority class and undersampling from the majority in order to achieve the desired distribution of classes. However, there is a very clear drawback to this method in that the new dataset may be significantly smaller than the original. Using this traditional technique, the issue has been transformed from not having enough quality data to potentially not having enough data at all.
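To make this drawback concrete, here is a minimal sketch of naive undersampling using plain pandas (illustrative only, not part of the SDK). Shrinking the majority class down to the size of the minority class leaves roughly 16,000 of the original 120,000 training rows:

# naive undersampling: match the majority class size to the minority class
minority = df_train[df_train[y_label] == 1]
majority = df_train[df_train[y_label] == 0].sample(len(minority), random_state=42)
df_naive = pd.concat([minority, majority]).sample(frac=1, random_state=42)  # shuffle
len(df_naive)  # roughly 16,000 rows, down from 120,000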

More advanced methods, such as SMOTE, create entirely new data points for the minority class with which to augment the original dataset. The downside of SMOTE is that it has no understanding of the statistics of the original data, meaning that the correlations between variables are lost, degrading model performance.
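For reference, SMOTE is implemented in the third-party imbalanced-learn package (assumed installed here; it is not part of the Synthesized SDK) and can be applied to the preprocessed feature matrix:

from imblearn.over_sampling import SMOTE

# oversample the minority class by interpolating between neighbouring minority points
x_smote, y_smote = SMOTE(random_state=42).fit_resample(x_train, y_train)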

Alternatively, the deep generative models utilised in the HighDimSynthesizer can be used to learn the statistical properties and correlations present in the original data and synthesize a dataset containing columns adhering to user defined distributions.

To generate the synthetic data we first create a HighDimSynthesizer instance using the meta data extracted from the original dataset:

from synthesized import HighDimSynthesizer, MetaExtractor
df_meta = MetaExtractor.extract(df_orig)
synth = HighDimSynthesizer(df_meta)

The HighDimSynthesizer instance is then trained using the train subset of the original data that was defined above - the test subset is held back to prevent any possible data leakage.

synth.learn(df_train)

Rather than simply calling the HighDimSynthesizer.synthesize() method, the ConditionalSampler class can be used to generate completely new synthetic data in which the two classes of the target variable have been rebalanced to occur in equal proportions.

An instance of the ConditionalSampler can be created by passing in a trained instance of the HighDimSynthesizer. To generate rebalanced data, the desired distributions of the classes in the target column are specified using the explicit_marginals argument of the sample() method:

from synthesized import ConditionalSampler
sampler = ConditionalSampler(synth)
explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}
df_synth = sampler.sample(
    num_rows=len(df_train),
    explicit_marginals=explicit_marginals,
    max_trials=40
)

To verify that the distribution of the target column has indeed been rebalanced, the Assessor module can be used to visually inspect the distributions of the variables in the synthetic data compared to the original.

from synthesized.testing import Assessor
assessor = Assessor(df_meta)
assessor.show_distributions(df_train, df_synth);
Figure: Distributions of the rebalanced synthetic and original datasets

As demonstrated in the distributions, the target variable has been rebalanced such that the classes now appear in equal proportions (50% each). The rebalancing of the target variable has also had an effect on the distributions of the other variables in the dataset. This is because correlations between variables are learnt by the HighDimSynthesizer and are used to generate the rebalanced synthetic data. For example, the peak of the age distribution has been shifted to the left, implying that younger individuals may be more likely to default.
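The new class proportions can also be checked numerically (standard pandas, not part of the original walkthrough):

df_synth[y_label].value_counts(normalize=True)  # both classes should sit close to 0.5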

The training of a RidgeClassifier model can now be conducted using the rebalanced synthetic dataset as the training data. For a fair comparison, this model is evaluated on the same test dataset that was used to evaluate the model trained on the original data. It is important to note that synthetic data should never be used as test data; it should only be used when training the model of interest.

preprocessor = ModellingPreprocessor(target=y_label)
preprocessor.fit(pd.concat([df_orig, df_synth]))

x_synth, y_synth = preprocess(df_synth, preprocessor)
# re-encode the held-out test subset with the refitted preprocessor so the
# train and test feature encodings remain consistent
x_test, y_test = preprocess(df_test, preprocessor)

synth_classifier = RidgeClassifier()
synth_classifier.fit(x_synth, y_synth)

y_predict = synth_classifier.predict(x_test)

Again, using the ROC-AUC as a metric of model performance, we see a substantial increase in the new model's ability to distinguish the two classes when compared with the original model:

synth_roc_auc = roc_auc_score(y_test, y_predict)
synth_roc_auc
0.7306853423683158

Because synthetic data in the SDK is generated from random noise, if a new set of synthetic data were generated and a linear model trained on the resulting dataset, we would observe the ROC-AUC fluctuating around a mean value.
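One way to observe this fluctuation is to repeat the sample-train-evaluate loop a few times and look at the spread of scores. The sketch below is assembled from the calls used earlier in the tutorial; the number of repetitions is an arbitrary choice:

import numpy as np

scores = []
for _ in range(5):
    # draw a fresh rebalanced synthetic dataset
    df_rep = sampler.sample(
        num_rows=len(df_train),
        explicit_marginals=explicit_marginals,
        max_trials=40,
    )
    x_rep, y_rep = preprocess(df_rep, preprocessor)
    clf = RidgeClassifier().fit(x_rep, y_rep)
    scores.append(roc_auc_score(y_test, clf.predict(x_test)))

print(f"mean ROC-AUC: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")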

synth_roc_auc / orig_roc_auc
1.2811162074435958

By comparing the ROC-AUC of the model trained with synthetic data to that of the model trained with the original data, we see that the new model displays a 25-30% improvement in performance.

Bootstrapping Original Data

In the above, we used rebalanced, purely synthetic data to train a linear classification model. However, the ConditionalSampler offers an alternative way to generate rebalanced data: its alter_distributions() method augments the original data with synthetic samples of the minority class.

The same ConditionalSampler instance created above will be used, but we will configure it to generate a dataset containing a mixture of real and synthetic data.

When creating such a dataset, care must be taken when performing the train/test split. As mentioned above, it is important to ensure that no synthetic data is used to test the model; therefore only the distributions of the train dataset should be altered.

explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}
df_altered = sampler.alter_distributions(
    df_train,
    num_rows=len(df_train),
    explicit_marginals=explicit_marginals,
    max_trials=40
)

x_altered, y_altered = preprocess(df_altered, preprocessor)

altered_classifier = RidgeClassifier()
altered_classifier.fit(x_altered, y_altered)

y_predict = altered_classifier.predict(x_test)

altered_roc_auc = roc_auc_score(y_test, y_predict)
altered_roc_auc
0.7356642328809161

Using the bootstrapped data, we see a ROC-AUC score very similar to the one obtained when the model was trained with purely synthetic data.
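To finish, the three scores computed in this tutorial can be printed side by side (simple summary code, not part of the original walkthrough):

for name, score in [
    ("original data", orig_roc_auc),
    ("rebalanced synthetic data", synth_roc_auc),
    ("bootstrapped data", altered_roc_auc),
]:
    print(f"{name:>28}: ROC-AUC = {score:.3f}")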