Rebalancing

The full source code for this example is available for download here.

In this tutorial we will demonstrate how to alter the distributions of a highly imbalanced dataset, using the technique of data rebalancing, in order to improve the performance of a classification model on an underrepresented group.

For more information on the techniques used in this tutorial, and an in-depth discussion on reducing data bias using these techniques, see our blog post or the documentation.

Credit Dataset

In this tutorial we will use a public credit scoring dataset from Kaggle, also available with the synthesized_datasets package:

import synthesized_datasets
import pandas as pd

df_orig = synthesized_datasets.CREDIT.credit.load()
df_orig

        SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  ...  NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents
0                      1                              0.766127   45  ...                                     0                 2.0
1                      0                              0.957151   40  ...                                     0                 1.0
2                      0                              0.658180   38  ...                                     0                 0.0
3                      0                              0.233810   30  ...                                     0                 0.0
4                      0                              0.907239   49  ...                                     0                 0.0
...                  ...                                   ...  ...  ...                                   ...                 ...
149995                 0                              0.040674   74  ...                                     0                 0.0
149996                 0                              0.299745   44  ...                                     0                 2.0
149997                 0                              0.246044   58  ...                                     0                 0.0
149998                 0                              0.000000   30  ...                                     0                 0.0
149999                 0                              0.850283   64  ...                                     0                 0.0

[150000 rows x 11 columns]

The binary classification column "SeriousDlqin2yrs", denoting whether someone has defaulted on a loan within the last 2 years, will be the target variable while the remaining columns will be explanatory variables that will be used to train a classification model.

y_label = "SeriousDlqin2yrs"
x_labels = [col for col in df_orig.columns if col != y_label]

The target column is highly skewed: only about 6.7% of the rows belong to the positive class, so the dataset is highly imbalanced.

value_counts = df_orig[y_label].value_counts()
value_counts

>>> 0    139974
>>> 1     10026
>>> Name: SeriousDlqin2yrs, dtype: int64

# Bar chart of the class counts, ordered by class label.
df_orig[y_label].value_counts().sort_index().plot.bar()

Figure: Skewed credit dataset
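
To put the skew into numbers, the class counts printed above can be turned into a minority share and an imbalance ratio. This is a small stand-alone sketch using only the counts shown in the output; it does not need the dataset to be loaded.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 139974, 1: 10026}

total = sum(counts.values())
minority_share = counts[1] / total          # fraction of rows in the positive class
imbalance_ratio = counts[0] / counts[1]     # majority rows per minority row

print(f"total rows:     {total}")
print(f"minority share: {minority_share:.2%}")
print(f"majority to minority ratio: {imbalance_ratio:.1f} to 1")
```

Roughly one row in fifteen is a default, which is the signal the classifier has to pick up.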

Training a Linear Classification Model

In the following, a linear RidgeClassifier model will be used to predict the value of the target variable (SeriousDlqin2yrs, i.e. whether a person will default) using the remaining columns as explanatory variables. A train/test split will be used to evaluate the performance of the model on that task.

The preprocess() convenience function prepares the data used to train the RidgeClassifier model, using a fitted instance of the ModellingPreprocessor class available in the Synthesized SDK. The preprocessing label-encodes or one-hot encodes the categorical columns and transforms the continuous columns using a StandardScaler.

from synthesized.insight.modelling import ModellingPreprocessor

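def preprocess(
    df: pd.DataFrame,
    preprocessor: ModellingPreprocessor
):
    """Transform df with a fitted preprocessor and split it into features x and target y."""
    df_processed = preprocessor.transform(df)
    # Remove the target column from the features and keep it as the label array.
    y = df_processed.pop(preprocessor.target).to_numpy()
    x = df_processed.to_numpy()
    return x, y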
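
For readers unfamiliar with the two transformations named above, the following is a minimal stand-alone sketch of standard scaling and one-hot encoding. It is not the ModellingPreprocessor implementation; the helper names `standard_scale` and `one_hot` are hypothetical and only the Python standard library is used.

```python
from statistics import fmean, pstdev

def standard_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Map each category to a 0/1 indicator vector, one position per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = [45, 40, 38, 30, 49]          # first five "age" values shown in the table above
scaled_ages = standard_scale(ages)
encoded = one_hot(["a", "b", "a"])   # made-up categories, for illustration only

print(scaled_ages)
print(encoded)
```

Scaling puts all continuous columns on a comparable footing, which matters for a regularised linear model such as RidgeClassifier.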

In the first instance, the RidgeClassifier will be trained on the original data:

from sklearn.model_selection import train_test_split

test_size = 0.2

df_train, df_test = train_test_split(
    df_orig,
    test_size=test_size,
    stratify=df_orig[y_label],
    random_state=42,
)

preprocessor = ModellingPreprocessor(target=y_label)
preprocessor.fit(df_orig)

x_train, y_train = preprocess(df_train, preprocessor)
x_test, y_test = preprocess(df_test, preprocessor)

The RidgeClassifier model will be fitted using the train subset of the data. The ability of the model to classify unseen data will then be evaluated using the test subset:

from sklearn.linear_model import RidgeClassifier
orig_classifier = RidgeClassifier()
orig_classifier.fit(x_train, y_train)

y_predict = orig_classifier.predict(x_test)

The area under the ROC-curve (AUC-ROC) is used as a means to evaluate the quality of the model in classifying unseen test data. The value of the AUC-ROC varies between 0 and 1, with 1 implying perfect separability between the two classes and 0 implying the exact opposite, i.e. the model is predicting 0’s as 1’s and 1’s as 0’s in our case. A value of 0.5 means that the model hasn’t learnt any difference between the classes at all and has no predictive capacity.
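
The three reference values described above can be reproduced with a small sketch. It relies on the standard interpretation of the AUC-ROC as the fraction of (positive, negative) pairs that the model ranks correctly; the function `roc_auc` below is an illustrative, quadratic-time helper, not the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
print(roc_auc(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0: every positive outranks every negative
print(roc_auc(y_true, [0.9, 0.8, 0.2, 0.1]))  # 0.0: the ranking is exactly inverted
print(roc_auc(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5: the scores carry no information
```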

from sklearn.metrics import roc_auc_score
orig_roc_auc = roc_auc_score(y_test, y_predict)
orig_roc_auc

>>> 0.5703505568994107

The value of ~0.57 implies that the model has some predictive capacity, but not much better than a random guess.
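
One detail worth keeping in mind when reading this number: the code above passes the hard class labels returned by predict() to roc_auc_score, rather than continuous scores. With 0/1 predictions the ROC curve has a single interior point, and the AUC reduces to the mean of the true positive rate and the true negative rate (the balanced accuracy). The sketch below checks this on made-up labels; the helper names are hypothetical.

```python
def pairwise_auc(y_true, y_pred):
    """AUC as the fraction of correctly ranked (positive, negative) pairs, ties count half."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Made-up labels for illustration: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

auc = pairwise_auc(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(auc, bal)
```

RidgeClassifier also exposes a decision_function() method that returns continuous scores; an AUC-ROC computed from those scores measures ranking quality and would generally differ from the values reported in this tutorial.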

Using Rebalanced Synthetic Data

The performance of the linear RidgeClassifier model in the previous section was, on the whole, pretty poor. The substandard results can be traced back to the data used to train the model in the first place - the highly imbalanced nature of the dataset means that there is only a very faint signal from the target variable.

A naive solution to this problem is to create a new dataset by oversampling the minority class and undersampling the majority class in order to achieve the desired distribution of classes. However, this method has a clear drawback: because undersampling discards rows from the majority class, the new dataset may be significantly smaller than the original. With this traditional technique, the issue is transformed from not having enough quality data to potentially not having enough data at all.
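
The loss of data is easy to quantify. The sketch below applies plain random undersampling (one simple variant of the naive approach) to toy data with the same class counts as the credit dataset; `undersample_majority` is a hypothetical helper written for this illustration.

```python
import random

def undersample_majority(rows, label_of, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data with the same class counts as the credit dataset; only the label is kept.
rows = [0] * 139974 + [1] * 10026
balanced = undersample_majority(rows, label_of=lambda r: r)
print(len(rows), "->", len(balanced))  # 150000 -> 20052
```

A 50/50 split obtained this way keeps under 14% of the original rows.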

More advanced methods, such as SMOTE, create entirely new data points for the minority class to augment the original dataset with. The downside of SMOTE is that it does not model the overall statistics of the original data, meaning that correlations between variables can be lost, which may degrade model performance.
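
As a rough illustration of how SMOTE builds a new minority point, the sketch below interpolates between a minority sample and one of its neighbours. This is a simplification of the core step only: real implementations, such as the one in the imbalanced-learn package, also search for the k nearest minority neighbours. The function name and the example values are made up for this illustration.

```python
import random

def smote_like_point(x, neighbour, rng):
    """Return a random point on the line segment between x and neighbour."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbour)]

rng = random.Random(0)
x = [0.2, 30.0]          # made-up minority sample: (utilisation, age)
neighbour = [0.4, 50.0]  # made-up neighbouring minority sample
new_point = smote_like_point(x, neighbour, rng)
print(new_point)
```

Each synthetic point depends only on a local pair of samples, not on a model of the dataset as a whole.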

Alternatively, the deep generative models utilised in the HighDimSynthesizer can be used to learn the statistical properties and correlations present in the original data and synthesize a dataset containing columns adhering to user-defined distributions.

To generate the synthetic data we first create a HighDimSynthesizer instance using the metadata extracted from the original dataset:

from synthesized import HighDimSynthesizer, MetaExtractor
df_meta = MetaExtractor.extract(df_orig)
synth = HighDimSynthesizer(df_meta)

The HighDimSynthesizer instance is then trained using the train subset of the original data that was defined above. The test subset is held back to prevent any possible data leaks.

synth.learn(df_train)

Rather than simply calling the HighDimSynthesizer.synthesize() method, the ConditionalSampler class can be used to generate completely new, synthetic data in which the two classes of the target variable occur in equal proportions.

An instance of the ConditionalSampler can be created by passing in a trained instance of the HighDimSynthesizer. To generate rebalanced data, the desired distributions of the classes in the target column are specified using the explicit_marginals argument of the sample() method:

from synthesized import ConditionalSampler
sampler = ConditionalSampler(synth)
explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}
df_synth = sampler.sample(
    num_rows=len(df_train),
    explicit_marginals=explicit_marginals,
    max_trials=40
)
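
To make the meaning of the marginals concrete, the sketch below converts a specification in the same shape as explicit_marginals into expected row counts per class. The helper `expected_counts` is hypothetical and is not part of the Synthesized SDK; it only illustrates the arithmetic, and the row count assumes the 80/20 split of 150,000 rows used above.

```python
def expected_counts(marginals, num_rows):
    """Convert [(category, probability), ...] into expected row counts."""
    total = sum(p for _, p in marginals)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return {category: round(p * num_rows) for category, p in marginals}

explicit_marginals = {"SeriousDlqin2yrs": [(0, 0.5), (1, 0.5)]}
num_rows = 120_000  # assumed len(df_train): 80% of 150,000 rows

counts = expected_counts(explicit_marginals["SeriousDlqin2yrs"], num_rows)
print(counts)  # {0: 60000, 1: 60000}
```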

As a quick sanity check and to verify that the distribution of the target column has indeed been rebalanced, the Assessor module can be utilised in order to visually inspect the distributions of the continuous and categorical variables in the synthetic data compared to the original.

from synthesized.testing import Assessor
assessor = Assessor(df_meta)
assessor.show_distributions(df_train, df_synth);

Figure: Distributions of the rebalanced synthetic and original datasets

As demonstrated in the distributions, the target variable has been rebalanced such that the classes now appear in a 50/50 split. Due to the synthetic dataset having a greater proportion of the minority class than the original, some distributions have been changed. For example, the peak of the age distribution has been shifted to the left implying that younger individuals may be more likely to default.

The training of a RidgeClassifier model can now be conducted using the rebalanced synthetic dataset as the training data. For fairness of comparison, the same test dataset will be used to evaluate the performance of this model as was used when evaluating the performance of the model trained with original data. It is important to note that synthetic data should never be used as test data and should only be used when training the model of interest.

preprocessor = ModellingPreprocessor(target=y_label)
preprocessor.fit(pd.concat([df_orig, df_synth]))

x_synth, y_synth = preprocess(df_synth, preprocessor)

synth_classifier = RidgeClassifier()
synth_classifier.fit(x_synth, y_synth)

y_predict = synth_classifier.predict(x_test)

Using the AUC-ROC as a metric of model performance, the model trained with rebalanced data shows a substantial improvement in distinguishing the two classes over the model trained with original data:

synth_roc_auc = roc_auc_score(y_test, y_predict)
synth_roc_auc

>>> 0.7306853423683158

synth_roc_auc / orig_roc_auc

>>> 1.2811162074435958

In general, if a new set of synthetic data were generated using the same ConditionalSampler instance, and a linear model trained with the resulting dataset, we would observe that the AUC-ROC fluctuates around a mean value. This non-deterministic behaviour is entirely deliberate and is due to the careful injection of noise at various stages of the HighDimSynthesizer training in order to ensure that there is no one-to-one mapping between any row in the synthetic data and any row in the original. Data anonymization is a key benefit of synthetic data over traditional techniques. For more information on synthetic data privacy, see the documentation.

While we would expect the precise value of synth_roc_auc to fluctuate around a mean, in this specific example the linear classifier model trained using synthetic data demonstrates a 25-30% improvement over the same model trained with original data.
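
For this particular run, the relative improvement follows directly from the two scores reported above. The snippet below simply repeats that arithmetic with the printed values.

```python
orig_roc_auc = 0.5703505568994107   # reported above for the original data
synth_roc_auc = 0.7306853423683158  # reported above for the rebalanced synthetic data

relative_improvement = synth_roc_auc / orig_roc_auc - 1
print(f"{relative_improvement:.1%}")  # 28.1%
```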

Bootstrapping Original Data

In the above, we used rebalanced, purely synthetic data to train a linear classification model. However, the ConditionalSampler offers an alternative means to generate rebalanced data by augmenting the original data with synthetic data composed of the minority class using the alter_distributions() method.

The same ConditionalSampler instance created above will be utilised, but now to generate a DataFrame containing a mixture of real and synthetic data in the correct proportions such that we achieve the desired distribution of values.

When creating such a dataset, care must be taken when performing the test/train split. As mentioned above, it is important to ensure that no synthetic data is in the test portion (as it is desired to evaluate the performance of the linear classifier rather than the HighDimSynthesizer), so only the distributions of the train dataset should be altered.

explicit_marginals = {y_label: [(0, 0.5), (1, 0.5)]}
df_altered = sampler.alter_distributions(
    df_train,
    num_rows=len(df_train),
    explicit_marginals=explicit_marginals,
    max_trials=40
)

x_altered, y_altered = preprocess(df_altered, preprocessor)

altered_classifier = RidgeClassifier()
altered_classifier.fit(x_altered, y_altered)

y_predict = altered_classifier.predict(x_test)

altered_roc_auc = roc_auc_score(y_test, y_predict)
altered_roc_auc

>>> 0.7356642328809161

With bootstrapped data, the linear classification model reaches an AUC-ROC of ~0.74, an improvement very similar to the one obtained with purely synthetic data (~0.73).