Predictive Utility#

In addition to comparing statistical metrics, Synthesized can train machine learning algorithms on the synthetic data and the original data to perform an arbitrary classification or regression task. The performance of the models on a hold-out test set of original data can be compared to determine whether the utility of the synthetic data has been maintained.

Predictive Modelling Score#

Synthesized provides the function synthesized.insight.metrics.predictive_modelling_score, which calculates an appropriate modelling metric for the given dataset using the specified model. The model parameter can be one of the following strings:

  • "Linear": linear regression model

  • "Logistic": logistic regression model

  • "GradientBoosting": gradient boosted decision tree

  • "RandomForest": random forest

  • "MLP": multi-layer perceptron (feed-forward neural network)

  • "LinearSVM": linear support vector machine

or alternatively a custom model class that inherits from sklearn.base.BaseEstimator together with either the sklearn.base.ClassifierMixin or sklearn.base.RegressorMixin mixin.
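As a concrete illustration, a custom model could look like the following (a minimal sketch; the class name and its trivial prediction logic are illustrative only and not part of the Synthesized API):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy classifier that always predicts the most frequent training class."""

    def fit(self, X, y):
        # Remember the majority class seen during training.
        values, counts = np.unique(y, return_counts=True)
        self.classes_ = values
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def predict_proba(self, X):
        # Degenerate probabilities: all mass on the majority class.
        proba = np.zeros((len(X), len(self.classes_)))
        proba[:, np.where(self.classes_ == self.majority_)[0][0]] = 1.0
        return proba
```

An instance of such a class can then be passed as the model argument in place of one of the string identifiers above.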

The function automatically determines whether the prediction task is a classification or regression problem, and returns either the ROC-AUC or R-squared metric, respectively. All necessary preprocessing (standard scaling, one-hot encoding, etc.) is performed under the hood.
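The task-detection logic can be thought of roughly as follows (a simplified heuristic for illustration; this is not Synthesized's actual implementation, and the function name and threshold are made up):

```python
import pandas as pd

def infer_task(df: pd.DataFrame, y_label: str, max_classes: int = 20) -> str:
    """Heuristic: treat low-cardinality or non-numeric targets as classification."""
    target = df[y_label]
    if not pd.api.types.is_numeric_dtype(target) or target.nunique() <= max_classes:
        return "classification"  # would be scored with ROC-AUC
    return "regression"          # would be scored with R-squared
```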

In [1]: from synthesized.insight.metrics import predictive_modelling_score

In [2]: target = "column_to_predict"

In [3]: predictors = ["column_a", "column_b", "column_c"]

In [4]: score, metric, task = predictive_modelling_score(df_original, y_label=target, x_labels=predictors, model="GradientBoosting")

Predictive Modelling Comparison#

Synthesized can automatically train models and compare their performance on the original and synthetic data using the synthesized.insight.metrics.predictive_modelling_comparison function. It requires the original data, the synthetic data, a target variable to predict, a list of predictor columns, and a model type.

In [5]: from synthesized.insight.metrics import predictive_modelling_comparison

In [6]: target = "column_to_predict"

In [7]: predictors = ["column_a", "column_b", "column_c"]

In [8]: score, synth_score, metric, task = predictive_modelling_comparison(
   ...:     df_original,
   ...:     df_synth,
   ...:     y_label=target,
   ...:     x_labels=predictors,
   ...:     model="GradientBoosting"
   ...: )
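A close match between the two scores indicates that the synthetic data has preserved predictive utility. One simple way to summarise the comparison is as a ratio (an illustrative helper, not part of the Synthesized API):

```python
def utility_ratio(orig_score: float, synth_score: float) -> float:
    """Fraction of the original model's score retained by the synthetic model."""
    if orig_score == 0:
        raise ValueError("original score is zero; ratio is undefined")
    return synth_score / orig_score

# e.g. an original ROC-AUC of 0.90 and a synthetic ROC-AUC of 0.85
utility_ratio(0.90, 0.85)  # ≈ 0.944
```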


Similar to the statistical quality assessment, the Assessor object can be used to visualize the performance of a given model. The following functions related to modelling can be called from an assessor object:

  • plot_classification_metrics: Plot the ROC curve, PR curve, and confusion matrix for the given classifier trained on two DataFrames and evaluated on the same dataset.

  • plot_classification_metrics_test: Plot the ROC curve, PR curve, and confusion matrix for the given classifier trained on the same DataFrame and evaluated on two different datasets.

  • utility: Compute the utility, i.e. the score of an estimator trained on synthetic data and tested on original data.
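The utility score follows the train-on-synthetic, test-on-real pattern, which can be sketched as follows (illustrative only, using toy generated data; the Assessor performs the equivalent steps internally):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_data(n):
    # Toy binary-classification data: label depends on the first feature.
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_synth, y_synth = make_data(200)  # stands in for synthetic training data
X_orig, y_orig = make_data(100)    # stands in for a hold-out of original data

# Train on synthetic data, evaluate on original data.
model = LogisticRegression().fit(X_synth, y_synth)
score = roc_auc_score(y_orig, model.predict_proba(X_orig)[:, 1])
```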

In [9]: from synthesized import MetaExtractor

In [10]: from synthesized.testing import Assessor

In [11]: df_meta = MetaExtractor.extract(df)

In [12]: assessor = Assessor(df_meta)

In [13]: from sklearn.linear_model import LogisticRegression

In [14]: assessor.plot_classification_metrics("TargetColumn", df_test, LogisticRegression())

Example plot of assessor.plot_classification_metrics().#