Data Evaluation Framework

Data quality can be assessed in many ways, so it is important to structure the assessment so that every important area is covered. Below we present a structure based on testing progressively deeper correlations within the data. With each successive test that passes, you gain more confidence that the generated data has captured the features of the raw data.

Successively deeper correlations

  1. Column-wise distributions are preserved

  2. Pairwise correlations between columns are preserved

  3. Deeper underlying trends in the data are preserved

  4. ML models trained on both datasets give the same results

  5. Biases are detected and presented

  6. Privacy is maintained in the synthesized dataset
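
As a rough illustration of the first check in the list above, the sketch below computes the Kolmogorov-Smirnov distance and the earth mover's distance between one raw column and its synthesized counterpart. It is written in plain Python to stay dependency-free and is not the SDK's implementation; the function names and the toy data are assumptions made for this example. In practice, `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` compute the same two quantities.

```python
from bisect import bisect_right


def _ecdf_gaps(raw, synth):
    """Return (x, |F_raw(x) - F_synth(x)|) at every observed value, in order."""
    raw, synth = sorted(raw), sorted(synth)
    points = sorted(set(raw) | set(synth))
    return [
        (x, abs(bisect_right(raw, x) / len(raw) - bisect_right(synth, x) / len(synth)))
        for x in points
    ]


def ks_distance(raw, synth):
    """Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    return max(gap for _, gap in _ecdf_gaps(raw, synth))


def earth_movers_distance(raw, synth):
    """1-D earth mover's distance: the area between the two empirical CDFs."""
    gaps = _ecdf_gaps(raw, synth)
    return sum(gap * (nxt[0] - x) for (x, gap), nxt in zip(gaps, gaps[1:]))


# Toy column values, made up for illustration only.
raw_age = [1, 2, 3, 4]
synth_age = [2, 3, 4, 5]

ks = ks_distance(raw_age, synth_age)
emd = earth_movers_distance(raw_age, synth_age)
print(f"KS = {ks:.2f}, EMD = {emd:.2f}")
```

The KS distance always lies between 0 and 1, whereas the earth mover's distance is expressed in the units of the column, so a fixed acceptance threshold for it only makes sense after some normalization (for example by the range of the raw column, which is an assumption here, not something this page specifies).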

Data quality metrics
Table 1. Evaluation metrics

| To test | Use | Included in SDK |
| --- | --- | --- |
| Column-wise distributions are preserved | Earth mover's distance and Kolmogorov-Smirnov distance within 5% of raw data | Yes - with histogram visualization. See Univariate Metrics. |
| Pairwise correlations between columns are preserved | Cramér’s V, Kendall’s tau, and McFadden’s pseudo-R² (i.e. categorical logistic regression) within 5% of raw data | Yes - with comparison matrix visualization. See Interaction Metrics and Comparing Joint Distributions. |
| Deeper underlying trends in the data are preserved / ML models trained on both datasets give the same results | Performance of ML models trained on synthesized data within 5% of that of models trained on raw data, in terms of accuracy, precision, ROC AUC, lift, F1 score, confusion matrices, etc. | Yes - see Predictive Modelling Score. |
| Biases are detected and presented | The above distances, with proportions measured between the categories within columns, presented clearly so that humans can detect the bias | Yes - included in FairLens, which Synthesized released as open source. |
| Privacy is maintained in the synthesized dataset | t-closeness vs k-distance comparison with an accepted privacy-utility threshold (e.g. no data with t-closeness > 0.3 and k-distance < 0.05). Run attribute inference attacks on the generated data. | Yes - with visualization including a highlighted linkage attack area, and a pre-built attribute inference attack function. See Attribute Inference ML. |
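
To make the pairwise check in Table 1 more concrete, here is a minimal sketch of Cramér’s V for two categorical columns, computed once on the raw data and once on the synthesized data. It uses the plain chi-squared definition without bias correction and is not the SDK's Interaction Metrics code. The helper names, the toy data, and the reading of "within 5%" as an absolute difference of at most 0.05 are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two equal-length categorical sequences (no bias correction)."""
    n = len(x)
    row_totals, col_totals = Counter(x), Counter(y)
    observed = Counter(zip(x, y))
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            # Counter returns 0 for category pairs that never occur together.
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    dof = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * dof)) if dof > 0 else 0.0


def association_gap(raw_pair, synth_pair):
    """Absolute difference in Cramér's V between raw and synthesized column pairs."""
    return abs(cramers_v(*raw_pair) - cramers_v(*synth_pair))


# Toy data, made up for illustration: the raw columns are perfectly associated,
# while the synthesized columns are independent.
raw_pair = (["a", "a", "b", "b"], ["u", "u", "v", "v"])
synth_pair = (["a", "a", "b", "b"], ["u", "v", "u", "v"])

TOLERANCE = 0.05  # assumed interpretation of "within 5% of raw data"
gap = association_gap(raw_pair, synth_pair)
print(f"gap = {gap:.2f}, preserved = {gap <= TOLERANCE}")
```

Repeating this for every pair of categorical columns gives the two matrices that a comparison-matrix visualization would place side by side.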
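
The ML row of Table 1 compares models trained on raw data with models trained on synthesized data. One common way to set this up, sketched below, is to train one model on each dataset and score both on the same held-out raw test set. A tiny 1-nearest-neighbour classifier stands in for a real model so that the example stays self-contained; it is not the SDK's Predictive Modelling Score, and all names and data are hypothetical.

```python
def predict_1nn(train_X, train_y, x):
    """Label of the training row closest to x (squared Euclidean distance)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]


def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test rows whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train_X, train_y, x) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


# Toy data, made up for illustration only.
raw_train_X, raw_train_y = [[0.0], [1.0], [9.0], [10.0]], [0, 0, 1, 1]
synth_train_X, synth_train_y = [[0.5], [1.5], [8.5], [9.5]], [0, 0, 1, 1]
raw_test_X, raw_test_y = [[2.0], [8.0]], [0, 1]

# Both models are evaluated on the same held-out raw test set.
acc_raw = accuracy(raw_train_X, raw_train_y, raw_test_X, raw_test_y)
acc_synth = accuracy(synth_train_X, synth_train_y, raw_test_X, raw_test_y)
utility_gap = abs(acc_raw - acc_synth)

print(f"raw = {acc_raw:.2f}, synth = {acc_synth:.2f}, gap = {utility_gap:.2f}")
```

The same pattern applies to the other scores named in the table (precision, F1, ROC AUC, and so on): swap the `accuracy` function for the metric of interest and compare the two values against the chosen tolerance.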