Data Evaluation Framework

Data quality can be assessed in many ways, so it is important to structure the assessment so that every relevant area is covered. Below we present a structure based on testing progressively deeper correlations within the data. Each successive test that passes gives more confidence that the generated data has captured the features of the raw data.

Successively deeper correlations

Column-wise distributions are preserved
Pairwise correlations between columns are preserved
Deeper underlying trends in the data are preserved
ML models trained on both datasets give the same results
Biases are detected and presented
Privacy is maintained in the synthesized dataset

Data quality metrics

Table 1. Evaluation metrics
To test	Use	Included in SDK
Column-wise distributions are preserved	Earth Mover's distance and Kolmogorov-Smirnov distance within 5% of raw data	Yes - with histogram visualization. See Univariate Metrics.
Pairwise correlations between columns are preserved	Cramér’s V, Kendall's tau, and McFadden’s pseudo-R² (i.e. categorical logistic regression) within 5% of raw data	Yes - with comparison matrix visualization. See Interaction Metrics and Comparing Joint Distributions.
Deeper underlying trends in the data are preserved / ML models trained on both datasets give the same results	Performance of ML models trained on synthesized data is within 5% of that of models trained on raw data, measured by accuracy, precision, ROC/AUC, lift, F1 score, confusion matrices, etc.	Yes - see Predictive Modelling Score.
Biases are detected and presented	The distances above, applied to proportions measured between the categories within columns and presented clearly so that humans can detect the bias	Yes - included in FairLens, which Synthesized released as open source.
Privacy is maintained in the synthesized dataset	t-closeness vs k-distance comparison against an accepted privacy-utility threshold (e.g. no data with t-closeness > 0.3 and k-distance < 0.05). Run attribute inference attacks on generated data.	Yes - with visualization including a highlighted linkage attack area, and a pre-built attribute inference attack function. See Attribute Inference ML.
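
The first two rows of Table 1 can be illustrated with a minimal Python sketch that uses only the standard library. It is not the SDK's implementation: the function names (ks_distance, earth_movers_distance, cramers_v) and the sample values are hypothetical, and reading "within 5%" as an absolute tolerance of 0.05 is an assumption, since the table does not say how the 5% is measured.

```python
from bisect import bisect_right
from collections import Counter
from math import sqrt

TOLERANCE = 0.05  # ASSUMPTION: "within 5%" read as an absolute tolerance


def _ecdf_gaps(raw, synth):
    """Return [(x, |F_raw(x) - F_synth(x)|)] for every observed value x."""
    a, b = sorted(raw), sorted(synth)
    points = sorted(set(a) | set(b))
    return [
        (x, abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)))
        for x in points
    ]


def ks_distance(raw, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    return max(gap for _, gap in _ecdf_gaps(raw, synth))


def earth_movers_distance(raw, synth):
    """1-D Earth Mover's distance: area between the two empirical CDFs."""
    gaps = _ecdf_gaps(raw, synth)
    # Both ECDFs are constant between consecutive observed values.
    return sum((gap * (nxt - x) for (x, gap), (nxt, _) in zip(gaps, gaps[1:])), 0.0)


def cramers_v(xs, ys):
    """Cramér's V for two categorical columns (no bias correction)."""
    n = len(xs)
    joint, rows, cols = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, row_total in rows.items():
        for y, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical columns, for illustration only.
raw_age = [23, 35, 41, 52, 52, 60, 67, 71]
synth_age = [25, 33, 40, 50, 55, 61, 66, 70]
print("KS distance:", ks_distance(raw_age, synth_age))
print("Earth Mover's distance:", earth_movers_distance(raw_age, synth_age))

raw_region = ["N", "N", "S", "S", "N", "S"]
raw_plan = ["a", "a", "b", "b", "a", "a"]
synth_region = ["N", "S", "S", "N", "N", "S"]
synth_plan = ["a", "b", "b", "a", "a", "a"]
v_gap = abs(cramers_v(raw_region, raw_plan) - cramers_v(synth_region, synth_plan))
print("Cramér's V preserved:", v_gap <= TOLERANCE)
```

The Kolmogorov-Smirnov distance is already bounded between 0 and 1, so a fixed tolerance applies to it directly. The Earth Mover's distance is expressed in the units of the column, so in practice it would need to be normalized (for example by the range of the raw column) before being compared against a percentage threshold.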