Data Evaluation Framework
Data quality can be assessed in many ways, so it helps to structure assessments so that every important area is covered. Below we present a structure based on testing progressively deeper correlations within the data. As each successive test passes, you gain more confidence that the generated data has captured the features of the raw data.
Successively deeper correlations
- Column-wise distributions are preserved
- Pairwise correlations between columns are preserved
- Deeper underlying trends in the data are preserved
- ML models trained on both datasets give the same results
- Biases are detected and presented
- Privacy is maintained in the synthesized dataset
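As a rough illustration of the first level in the list above, the sketch below compares one numeric column from the raw data with the same column from the generated data using the Kolmogorov-Smirnov statistic and the one-dimensional Earth Mover's distance. It uses only the Python standard library and is not the SDK's implementation; the function names and the toy `raw_age` / `synth_age` columns are made up for illustration. In practice, `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` compute the same quantities.

```python
from bisect import bisect_right


def ks_statistic(raw, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    raw_sorted, synth_sorted = sorted(raw), sorted(synth)
    points = sorted(set(raw_sorted) | set(synth_sorted))
    return max(
        abs(bisect_right(raw_sorted, x) / len(raw_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted))
        for x in points
    )


def earth_movers_distance(raw, synth):
    """1-D Earth Mover's (Wasserstein-1) distance: area between the empirical CDFs."""
    raw_sorted, synth_sorted = sorted(raw), sorted(synth)
    points = sorted(set(raw_sorted) | set(synth_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at `left`.
        cdf_gap = abs(bisect_right(raw_sorted, left) / len(raw_sorted)
                      - bisect_right(synth_sorted, left) / len(synth_sorted))
        total += cdf_gap * (right - left)
    return total


# Toy columns, invented for this example.
raw_age = [23, 35, 35, 41, 52, 60, 60, 67]
synth_age = [24, 33, 36, 40, 55, 58, 61, 66]

ks = ks_statistic(raw_age, synth_age)
emd = earth_movers_distance(raw_age, synth_age)
print(f"KS statistic: {ks:.3f}, Earth Mover's distance: {emd:.3f}")
```

The KS statistic lies between 0 and 1 regardless of the column's units, whereas the Earth Mover's distance is expressed in the units of the column, so any tolerance applied to it needs to be scaled to the column's range.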
Data quality metrics
| To Test | Use | Included in SDK |
|---|---|---|
| Column-wise distributions are preserved | Earth Mover’s distance and Kolmogorov-Smirnov distance within 5% of raw data | Yes - with histogram visualization. See Univariate Metrics. |
| Pairwise correlations between columns are preserved | Cramér’s V, Kendall’s tau, and McFadden’s pseudo-R² (i.e. categorical logistic regression) within 5% of raw data | Yes - with comparison matrix visualization. See Interaction Metrics and Comparing Joint Distributions. |
| Deeper underlying trends in the data are preserved / ML models trained on both datasets give the same results | ML models trained on raw and on synthesized data, with performance on synthesized data within 5% of that on raw data, measured by accuracy, precision, ROC/AUC, lift, F1 score, confusion matrices, etc. | Yes - see Predictive Modelling Score |
| Biases are detected and presented | The above distances, with proportions measured between the categories within columns, presented clearly so humans can detect the bias | Yes - included in FairLens, which Synthesized released as open source. |
| Privacy is maintained in the synthesized dataset | t-closeness vs k-distance comparison with an accepted privacy-utility threshold (e.g. no data with t-closeness > 0.3 and k-distance < 0.05). Run attribute inference attacks on generated data. | Yes - with visualization including highlighted linkage attack area, and pre-built attribute inference attack function. See Attribute Inference ML. |
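To make the pairwise-correlation row more concrete, here is a minimal sketch that computes Cramér's V for a pair of categorical columns in the raw data and in the generated data, then compares the two values. It is a standard-library illustration rather than the SDK's Interaction Metrics. The column values are invented, and reading "within 5%" as an absolute difference of at most 0.05 is an assumption made for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical columns (no bias correction)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals, y_totals = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / n  # count expected under independence
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    min_dim = min(len(x_totals), len(y_totals)) - 1
    return sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0.0


# Toy data, invented for this example: two categorical columns per dataset.
raw_region = ["a"] * 4 + ["b"] * 4
raw_plan = ["u", "u", "u", "v", "v", "v", "v", "u"]    # plan depends on region
synth_region = ["a"] * 4 + ["b"] * 4
synth_plan = ["u", "u", "v", "v", "v", "v", "u", "u"]  # plan independent of region

TOLERANCE = 0.05  # assumed reading of "within 5% of raw data"

v_raw = cramers_v(raw_region, raw_plan)
v_synth = cramers_v(synth_region, synth_plan)
preserved = abs(v_raw - v_synth) <= TOLERANCE
print(f"raw V={v_raw:.2f}, synthetic V={v_synth:.2f}, preserved={preserved}")
```

In this toy case the generated data has lost the association between the two columns, so the check fails even though each column's own distribution is identical in both datasets. This is why the pairwise tests are run in addition to the column-wise ones.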