Privacy

Synthesized’s privacy module provides several ways to assess the robustness of synthesized data against different types of attribute inference attacks.

An attribute inference attack occurs when an attacker can deduce, with significant probability, the value of a hidden sensitive attribute from the values of other attributes. In practice, the attacker is assumed to have full access to the synthetic data and partial access to the original data. The attacker trains a model on the synthetic data and then uses the trained model to predict the unknown value of the sensitive attribute from the known attributes of the original data. It is therefore important to assess how vulnerable a synthetic dataset is to inference attacks, so that the privacy and confidentiality of the original data are preserved.

Synthesized provides two main classes to assess the risk of an attribute inference attack: AttributeInferenceAttackML and AttributeInferenceAttackCAP.

Depending on the class, a Machine Learning (ML) model or a Correct Attribution Probability (CAP) model is fitted to the synthetic data. The fitted model is then used to compute the privacy score of a sensitive column of the original dataset from the predictor columns of the original dataset. Privacy scores lie between 0 and 1: 0 means negligible privacy and 1 means absolute privacy.

Attribute Inference Attack using ML

AttributeInferenceAttackML trains a machine learning model to predict the sensitive attribute using the synthesized dataset. The fitted model is then used to predict the sensitive values in the original dataset. Finally, a privacy score is calculated based on the true value and the predicted value of the sensitive column in the original dataset.
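The three steps above (train on synthetic data, predict on original data, score the predictions) can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the helper names `fit_line` and `attack_privacy_score`, the toy columns `tenure` and `age`, and in particular the scoring rule (one minus the coefficient of determination, clipped to the range 0 to 1) are assumptions made for illustration only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def attack_privacy_score(orig, synth, predictor, sensitive):
    """Hypothetical score: 1 - R^2 of the attacker's predictions, clipped to [0, 1]."""
    # Step 1: the attacker trains on the synthetic data only.
    intercept, slope = fit_line(
        [row[predictor] for row in synth], [row[sensitive] for row in synth]
    )
    # Step 2: the trained model predicts the sensitive column of the original data.
    truth = [row[sensitive] for row in orig]
    preds = [intercept + slope * row[predictor] for row in orig]
    # Step 3: compare predictions with the true values (assumed scoring rule).
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, preds))
    ss_tot = sum((t - mean(truth)) ** 2 for t in truth)
    r_squared = 1 - ss_res / ss_tot
    return 1.0 - min(1.0, max(0.0, r_squared))


# Toy data: in the synthetic rows, age is exactly 20 + 10 * tenure.
synth_rows = [{"tenure": t, "age": 20 + 10 * t} for t in (1, 2, 3, 4)]
leaky_orig = [{"tenure": t, "age": 20 + 10 * t} for t in (5, 6, 7)]
private_orig = [{"tenure": 5, "age": 90}, {"tenure": 6, "age": 80}, {"tenure": 7, "age": 70}]

leaky_score = attack_privacy_score(leaky_orig, synth_rows, "tenure", "age")
private_score = attack_privacy_score(private_orig, synth_rows, "tenure", "age")
```

In this sketch a low score means the model trained on synthetic data recovers the original sensitive values well, which matches the convention that 0 means negligible privacy.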

The AttributeInferenceAttackML class can be imported as follows:

In [1]: from synthesized.insight.metrics.privacy import AttributeInferenceAttackML

The following example shows how to use it step-by-step:

First, initialize AttributeInferenceAttackML with the name of the chosen model, the name of the sensitive column, and the list of predictor column names.

In [2]: predictors = ['RevolvingUtilizationOfUnsecuredLines', 'SeriousDlqin2yrs']

In [3]: sensitive_col = 'age'

In [4]: privacy_metrics = AttributeInferenceAttackML('Linear', sensitive_col, predictors)

Next, call the instance with the original dataset (here credit_df) and the synthetic dataset (here credit_df_synth) to compute the privacy score of the synthetic dataset.

In [5]: privacy_score = privacy_metrics(orig_df=credit_df, synth_df=credit_df_synth)

In [6]: print(privacy_score)
0.1158

Attribute Inference Attack using CAP

AttributeInferenceAttackCAP computes the privacy score using a Correct Attribution Probability (CAP) model, which measures the probability that an attribution is correct. Unlike the ML approach, it does not depend on the choice of an ML model or on its training.

For each row of the original dataset, it finds all the rows in the synthetic dataset that correspond to the row's predictors key (its values in the predictor columns) and collects the sensitive entries from those synthetic rows. The frequency with which the correct sensitive entry of the original row appears in this list of sensitive entries is used to compute the privacy score.
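The matching-and-counting procedure described above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the library's code: the function name `cap_privacy_score`, the toy columns, the choice to give unmatched original rows a probability of 0, and the final aggregation (one minus the mean probability) are all hypothetical.

```python
from collections import Counter, defaultdict


def cap_privacy_score(orig, synth, predictors, sensitive):
    """Hypothetical CAP-style score using exact matches on the predictors key."""
    # Index the synthetic sensitive values by predictors key.
    by_key = defaultdict(Counter)
    for row in synth:
        key = tuple(row[p] for p in predictors)
        by_key[key][row[sensitive]] += 1

    probs = []
    for row in orig:
        key = tuple(row[p] for p in predictors)
        counts = by_key.get(key)
        if not counts:
            # Assumption: with no matching synthetic row the attacker learns nothing.
            probs.append(0.0)
            continue
        # Share of matching synthetic rows that carry the correct sensitive value.
        probs.append(counts[row[sensitive]] / sum(counts.values()))

    # Assumption: privacy is one minus the average attribution probability.
    return 1.0 - sum(probs) / len(probs)


synth_rows = [
    {"age": 30, "late": 0, "default": 0},
    {"age": 30, "late": 0, "default": 0},
    {"age": 30, "late": 0, "default": 1},
    {"age": 45, "late": 1, "default": 1},
]
orig_rows = [
    {"age": 30, "late": 0, "default": 0},  # 2 of 3 matches are correct
    {"age": 45, "late": 1, "default": 1},  # the single match is correct
    {"age": 60, "late": 2, "default": 0},  # no match in the synthetic data
]

cap_score = cap_privacy_score(orig_rows, synth_rows, ["age", "late"], "default")
```

Here the per-row probabilities are 2/3, 1 and 0, so the sketch returns 1 - 5/9 = 4/9.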

The AttributeInferenceAttackCAP class can be imported as follows:

In [8]: from synthesized.insight.metrics.privacy import AttributeInferenceAttackCAP

There are two ways to select the rows of the synthetic dataset that correspond to a predictors key of the original dataset.

GeneralizedCAP

GeneralizedCAP finds all the rows in the synthetic dataset that exactly match the predictors key of the original dataset.

In [9]: predictors = ['NumberOfTime30-59DaysPastDueNotWorse', 'age']

In [10]: sensitive_col = 'SeriousDlqin2yrs'

In [11]: privacy_metrics = AttributeInferenceAttackCAP('GeneralizedCAP', sensitive_col, predictors)

In [12]: privacy_score = privacy_metrics(orig_df=credit_df, synth_df=credit_df_synth)

In [13]: print(privacy_score)
0.2398

DistanceCAP

DistanceCAP finds all the rows in the synthetic dataset that are the nearest neighbours (in terms of Hamming distance) of the predictors key of the original dataset.

In [15]: predictors = ['RevolvingUtilizationOfUnsecuredLines', 'age']

In [16]: sensitive_col = 'SeriousDlqin2yrs'

In [17]: privacy_metrics = AttributeInferenceAttackCAP('DistanceCAP', sensitive_col, predictors)

In [18]: privacy_score = privacy_metrics(orig_df=credit_df, synth_df=credit_df_synth)

In [19]: print(privacy_score)
0.2385
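To show what nearest-neighbour matching by Hamming distance means in practice, here is a small standard-library sketch. It is not the library's implementation: the names `hamming` and `distance_cap_privacy_score`, the toy columns, the tie handling (every synthetic row at the minimum distance counts as a neighbour), and the aggregation into one minus the mean probability are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length keys differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_cap_privacy_score(orig, synth, predictors, sensitive):
    """Hypothetical CAP-style score using Hamming nearest neighbours."""
    synth_keys = [tuple(row[p] for p in predictors) for row in synth]
    probs = []
    for row in orig:
        key = tuple(row[p] for p in predictors)
        dists = [hamming(key, k) for k in synth_keys]
        nearest = min(dists)
        # Assumption: all synthetic rows tied at the minimum distance are neighbours.
        neighbours = [s[sensitive] for s, d in zip(synth, dists) if d == nearest]
        probs.append(neighbours.count(row[sensitive]) / len(neighbours))
    # Assumption: privacy is one minus the average attribution probability.
    return 1.0 - sum(probs) / len(probs)


synth_rows = [
    {"age": 30, "late": 0, "default": 0},
    {"age": 30, "late": 0, "default": 0},
    {"age": 30, "late": 0, "default": 1},
    {"age": 45, "late": 1, "default": 1},
]
orig_rows = [
    {"age": 30, "late": 1, "default": 1},  # no exact match; all four rows are at distance 1
    {"age": 45, "late": 1, "default": 1},  # exact match at distance 0
]

distance_score = distance_cap_privacy_score(
    orig_rows, synth_rows, ["age", "late"], "default"
)
```

Unlike exact matching, every original row finds at least one neighbour here, so rows without an exact counterpart in the synthetic data still contribute to the score.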

Note

If the list of predictor column names is not provided when initializing either of the above classes, all the columns except the sensitive column are used as predictors.
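The default described in this note amounts to the following selection rule. The helper `resolve_predictors` is hypothetical and only mirrors the stated behaviour; it is not part of the library.

```python
def resolve_predictors(columns, sensitive_col, predictors=None):
    """Return the given predictors, or every column except the sensitive one."""
    if predictors is None:
        return [c for c in columns if c != sensitive_col]
    return list(predictors)


columns = ["RevolvingUtilizationOfUnsecuredLines", "age", "SeriousDlqin2yrs"]
default_predictors = resolve_predictors(columns, "age")
explicit_predictors = resolve_predictors(columns, "age", ["SeriousDlqin2yrs"])
```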