Differential Privacy#

Synthesized can incorporate differential privacy techniques into the process used to build the generative model of the data. This provides a mathematical guarantee of an individual's privacy in the original data and reduces the risk of data leakage in the synthesized data.

Specifically, Synthesized utilizes \((\epsilon, \delta)\)-differential privacy defined by

\[\mathrm{P}(\mathcal{M}(D) \in S) \leq e^{\epsilon} \, \mathrm{P}(\mathcal{M}(D') \in S) + \delta\]

where \(D\) is the original data, \(D'\) is the original data with an individual's row removed, \(\mathcal{M}\) is a stochastic function of the original data, \(S\) is any set of possible outputs of \(\mathcal{M}\), and \(\mathrm{P}\) is the corresponding probability distribution. Differential privacy provides an upper bound on the privacy loss of an individual: the probability of any output of the stochastic function \(\mathcal{M}\) changes by at most a factor of \(e^{\epsilon}\) between a dataset that includes the individual and one that excludes them. The parameter \(\delta\) relaxes this constraint slightly, allowing differential privacy to be used in a greater range of applications.
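To make the definition concrete, the following sketch implements the classic Laplace mechanism, a stochastic function \(\mathcal{M}\) that satisfies \((\epsilon, 0)\)-differential privacy for a counting query. This is a generic illustration of the definition above, not the mechanism Synthesized uses internally, which injects noise during model training:

import numpy as np

rng = np.random.default_rng(0)

def dp_count(ages, epsilon):
    # A counting query has sensitivity 1: removing one row changes the
    # true count by at most 1, so Laplace noise with scale 1/epsilon
    # bounds the privacy loss of any output by a factor of e**epsilon.
    true_count = int(np.sum(np.asarray(ages) > 40))
    return true_count + rng.laplace(scale=1.0 / epsilon)

D = [25, 41, 67, 38, 52]  # original data
D_prime = D[:-1]          # the same data with one individual's row removed

print(dp_count(D, epsilon=1.0), dp_count(D_prime, epsilon=1.0))

With \(\epsilon = 1\), the two noisy counts are statistically close: no single output reveals with confidence whether the removed individual was present.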

Note

There is an unavoidable trade-off between the privacy and utility of a dataset. Smaller values of \(\epsilon\) provide stronger privacy guarantees, but will inherently reduce the usefulness of the data. For example, \(\epsilon = 0.1\) guarantees that the probability of any output changes by at most a factor of \(e^{0.1} \approx 1.11\), whereas \(\epsilon = 5\) allows a factor of roughly 150.

Differential privacy can be enabled by setting differential_privacy = True in HighDimConfig:

In [1]: from synthesized.config import HighDimConfig

In [2]: config = HighDimConfig()

In [3]: config.differential_privacy = True

In addition, several parameters may need to be adjusted depending on the desired level of privacy and the size of the dataset being learned.

  • config.epsilon: Sets the target level of \(\epsilon\), with smaller values producing more private data. Training of the model is aborted once this value is reached. Note that the desired level of \(\epsilon\) may not always be attainable, as the achievable value depends strongly on the size of the dataset and the amount of noise added.

  • config.noise_multiplier: Controls the amount of noise added during training in order to achieve differential privacy. Values typically lie in the range 1.0 to 10.0. Higher values allow smaller values of \(\epsilon\) to be reached, and therefore stronger privacy, at the cost of lower data quality.

config.l2_norm_clip, config.num_microbatches, and config.delta can also be tuned, but it is recommended to keep them at their default values.
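As an illustration, a configuration targeting stricter privacy might be set up as follows (the specific values here are illustrative choices, not recommendations):

from synthesized.config import HighDimConfig

config = HighDimConfig()
config.differential_privacy = True

# target privacy budget: training is aborted once this epsilon is reached
config.epsilon = 0.5

# more noise allows a smaller epsilon to be reached, at the cost of data quality
config.noise_multiplier = 2.0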

The HighDimConfig can then be passed to HighDimSynthesizer, together with the metadata df_meta extracted from the original dataset, to train a Synthesizer with differential privacy guarantees.

In [4]: from synthesized import HighDimSynthesizer

In [5]: synthesizer = HighDimSynthesizer(df_meta, config=config)

In [6]: synthesizer.learn(...)

Warning

Enabling differential privacy may significantly slow down the learning process.

Once trained, the value of \(\epsilon\) reached by the Synthesizer for the particular dataset can be obtained with:

In [7]: synthesizer.epsilon
Out[7]: 0.8
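The achieved \(\epsilon\) can then be compared against the configured target before the synthetic data is released. A minimal sketch, assuming the usual synthesize(num_rows=...) method of HighDimSynthesizer, which is not covered in this section:

achieved = synthesizer.epsilon
print(f"Achieved epsilon: {achieved} (target: {config.epsilon})")

# hypothetical gate before releasing data: require the achieved budget
# to be within the configured target (synthesize(num_rows=...) is assumed here)
if achieved <= config.epsilon:
    df_synthetic = synthesizer.synthesize(num_rows=10_000)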