Differential Privacy
Synthesized can incorporate differential privacy techniques into the process used to build the generative model of the data. This provides a mathematical guarantee of an individual's privacy in the original data and reduces the risk of data leakage in the synthesized data.
Specifically, Synthesized utilizes (ε, δ)-differential privacy, defined by

P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S] + δ for all sets of outputs S,

where D is the original data, D′ is the original data with an individual's row removed, M is a stochastic function of the original data, and P is the corresponding probability distribution. Differential privacy provides an upper bound on the privacy loss of an individual by ensuring that the output distribution of the stochastic function differs by at most a factor of e^ε when applied to a dataset with the individual and a dataset excluding the individual. The parameter δ relaxes the constraint, allowing differential privacy to be used in a greater range of applications.
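As a concrete illustration of the definition (this is standalone example code, not part of the Synthesized API), the classic Gaussian mechanism below releases a counting query under (ε, δ)-differential privacy: removing one individual's row changes the count by at most 1 (the sensitivity), and Gaussian noise calibrated to ε and δ masks that difference. The function name and structure here are illustrative assumptions.

```python
import math
import random

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=random):
    """Release `value` with an (epsilon, delta)-DP guarantee.

    Uses the standard calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon,
    which is valid for epsilon <= 1.
    """
    sigma = math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon
    return value + rng.gauss(0.0, sigma)

# A counting query has sensitivity 1: removing one individual's row
# changes the count by at most 1.
data = [1] * 100                     # e.g. 100 individuals matching some predicate
true_count = sum(data)
private_count = gaussian_mechanism(true_count, sensitivity=1,
                                   epsilon=0.5, delta=1e-5)
```

Smaller ε (or δ) forces a larger noise standard deviation, so the released count is less accurate but more private.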
To use the differential privacy features of the SDK, synthesized must be installed with the differential privacy extras.
Differential privacy can be enabled by setting `differential_privacy = True` in `HighDimConfig`:

```python
from synthesized.config import HighDimConfig

config = HighDimConfig()
config.differential_privacy = True
```
In addition, there are several parameters that may need to be adjusted depending on the desired level of privacy as well as the size of the dataset being learned.

- `config.epsilon`: Sets the desired level of ε, with smaller values producing more private data. Training of the model is aborted if this value is reached. Note that it may not be possible to obtain the desired level of ε, as it depends strongly on the size of the dataset together with the amount of noise added.
- `config.delta`: Sets the value of δ, which bounds the probability of the privacy guarantee not holding. A rule of thumb is to set δ to less than the inverse of the size of the training dataset; the default value of δ is therefore `1/(10*len(dataset))`, which ensures this rule holds. It is recommended to leave this parameter at its default value.
- `config.noise_multiplier`: The amount of noise added so that differential privacy can be achieved. Values are typically in the range `1.0` to `10.0`. Higher values allow smaller values of ε to be reached, and therefore greater privacy, but lower data quality.
- `config.l2_norm_clip` and `config.num_microbatches` can also be tuned, but it is recommended to keep them at their default values.
```python
config.epsilon = 1.0
config.delta = 1 / (10 * len(df))
config.noise_multiplier = 1.0
config.num_microbatches = 1
config.l2_norm_clip = 1.0
```
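For background on how these parameters interact, differentially private training is typically performed with DP-SGD-style updates: each example's gradient is clipped to `l2_norm_clip`, the clipped gradients are summed, and Gaussian noise with standard deviation `noise_multiplier * l2_norm_clip` is added. The sketch below is a simplified illustration of that technique, not the Synthesized implementation.

```python
import math
import random

def dp_sgd_aggregate(per_example_grads, l2_norm_clip=1.0,
                     noise_multiplier=1.0, rng=random):
    """Clip each per-example gradient, sum, add noise, and average."""
    dim = len(per_example_grads[0])
    clipped_sum = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most l2_norm_clip.
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    # Noise is calibrated to the clipping bound, so each example's
    # influence on the update is masked.
    sigma = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, sigma) for s in clipped_sum]
    n = len(per_example_grads)
    return [v / n for v in noisy]
```

This makes the trade-off visible: a higher `noise_multiplier` means more noise per update (lower data quality) but a smaller ε for the same number of training steps.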
The `HighDimConfig` can then be passed to `HighDimSynthesizer` to train a Synthesizer with differential privacy guarantees.
```python
from synthesized import HighDimSynthesizer

synthesizer = HighDimSynthesizer(df_meta, config=config)
synthesizer.learn(...)
```
Note: Enabling differential privacy may slow down the learning process.
Once trained, the value of ε reached by the Synthesizer for that particular dataset and training run is stored in `synthesizer.epsilon`.

Note that due to the method used to learn the Synthesizer model, it is not possible to directly choose the desired ε. However, if the model is trained for a fixed number of iterations, ε can be calculated using the `get_privacy_budget` utility:
```python
from synthesized.common.differential_privacy import get_privacy_budget

epsilon = get_privacy_budget(noise_multiplier=1.2, steps=1000,
                             batch_size=128, data_size=10000)
```
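The inverse relationship between noise and ε can be illustrated with the analytic bound for a single Gaussian mechanism, ε = sqrt(2 ln(1.25/δ)) · Δ/σ. This is a deliberate simplification: `get_privacy_budget` accounts for the composition of many noisy training steps, which this one-shot formula does not. The helper below is illustrative only.

```python
import math

def gaussian_epsilon(sigma, delta, sensitivity=1.0):
    """epsilon achieved by a single Gaussian mechanism with noise stddev sigma.

    Inverts the calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon.
    """
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / sigma

delta = 1e-5
for sigma in (1.0, 2.0, 4.0):
    # Doubling sigma halves epsilon: more noise, more privacy.
    print(f"sigma={sigma:.1f} -> epsilon={gaussian_epsilon(sigma, delta):.2f}")
```

The same qualitative behaviour holds for the full training-time accountant: increasing `noise_multiplier` (and hence σ) drives the achievable ε down at the cost of data quality.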