Synthesized can incorporate differential privacy techniques into the process used to build the generative model of the data. This provides a mathematical guarantee of an individual’s privacy in the original data and reduces the risk of data leakage in the Synthesized data.
Specifically, Synthesized utilizes $(\varepsilon, \delta)$-differential privacy, defined by

$$\mathbb{P}[\mathcal{M}(D) \in S] \leq \exp(\varepsilon)\, \mathbb{P}[\mathcal{M}(D') \in S] + \delta$$

where $D$ is the original data, $D'$ is the original data with an individual’s row removed, $\mathcal{M}$ is a stochastic function of the original data, $\mathbb{P}$ is the corresponding probability distribution, and $S$ is any set of possible outputs. Differential privacy provides an upper bound on the privacy loss of an individual by ensuring that the output distribution of the stochastic function differs by at most a factor of $\exp(\varepsilon)$ when applied to a dataset with the individual and a dataset excluding the individual. The parameter $\delta$ relaxes the constraint, allowing differential privacy to be used in a greater range of applications.
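To make the inequality concrete, the following minimal sketch checks it numerically for a textbook mechanism: a counting query with Laplace noise of scale 1/ε, which satisfies the definition with δ = 0. This is an illustration only and is not the mechanism Synthesized uses; the function names `laplace_pdf` and `worst_case_ratio` are hypothetical.

```python
import math


def laplace_pdf(x, loc, scale):
    """Density of a Laplace distribution centred at `loc`."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def worst_case_ratio(count_with, count_without, epsilon, outputs):
    """Largest density ratio between the two neighbouring datasets.

    A counting query changes by at most 1 when one row is removed,
    so Laplace noise with scale 1/epsilon is enough for epsilon-DP.
    """
    scale = 1.0 / epsilon
    return max(
        laplace_pdf(y, count_with, scale) / laplace_pdf(y, count_without, scale)
        for y in outputs
    )


epsilon = 0.5
# Candidate noisy outputs from 90.0 to 110.0 in steps of 0.1.
outputs = [k / 10.0 for k in range(900, 1101)]
# The dataset with the individual has count 100, without them 99.
ratio = worst_case_ratio(100, 99, epsilon, outputs)
print(ratio, math.exp(epsilon))
```

The largest ratio never exceeds exp(ε), which is exactly the bound stated in the definition when δ = 0.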
Differential privacy can be enabled by setting differential_privacy = True in HighDimConfig:

    from synthesized.config import HighDimConfig

    config = HighDimConfig()
    config.differential_privacy = True
In addition, there are several parameters that will need to be adjusted depending on the desired level of privacy as well as the size of the dataset being learned.
config.epsilon: Sets the desired level of $\varepsilon$, with smaller values producing more private data. Training of the model is aborted if this value is reached. Note that it may not be possible to obtain the desired level of $\varepsilon$, as it depends strongly on the size of the dataset and the amount of noise added.
config.delta: Sets the value of $\delta$, which bounds the probability of the privacy guarantee not holding. A rule of thumb is to set $\delta$ to be less than the inverse of the size of the training dataset. The default value of $\delta$ is therefore 1/(10*len(dataset)) to ensure this rule holds. It is recommended to leave this parameter at its default value.
config.noise_multiplier: The amount of noise added to ensure differential privacy can be achieved. Values typically range up to 10.0. Higher values allow smaller values of $\varepsilon$ to be reached and therefore greater privacy, but lower data quality.
config.l2_norm_clip and config.num_microbatches can also be tuned, but it is recommended to keep them at their default values.
    config.epsilon = 1.0
    config.delta = 1/(10*len(df))
    config.noise_multiplier = 1.0
    config.num_microbatches = 1
    config.l2_norm_clip = 1.0
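The names noise_multiplier and l2_norm_clip match the parameters of differentially private stochastic gradient descent (DP-SGD). Assuming that is the underlying mechanism, which this page does not confirm, the sketch below shows how the two parameters typically interact in a single gradient step: each per-example gradient is clipped to an L2 norm of l2_norm_clip, and Gaussian noise with standard deviation noise_multiplier * l2_norm_clip is added to the sum. The helper names are hypothetical, and microbatching is ignored for simplicity.

```python
import math
import random


def clip_to_norm(grad, l2_norm_clip):
    """Scale `grad` down so that its L2 norm is at most `l2_norm_clip`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, l2_norm_clip) for g in per_example_grads]
    dim = len(per_example_grads[0])
    stddev = noise_multiplier * l2_norm_clip  # noise scales with the clip norm
    noisy_sum = [
        sum(g[i] for g in clipped) + rng.gauss(0.0, stddev)
        for i in range(dim)
    ]
    return [s / len(per_example_grads) for s in noisy_sum]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
noisy = private_mean_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1.0, rng=rng)
print(noisy)
```

Clipping bounds how much any single row can move the update, and the noise hides what remains of that influence. This is why a larger noise_multiplier gives more privacy at the cost of a less accurate model.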
HighDimConfig can then be passed to HighDimSynthesizer to train a Synthesizer with differential privacy guarantees.
    from synthesized import HighDimSynthesizer

    synthesizer = HighDimSynthesizer(df_meta, config=config)
    synthesizer.learn(...)
Note: Enabling differential privacy may slow down the learning process.
Once trained, the value of $\varepsilon$ reached by the Synthesizer for that particular dataset training run is
Note that due to the method used to learn the Synthesizer model, it is not possible to directly choose the desired $\varepsilon$. However, if the model is trained for a fixed number of iterations, $\varepsilon$ can be calculated with get_privacy_budget:
    from synthesized.common.differential_privacy import get_privacy_budget

    epsilon = get_privacy_budget(noise_multiplier=1.2, steps=1000, batch_size=128, data_size=10000)
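To build intuition for how the privacy budget depends on the noise level and the number of steps, here is a simplified sketch. It is not the implementation of get_privacy_budget, and the function name `gaussian_composition_epsilon` is hypothetical. It assumes full-batch training (batch_size equal to data_size), composes the Rényi differential privacy of the Gaussian mechanism over all steps, and converts the result to $(\varepsilon, \delta)$. Because it ignores the privacy amplification gained from sampling small batches, it reports much larger values than an accountant that models subsampling.

```python
import math


def gaussian_composition_epsilon(noise_multiplier, steps, delta, orders=None):
    """Upper bound on epsilon for `steps` full-batch Gaussian-noise updates."""
    if orders is None:
        # Renyi orders 1.1, 1.2, ..., 100.9 to search over.
        orders = [1.0 + k / 10.0 for k in range(1, 1000)]
    best = float("inf")
    for alpha in orders:
        # Renyi DP of one Gaussian step is alpha / (2 * sigma^2); it adds up over steps.
        rdp = steps * alpha / (2.0 * noise_multiplier ** 2)
        # Standard conversion from Renyi DP at order alpha to (epsilon, delta)-DP.
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


delta = 1.0 / (10 * 10000)  # rule of thumb for a dataset of 10000 rows
for sigma in (1.2, 4.0, 10.0):
    for steps in (10, 100):
        eps = gaussian_composition_epsilon(sigma, steps, delta)
        print(f"noise_multiplier={sigma}, steps={steps}: epsilon <= {eps:.2f}")
```

The qualitative behaviour matches the parameter descriptions above: the bound grows with the number of training steps and shrinks as noise_multiplier increases.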