Differential Privacy
Synthesized can incorporate differential privacy techniques into the process used to build the generative model of the data. This provides a mathematical guarantee of an individual's privacy in the original data and reduces the risk of data leakage in the synthesized data.
Specifically, Synthesized utilizes (ε, δ)-differential privacy, defined by

P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S] + δ for all sets of outputs S,

where D is the original data, D′ is the original data with an individual's row removed, M is a stochastic function of the original data, and P is the corresponding probability distribution. Differential privacy provides an upper bound on the privacy loss of an individual by ensuring that the output distribution of the stochastic function differs by at most a factor of e^ε when applied to a dataset with the individual and a dataset excluding the individual. The parameter δ relaxes the constraint, allowing differential privacy to be used in a greater range of applications.
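As a concrete illustration of the definition (this is standalone example code, not part of the Synthesized API), the classic Gaussian mechanism below releases a counting query under (ε, δ)-differential privacy: removing one individual's row changes the count by at most 1 (the sensitivity), and Gaussian noise calibrated to ε and δ masks that difference. The function name and structure here are illustrative assumptions.

```python
import math
import random

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=random):
    """Release `value` with an (epsilon, delta)-DP guarantee.

    Uses the standard calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon,
    which is valid for epsilon <= 1.
    """
    sigma = math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon
    return value + rng.gauss(0.0, sigma)

# A counting query has sensitivity 1: removing one individual's row
# changes the count by at most 1.
data = [1] * 100                     # e.g. 100 individuals matching some predicate
true_count = sum(data)
private_count = gaussian_mechanism(true_count, sensitivity=1,
                                   epsilon=0.5, delta=1e-5)
```

Smaller ε (or δ) forces a larger noise standard deviation, so the released count is less accurate but more private.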
To use the differential privacy features of the SDK, synthesized must be installed with the differential privacy extras.
Differential privacy can be enabled by setting `differential_privacy = True` in `HighDimConfig`:

```python
from synthesized.config import HighDimConfig

config = HighDimConfig()
config.differential_privacy = True
```
In addition, there are several parameters that may need to be adjusted depending on the desired level of privacy as well as the size of the dataset being learned.

- `config.epsilon`: Sets the desired level of ε, with smaller values producing more private data. Training of the model is aborted if this value is reached. Note that it may not be possible to obtain the desired level of ε, as it depends strongly on the size of the dataset together with the amount of noise added.
- `config.delta`: Sets the value of δ, which bounds the probability of the privacy guarantee not holding. A rule of thumb is to set δ to less than the inverse of the size of the training dataset; the default value of δ is therefore `1/(10*len(dataset))`, which ensures this rule holds. It is recommended to leave this parameter at its default value.
- `config.noise_multiplier`: The amount of noise added so that differential privacy can be achieved. Values are typically in the range `1.0` to `10.0`. Higher values allow smaller values of ε to be reached, and therefore greater privacy, but lower data quality.
- `config.l2_norm_clip` and `config.num_microbatches` can also be tuned, but it is recommended to keep them at their default values.
```python
config.epsilon = 1.0
config.delta = 1 / (10 * len(df))
config.noise_multiplier = 1.0
config.num_microbatches = 1
config.l2_norm_clip = 1.0
```
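For background on how these parameters interact, differentially private training is typically performed with DP-SGD-style updates: each example's gradient is clipped to `l2_norm_clip`, the clipped gradients are summed, and Gaussian noise with standard deviation `noise_multiplier * l2_norm_clip` is added. The sketch below is a simplified illustration of that technique, not the Synthesized implementation.

```python
import math
import random

def dp_sgd_aggregate(per_example_grads, l2_norm_clip=1.0,
                     noise_multiplier=1.0, rng=random):
    """Clip each per-example gradient, sum, add noise, and average."""
    dim = len(per_example_grads[0])
    clipped_sum = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most l2_norm_clip.
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    # Noise is calibrated to the clipping bound, so each example's
    # influence on the update is masked.
    sigma = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, sigma) for s in clipped_sum]
    n = len(per_example_grads)
    return [v / n for v in noisy]
```

This makes the trade-off visible: a higher `noise_multiplier` means more noise per update (lower data quality) but a smaller ε for the same number of training steps.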
The `HighDimConfig` can then be passed to `HighDimSynthesizer` to train a Synthesizer with differential privacy guarantees.
```python
from synthesized import HighDimSynthesizer

synthesizer = HighDimSynthesizer(df_meta, config=config)
synthesizer.learn(...)
```
Note: Enabling differential privacy may slow down the learning process.
Once trained, the value of ε reached by the Synthesizer for that particular dataset and training run is stored in `synthesizer.epsilon`.

Note that due to the method used to learn the Synthesizer model, it is not possible to directly choose the desired ε. However, if the model is trained for a fixed number of iterations, ε can be calculated using the `get_privacy_budget` utility:
```python
from synthesized.common.differential_privacy import get_privacy_budget

epsilon = get_privacy_budget(noise_multiplier=1.2, steps=1000,
                             batch_size=128, data_size=10000)
```
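The inverse relationship between noise and ε can be illustrated with the analytic bound for a single Gaussian mechanism, ε = sqrt(2 ln(1.25/δ)) · Δ/σ. This is a deliberate simplification: `get_privacy_budget` accounts for the composition of many noisy training steps, which this one-shot formula does not. The helper below is illustrative only.

```python
import math

def gaussian_epsilon(sigma, delta, sensitivity=1.0):
    """epsilon achieved by a single Gaussian mechanism with noise stddev sigma.

    Inverts the calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon.
    """
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / sigma

delta = 1e-5
for sigma in (1.0, 2.0, 4.0):
    # Doubling sigma halves epsilon: more noise, more privacy.
    print(f"sigma={sigma:.1f} -> epsilon={gaussian_epsilon(sigma, delta):.2f}")
```

The same qualitative behaviour holds for the full training-time accountant: increasing `noise_multiplier` (and hence σ) drives the achievable ε down at the cost of data quality.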