Strict Synthesis
The Sanitizer class
With the SDK, synthetic data is created in such a way that no output row depends on any input row.
This provides a strong line of defence against privacy attacks which try to use known columns of the data to infer
sensitive columns. However, a synthetic row can still match an original row purely by chance. To guard completely
against this possibility, we provide the Sanitizer class.
from synthesized.privacy import Sanitizer
sanitizer = Sanitizer(synthesizer, df_orig=df) (1)
df_synth = sanitizer.synthesize(100)
(1) The Sanitizer class is a wrapper around a trained synthesizer. It provides a method to sanitize the synthetic data.
The Sanitizer works by checking rows of synthetic data against the original data and replacing any matches by
synthesizing additional rows. This slightly distorts the generated data, causing a drop in realism (although this
will likely be minimal in most cases).
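The matching step described above can be illustrated with a minimal sketch in plain Python. It is not the SDK's implementation: the helper `drop_matches` and the tuple-based row representation are assumptions made for this example, and it only compares rows for exact equality.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]


def drop_matches(synthetic: Iterable[Row], original: Iterable[Row]) -> List[Row]:
    """Return the synthetic rows that do not appear in the original data.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    seen: Set[Row] = set(original)
    return [row for row in synthetic if row not in seen]


original = [("alice", 34), ("bob", 51)]
synthetic = [("carol", 29), ("bob", 51), ("dave", 42)]

clean = drop_matches(synthetic, original)
# Number of rows that would have to be replaced by synthesizing again.
missing = len(synthetic) - len(clean)
print(clean, missing)
```

Here one synthetic row coincides with an original row, so it is dropped and one replacement row is needed.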
Binning Continuous Values
The Sanitizer bins continuous values to check whether they match one another. The width of these bins is controlled
by the distance_step argument: the Sanitizer takes the bin width to be distance_step multiplied by the total range of
the data in that column (excluding some outliers). The default value, 1e-03, splits the data into
1000 = 1 / 1e-03 equally sized bins.
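To make the arithmetic concrete, the sketch below computes a bin index from distance_step and the column range, and treats two values as matching when they fall in the same bin. The helpers `bin_index` and `same_bin` are hypothetical names for this example, and the sketch uses the plain minimum and maximum of the column rather than excluding outliers, so it only approximates the behaviour described above.

```python
import math
from typing import Sequence


def bin_index(value: float, column: Sequence[float], distance_step: float = 1e-3) -> int:
    """Index of the bin containing `value` (illustrative helper, not SDK code)."""
    lo, hi = min(column), max(column)
    # Bin width = distance_step * range of the column (no outlier handling here).
    width = distance_step * (hi - lo)
    if width == 0:
        return 0  # constant column: every value shares a single bin
    return math.floor((value - lo) / width)


def same_bin(a: float, b: float, column: Sequence[float], distance_step: float = 1e-3) -> bool:
    """Two continuous values count as a match when they share a bin."""
    return bin_index(a, column, distance_step) == bin_index(b, column, distance_step)


incomes = [0.0, 250.0, 1000.0]  # range is 1000, so the default bin width is 1.0

print(same_bin(250.2, 250.9, incomes))  # both values lie in the bin [250, 251)
print(same_bin(250.2, 251.1, incomes))  # the values lie in neighbouring bins
```

A larger distance_step gives wider bins, so more near-identical values are treated as matches.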
Changing the Number of Synthesis Attempts
By default, the Sanitizer will attempt to synthesize the data 3 times before giving up. For cases where overlaps are
uncommon, this should be enough to generate the number of rows required. If instead there are many overlaps and the
Sanitizer doesn’t generate enough rows, this setting can be changed: increasing the max_synthesis_attempts argument
makes the Sanitizer attempt more synthesis steps to find different rows.
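The effect of the attempt limit can be sketched with a small retry loop. This is an illustration under stated assumptions, not the SDK's code: `synthesize_strict` and `make_generator` are hypothetical, the generator is a deterministic stand-in for a trained synthesizer, and only the parameter name max_synthesis_attempts comes from the text above.

```python
from itertools import count
from typing import Callable, List, Set, Tuple

Row = Tuple[int, ...]


def synthesize_strict(
    generate: Callable[[int], List[Row]],
    original: Set[Row],
    num_rows: int,
    max_synthesis_attempts: int = 3,
) -> List[Row]:
    """Collect non-matching rows, retrying up to `max_synthesis_attempts` times."""
    kept: List[Row] = []
    for _ in range(max_synthesis_attempts):
        needed = num_rows - len(kept)
        if needed <= 0:
            break  # enough rows collected
        batch = generate(needed)
        kept.extend(row for row in batch if row not in original)
    return kept


def make_generator() -> Callable[[int], List[Row]]:
    """Deterministic stand-in for a synthesizer: cycles through the values 0..3."""
    counter = count()

    def generate(n: int) -> List[Row]:
        return [(next(counter) % 4,) for _ in range(n)]

    return generate


original = {(0,), (1,)}  # half of all generated rows overlap with the original

few = synthesize_strict(make_generator(), original, num_rows=8)
more = synthesize_strict(make_generator(), original, num_rows=8, max_synthesis_attempts=5)
print(len(few), len(more))
```

With heavy overlap, the default of three attempts returns fewer rows than requested in this toy setup, while a higher limit fills the request.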