With the SDK, synthetic data is created in such a way that no output row depends on any input row.
This provides a strong line of defence against privacy attacks which try to use known columns of the data to infer
sensitive columns. However, a synthetic row can occasionally match an original row purely by chance. To guard
completely against this possibility, we provide the Sanitizer class:
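    from synthesized.privacy import Sanitizer

    # `synthesizer` is an existing synthesizer object and `df` is the original data
    sanitizer = Sanitizer(synthesizer, df_orig=df)
    # Request 100 synthetic rows, none of which match a row of `df`
    df_synth = sanitizer.synthesize(100)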
Sanitizer works by checking rows of synthetic data against the original data and replacing any matches with newly
synthesized rows. This slightly distorts the generated data and reduces its realism, although the effect will likely
be minimal in most cases.
Sanitizer will bin continuous values to check whether they match one another. The width of these bins is given by
the distance_step argument. The Sanitizer will take the bin width to be distance_step multiplied by the total range
of the data in that column (excluding some outliers). The default value 1e-03 will split the data up into
1000 = 1 / 1e-03 equally sized bins.
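The binning rule can be illustrated with a minimal pure-Python sketch. It is not the SDK's implementation: the helper
names bin_width, bin_index and values_match are hypothetical, and the sketch uses the full min-max range of the
column rather than excluding outliers as the Sanitizer does.

```python
import math


def bin_width(reference, distance_step=1e-3):
    """Bin width = distance_step * range of the column (assumption: no outlier removal)."""
    return distance_step * (max(reference) - min(reference))


def bin_index(value, reference, distance_step=1e-3):
    """Integer index of the bin that `value` falls into."""
    width = bin_width(reference, distance_step)
    if width == 0:
        # Constant column: every value lands in the same single bin.
        return 0
    return math.floor((value - min(reference)) / width)


def values_match(a, b, reference, distance_step=1e-3):
    """Two continuous values are treated as matching when they share a bin."""
    return bin_index(a, reference, distance_step) == bin_index(b, reference, distance_step)


column = [0.0, 2.3, 2.7, 3.1, 10.0]
# With distance_step=0.1 the range 10.0 is split into 10 bins of width 1.0.
print(values_match(2.3, 2.7, column, distance_step=0.1))  # True: both fall in bin 2
print(values_match(2.9, 3.1, column, distance_step=0.1))  # False: bins 2 and 3
```

In this sketch, binning is not the same as a distance threshold: two values that are close together but lie on
opposite sides of a bin edge are not treated as a match.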
By default, the Sanitizer will attempt to synthesize the data 3 times before giving up. When overlaps are uncommon,
this should be enough to generate the required number of rows. If there are many overlaps and the Sanitizer does not
generate enough rows, increase the max_synthesis_attempts argument so that the Sanitizer makes more synthesis
attempts to find different rows.
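The retry behaviour can be sketched as a simple loop. This is an illustration under stated assumptions, not the
Sanitizer's code: synthesize_without_matches, make_generator and the generate callback are hypothetical names, rows
are plain tuples compared for exact equality, and each attempt is assumed to request only the rows still missing.

```python
from itertools import count


def synthesize_without_matches(generate, original_rows, num_rows, max_synthesis_attempts=3):
    """Collect up to num_rows generated rows that do not appear in original_rows."""
    original = set(original_rows)
    kept = []
    for _ in range(max_synthesis_attempts):
        missing = num_rows - len(kept)
        if missing <= 0:
            break  # enough non-matching rows collected
        # Discard any candidate that matches an original row.
        kept.extend(row for row in generate(missing) if row not in original)
    # May be shorter than num_rows if the attempts run out.
    return kept[:num_rows]


def make_generator():
    """Toy stand-in for a synthesizer: yields (0,), (1,), (2,), ... across calls."""
    counter = count()
    return lambda n: [(next(counter),) for _ in range(n)]


original_rows = [(0,), (1,), (2,)]

# One attempt: three of the five candidates match the original and are dropped.
few = synthesize_without_matches(make_generator(), original_rows, 5, max_synthesis_attempts=1)
print(len(few))  # 2

# Three attempts: the second attempt replaces the dropped rows.
full = synthesize_without_matches(make_generator(), original_rows, 5, max_synthesis_attempts=3)
print(len(full))  # 5
```

The first call shows why too few rows can come back when overlaps are frequent, and the second shows how allowing
more attempts makes up the shortfall.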