Strict Synthesis

The Sanitizer class

In the Synthesized package, synthetic data is generated in such a way that no output row depends on any input row. This provides a strong line of defence against privacy attacks that use known columns of the data to infer sensitive columns. However, a synthetic row can still match an original row purely by chance. To guard against this possibility completely, we provide the Sanitizer class.

In [1]: from synthesized.privacy import Sanitizer

# augment a trained synthesizer
In [2]: sanitizer = Sanitizer(synthesizer, df_orig=df)

In [3]: df_synth = sanitizer.synthesize(100)

The Sanitizer works by checking rows of synthetic data against the original data and replacing any matches with newly synthesized rows. This slightly distorts the generated data and causes a drop in realism, though the effect will likely be minimal in most cases.
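The matching step can be illustrated with a minimal sketch in plain Python. This is not the Sanitizer's actual implementation: the helper drop_matches and the tuple-based row representation are assumptions made for illustration only, and only exact matches are considered.

```python
def drop_matches(synthetic_rows, original_rows):
    """Return the synthetic rows that do not appear verbatim in the original data.

    Hypothetical helper for illustration; rows are modelled as hashable tuples.
    """
    original = set(original_rows)
    return [row for row in synthetic_rows if row not in original]


original_rows = [("alice", 34), ("bob", 51)]
synthetic_rows = [("carol", 29), ("bob", 51), ("dave", 42)]

clean_rows = drop_matches(synthetic_rows, original_rows)
print(clean_rows)  # [('carol', 29), ('dave', 42)]
```

Here the row ("bob", 51) is removed because it also appears in the original data. In the Sanitizer, removed rows are then replaced by synthesizing additional rows.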

Binning Continuous Values

The Sanitizer bins continuous values to check whether they match one another. The width of these bins is controlled by the distance_step argument: the bin width is distance_step multiplied by the total range of the data in that column (excluding some outliers). The default value of 1e-03 splits the data into 1 / 1e-03 = 1000 equally sized bins.
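The bin arithmetic can be worked through with a short sketch. The functions bin_index and values_match are hypothetical names, not part of the Synthesized API, and the sketch uses the full column range without the outlier handling mentioned above.

```python
import math


def bin_index(value, col_min, col_max, distance_step=1e-3):
    """Index of the bin that `value` falls into (illustrative only)."""
    # Bin width is distance_step times the column range.
    width = distance_step * (col_max - col_min)
    return math.floor((value - col_min) / width)


def values_match(a, b, col_min, col_max, distance_step=1e-3):
    """Two continuous values are treated as matching if they share a bin."""
    return (bin_index(a, col_min, col_max, distance_step)
            == bin_index(b, col_min, col_max, distance_step))


# A column ranging from 0 to 100 with the default step has bins of width 0.1.
print(values_match(25.03, 25.07, 0.0, 100.0))  # True: both fall in bin 250
print(values_match(25.03, 25.13, 0.0, 100.0))  # False: bins 250 and 251

# A larger distance_step widens the bins, so more values count as matches.
print(values_match(25.03, 25.13, 0.0, 100.0, distance_step=1e-2))  # True
```

Under these assumptions, a smaller distance_step makes the match check stricter about what counts as the same value, while a larger one flags more synthetic rows as overlapping with the original data.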

Changing the Number of Synthesis Attempts

By default, the Sanitizer will attempt to synthesize the data 3 times before giving up. When overlaps are uncommon, this should be enough to generate the required number of rows. If instead there are many overlaps and the Sanitizer doesn't generate enough rows, increase the max_synthesis_attempts argument so that the Sanitizer makes more synthesis attempts to find different rows.
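The effect of the attempt limit can be sketched with a simple retry loop. This is an assumption about the general shape of the procedure rather than the library's code: synthesize_without_matches and make_generator are hypothetical, and the toy generator deliberately produces many overlaps.

```python
import itertools


def synthesize_without_matches(generate, original_rows, num_rows,
                               max_synthesis_attempts=3):
    """Collect up to num_rows generated rows that are absent from the original data.

    Hypothetical sketch: `generate(n)` is any callable returning n candidate rows.
    """
    original = set(original_rows)
    kept = []
    for _ in range(max_synthesis_attempts):
        missing = num_rows - len(kept)
        if missing == 0:
            break
        # Only ask for as many rows as are still needed.
        candidates = generate(missing)
        kept.extend(row for row in candidates if row not in original)
    return kept


def make_generator():
    """Toy deterministic generator cycling through the rows (0,), (1,), ..., (4,)."""
    stream = itertools.cycle(range(5))
    return lambda n: [(next(stream),) for _ in range(n)]


original_rows = [(0,), (1,), (2,)]  # three of the five possible rows overlap

few = synthesize_without_matches(make_generator(), original_rows, 4)
more = synthesize_without_matches(make_generator(), original_rows, 4,
                                  max_synthesis_attempts=5)
print(len(few))   # 3: the attempts run out before 4 rows are found
print(len(more))  # 4: extra attempts fill the remaining row
```

With heavy overlap, three attempts return fewer rows than requested, while a higher limit lets the loop continue until the target is reached.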