Sanitizer
synthesized.privacy.Sanitizer
- class Sanitizer(synthesizer, df_orig, distance_step=1e-3, max_synthesis_attempts=3)
Generates synthetic data without any original row present in the output.
Floats are uniformly binned for comparison using the distance_step parameter, e.g. for a value of 1e-03, the column is binned into 1000 (= 1 / 1e-03) bins. The calculated bins ignore outliers.
Examples:
>>> df_meta = MetaExtractor.extract(df)
>>> synthesizer = HighDimSynthesizer(df_meta)
>>> synthesizer.learn(df, num_iterations=10, hide_progress=True)
>>> sanitizer = Sanitizer(synthesizer, df_orig=df, max_synthesis_attempts=20)
>>> df_synth = sanitizer.synthesize(100)
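The binning arithmetic described above can be sketched in a few lines of plain Python. This is only an illustration of uniform binning under stated assumptions, not the library's code: bin_index, lo and hi are hypothetical names, the column range is passed in by hand, and the outlier handling mentioned above is left out.

```python
import math


def bin_index(value: float, lo: float, hi: float, distance_step: float = 1e-3) -> int:
    """Return the uniform bin a value falls into (hypothetical helper)."""
    n_bins = round(1 / distance_step)  # 1e-3 gives 1000 bins
    width = (hi - lo) / n_bins         # every bin has the same width
    idx = math.floor((value - lo) / width)
    # Clamp so values at or beyond the range edges land in the first or last bin.
    return max(0, min(idx, n_bins - 1))


# Two floats that fall into the same bin compare as equal after binning.
same_bin = bin_index(0.2501, 0.0, 1.0) == bin_index(0.2504, 0.0, 1.0)
```

A smaller distance_step gives more, narrower bins, so fewer synthetic values are treated as equal to an original value.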
Methods
find_unique(df_synth, df_orig, distances[, ...]) – Finds the rows in df_synth that do not appear in df_orig and returns a boolean mask for these rows.
learn(df_train, num_iterations[, callback, ...]) – Train the generative model for the given iterations.
synthesize(num_rows[, produce_nans, ...]) – Generate the given number of new data rows, ensuring no matches to the original data.
- static find_unique(df_synth, df_orig, distances, n_col_intersect=None, skip_categorical=False, df_meta=None)
Finds the rows in df_synth that do not appear in df_orig and returns a boolean mask for these rows.
- Parameters
df_synth (DataFrame) – Synthetic data to find unique rows from.
df_orig (DataFrame) – Original data to compare against.
distances (Dict[str, Optional[float]]) – Size of the bins to use for each continuous column; a None value denotes a categorical column.
n_col_intersect (int) – Number of columns that must match before a synthetic row is treated as a copy of an original row. By default, all columns must match.
skip_categorical (bool) – Whether to skip matches that involve only categorical columns. Defaults to False.
df_meta (Optional[DataFrameMeta]) – DataFrameMeta to inform the binning process. Optional.
- Return type
Index
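To make the role of the distances argument concrete, here is a minimal sketch of the comparison in plain Python. It is an assumption-laden illustration rather than the library's implementation: rows are plain dicts instead of DataFrame rows, row_key and unique_mask are hypothetical names, and all columns must match, which mirrors the default described for n_col_intersect.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def row_key(row: Row, distances: Dict[str, Optional[float]]) -> tuple:
    """Reduce a row to a comparable key (hypothetical helper)."""
    key = []
    for col, step in distances.items():
        value = row[col]
        # None marks a categorical column: compare the raw value.
        # Otherwise compare the index of the bin of width `step`.
        key.append(value if step is None else int(value // step))
    return tuple(key)


def unique_mask(
    synth: List[Row], orig: List[Row], distances: Dict[str, Optional[float]]
) -> List[bool]:
    """True where a synthetic row matches no original row on every column."""
    seen = {row_key(r, distances) for r in orig}
    return [row_key(r, distances) not in seen for r in synth]


orig = [{"age": 30.2, "city": "Paris"}]
synth = [
    {"age": 30.4, "city": "Paris"},  # same age bin, same city: a match
    {"age": 30.4, "city": "Rome"},   # city differs
    {"age": 45.0, "city": "Paris"},  # age bin differs
]
mask = unique_mask(synth, orig, {"age": 0.5, "city": None})
```

Only the first synthetic row is flagged as non-unique here, because it agrees with the original row in every column once the age values are binned.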
- learn(df_train, num_iterations, callback=None, callback_freq=0)
Train the generative model for the given iterations.
Repeated calls continue training the model, possibly on different data.
- Parameters
df_train (DataFrame) – The training data.
num_iterations (Optional[int]) – The number of training iterations (not epochs).
callback (Optional[Callable[[object, int, dict], bool]]) – A callback function, e.g. for logging purposes. Takes the synthesizer instance, the iteration number, and a dictionary of values (usually the losses) as arguments. Aborts training if the return value is True.
callback_freq (int) – Callback frequency.
- Return type
None
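A callback with the documented shape (synthesizer instance, iteration number, dictionary of values, returning a bool) might look like the sketch below. The factory name make_early_stop_callback and the "loss" key are assumptions made for illustration; the keys actually present in the values dictionary are not specified on this page.

```python
from typing import Callable


def make_early_stop_callback(
    loss_key: str, threshold: float
) -> Callable[[object, int, dict], bool]:
    """Build a callback that logs a value and stops training below a threshold."""

    def callback(synthesizer: object, iteration: int, values: dict) -> bool:
        loss = values.get(loss_key)  # the key name is an assumption
        print(f"iteration {iteration}: {loss_key}={loss}")
        # Returning True aborts training, as described for the callback parameter.
        return loss is not None and loss < threshold

    return callback


stop_early = make_early_stop_callback("loss", threshold=0.1)
# Hypothetical usage with a Sanitizer instance:
# sanitizer.learn(df, num_iterations=1000, callback=stop_early, callback_freq=100)
```

Returning False when the key is missing keeps training running instead of aborting on an unexpected dictionary layout.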
- synthesize(num_rows, produce_nans=False, progress_callback=None, n_col_intersect=None, skip_categorical=False)
Generate the given number of new data rows, ensuring no matches to the original data.
- Parameters
num_rows (int) – Number of rows to generate.
produce_nans (bool, optional) – Generate NaN values. Defaults to False.
progress_callback (Callable, optional) – Progress bar callback. Defaults to None.
skip_categorical (bool) – Whether to skip matches that involve only categorical columns. Defaults to False.
- Return type
DataFrame
- Returns
The generated data.
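One way to read the combination of synthesize and the constructor's max_synthesis_attempts is as a generate, filter and retry loop. The sketch below shows that pattern in plain Python under that assumption only; synthesize_unseen, generate and is_unique are hypothetical names, and this page does not say what the library does once the attempts are exhausted.

```python
from typing import Callable, List


def synthesize_unseen(
    generate: Callable[[int], List[object]],
    is_unique: Callable[[object], bool],
    num_rows: int,
    max_attempts: int = 3,
) -> List[object]:
    """Collect rows that pass is_unique, retrying up to max_attempts times."""
    kept: List[object] = []
    for _ in range(max_attempts):
        # Only ask for as many rows as are still missing.
        batch = generate(num_rows - len(kept))
        kept.extend(row for row in batch if is_unique(row))
        if len(kept) >= num_rows:
            break
    # This sketch may return fewer rows than requested if every attempt falls short.
    return kept[:num_rows]
```

A higher max_attempts makes it more likely that the requested number of rows is reached when many generated rows collide with the original data.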