Sanitizer

synthesized.privacy.Sanitizer

class Sanitizer(synthesizer, df_orig, distance_step=1e-3, max_synthesis_attempts=3)

Generates synthetic data without any original row present in the output.

Floats are uniformly binned for comparison using the distance_step parameter. For example, with a value of 1e-3, the column is binned into 1000 (= 1 / 1e-3) bins. Outliers are ignored when the bins are calculated.
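The binning described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name bin_floats is hypothetical, and ignoring outliers is modelled here, as an assumption, by taking the bin range from percentiles of the column.

```python
import numpy as np


def bin_floats(values, distance_step=1e-3, lower_pct=1.0, upper_pct=99.0):
    """Map each float to one of round(1 / distance_step) uniform bins."""
    values = np.asarray(values, dtype=float)
    n_bins = int(round(1 / distance_step))
    # Assumption: outliers are ignored by deriving the range from percentiles.
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    if hi == lo:
        # Degenerate column: every value falls into the same bin.
        return np.zeros(values.shape, dtype=int)
    scaled = (values - lo) / (hi - lo)
    idx = np.floor(scaled * n_bins).astype(int)
    # Values outside the percentile range are clipped into the edge bins.
    return np.clip(idx, 0, n_bins - 1)


# Two values closer than one bin width can land in the same bin.
bins = bin_floats([0.0, 0.50001, 0.50002, 1.0], lower_pct=0, upper_pct=100)
```

Under this sketch, two floats that fall into the same bin are treated as equal for comparison, so a smaller distance_step makes the comparison stricter.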

Examples:

>>> from synthesized import HighDimSynthesizer, MetaExtractor
>>> from synthesized.privacy import Sanitizer
>>> df_meta = MetaExtractor.extract(df)
>>> synthesizer = HighDimSynthesizer(df_meta)
>>> synthesizer.learn(df, num_iterations=10, hide_progress=True)
>>> sanitizer = Sanitizer(synthesizer, df_orig=df, max_synthesis_attempts=20)
>>> df_synth = sanitizer.synthesize(100)

Methods

find_unique(df_synth, df_orig, distances[, ...])

Finds the rows in df_synth that do not appear in df_orig and returns a boolean mask for these rows.

learn(df_train, num_iterations[, callback, ...])

Train the generative model for the given iterations.

synthesize(num_rows[, produce_nans, ...])

Generate the given number of new data rows, ensuring that none match the original data.

static find_unique(df_synth, df_orig, distances, n_col_intersect=None, skip_categorical=False, df_meta=None)

Finds the rows in df_synth that do not appear in df_orig and returns a boolean mask for these rows.

Parameters
  • df_synth (DataFrame) – Synthetic data in which to find unique rows.

  • df_orig (DataFrame) – Original data to compare against.

  • distances (Dict[str, Optional[float]]) – Size of the bins to use for each continuous column. A None value denotes a categorical column.

  • n_col_intersect (int, optional) – Number of columns that must match before two rows are considered the same. By default, all columns must match.

  • skip_categorical (bool) – Whether to disallow matches of only categorical columns. Defaults to False.

  • df_meta (Optional[DataFrameMeta]) – DataFrameMeta used to inform the binning process. Optional.

Return type

Index
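As a rough illustration of the idea behind find_unique, the sketch below builds a boolean mask with plain pandas. It is an assumption-laden simplification, not the library's code: the helper name rows_not_in_original is hypothetical, rows are compared for exact equality on all columns, and the distances, n_col_intersect, skip_categorical and df_meta options are not modelled.

```python
import pandas as pd


def rows_not_in_original(df_synth, df_orig):
    """Boolean mask: True where a df_synth row has no exact match in df_orig."""
    cols = list(df_synth.columns)
    # De-duplicate the right side so the left merge keeps one row per synthetic row.
    merged = df_synth.merge(
        df_orig[cols].drop_duplicates(), on=cols, how="left", indicator=True
    )
    # "left_only" marks rows that exist in df_synth but not in df_orig.
    is_unique = (merged["_merge"] == "left_only").to_numpy()
    return pd.Series(is_unique, index=df_synth.index)


df_orig = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df_synth = pd.DataFrame({"a": [1, 4, 3], "b": ["x", "y", "q"]})
mask = rows_not_in_original(df_synth, df_orig)
df_unique = df_synth[mask]
```

In this toy case only the first synthetic row, (1, "x"), also appears in the original data, so it is the only row masked out.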

learn(df_train, num_iterations, callback=None, callback_freq=0)

Train the generative model for the given iterations.

Repeated calls continue training the model, possibly on different data.

Parameters
  • df_train (DataFrame) – The training data.

  • num_iterations (Optional[int]) – The number of training iterations (not epochs).

  • callback (Optional[Callable[[object, int, dict], bool]]) – A callback function, e.g. for logging purposes. Takes the synthesizer instance, the iteration number, and a dictionary of values (usually the losses) as arguments. Aborts training if the return value is True.

  • callback_freq (int) – Callback frequency.

Return type

None
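A callback with the signature described above might look like the following sketch. The loss key "total_loss", the threshold, and the run_training stand-in loop are all hypothetical; in particular, the loop assumes the callback is invoked every callback_freq iterations, which the parameter description does not state explicitly.

```python
def stop_when_loss_is_low(synthesizer, iteration, losses, threshold=0.03):
    """Return True to abort training once the (hypothetical) total loss is small."""
    total = losses.get("total_loss")
    return total is not None and total < threshold


def run_training(callback, num_iterations, callback_freq):
    """Stand-in for a training loop, used only to show when the callback fires."""
    for iteration in range(1, num_iterations + 1):
        losses = {"total_loss": 1.0 / iteration}  # fake, steadily shrinking loss
        # Assumption: a frequency of 0 disables the callback entirely.
        if callback_freq > 0 and iteration % callback_freq == 0:
            if callback(None, iteration, losses):
                return iteration  # training aborted early
    return num_iterations


stopped_at = run_training(stop_when_loss_is_low, num_iterations=100, callback_freq=5)
```

With a real Sanitizer instance, the same function would be passed as learn(df_train, num_iterations, callback=stop_when_loss_is_low, callback_freq=5).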

synthesize(num_rows, produce_nans=False, progress_callback=None, n_col_intersect=None, skip_categorical=False)

Generate the given number of new data rows, ensuring that none match the original data.

Parameters
  • num_rows (int) – Number of rows to generate.

  • produce_nans (bool, optional) – Generate NaN values. Defaults to False.

  • progress_callback (Callable, optional) – Progress bar callback. Defaults to None.

  • skip_categorical (bool) – Whether to disallow matches of only categorical columns. Defaults to False.

Return type

DataFrame

Returns

The generated data.
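One plausible reading of how synthesize and max_synthesis_attempts interact is a generate, filter and retry loop. The sketch below illustrates that reading only. It is an assumption, not the documented behaviour: synthesize_without_originals and the generate argument are hypothetical names, and rows are compared by exact equality rather than by binned distance.

```python
import itertools

import pandas as pd


def synthesize_without_originals(generate, df_orig, num_rows, max_synthesis_attempts=3):
    """Call generate(n) repeatedly, dropping rows that also occur in df_orig."""
    cols = list(df_orig.columns)
    originals = set(df_orig[cols].itertuples(index=False, name=None))
    kept, n_kept = [], 0
    for _ in range(max_synthesis_attempts):
        if n_kept >= num_rows:
            break
        batch = generate(num_rows - n_kept)
        # Keep only the rows that do not exactly match an original row.
        mask = [row not in originals for row in batch[cols].itertuples(index=False, name=None)]
        batch = batch[mask]
        kept.append(batch)
        n_kept += len(batch)
    if not kept:
        return df_orig.iloc[0:0]  # empty frame with the original columns
    # May hold fewer than num_rows rows if the attempts run out.
    return pd.concat(kept, ignore_index=True).head(num_rows)


# Toy generator that emits consecutive integers, so the first batch overlaps df_orig.
counter = itertools.count(1)
df_orig = pd.DataFrame({"a": [1, 2]})
df_synth = synthesize_without_originals(
    lambda n: pd.DataFrame({"a": [next(counter) for _ in range(n)]}),
    df_orig,
    num_rows=4,
)
```

In this sketch a larger max_synthesis_attempts makes it more likely that the full num_rows is reached when many generated rows collide with the original data.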