synthesized.complex.DataImputer

class DataImputer(synthesizer)

Impute values (e.g. missing values or outliers) in the original data, using a Synthesizer to generate realistic data points.

Parameters

synthesizer (HighDimSynthesizer) – Trained Synthesizer instance used to impute data.

Methods

__init__(synthesizer)

Initialize self.

impute_mask(df, mask[, produce_nans, …])

Imputes values within a dataframe from a given mask using the underlying synthesizer.

impute_nans(df[, inplace, progress_callback])

Impute NaN values within a dataframe using the underlying synthesizer.

impute_outliers(df[, outliers_percentile, …])

Impute outlier values in a DataFrame that are determined by a percentile threshold.

learn(df_train, num_iterations[, callback, …])

Train the generative model for the given iterations.

impute_mask(df, mask, produce_nans=False, inplace=False, progress_callback=None)

Imputes values within a dataframe from a given mask using the underlying synthesizer.

Parameters
  • df (pd.DataFrame) – The data in which to impute values.

  • mask (pd.DataFrame) – A boolean mask that contains True for those values to be imputed, and False for the values to remain unchanged.

  • produce_nans (bool, optional) – Whether to produce NaNs when imputing values for the given mask. Defaults to False.

  • inplace (bool, optional) – If True, modifies the given dataframe in place. Defaults to False.

  • progress_callback (Optional[Callable[[int], None]]) – Optional callback that receives an integer progress value. Defaults to None.

Return type

DataFrame

Returns

The DataFrame with masked values imputed.
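The mask semantics described above can be illustrated with a small toy that uses no library code. It is only a sketch: nested lists stand in for the DataFrame and the mask, and `fill_masked` and its `draw` argument are hypothetical names that stand in for the synthesizer. It shows that cells flagged True are replaced, cells flagged False are left unchanged, and `inplace` decides whether the input itself is modified.

```python
import copy
from typing import Any, Callable, List

Table = List[List[Any]]


def fill_masked(
    table: Table,
    mask: List[List[bool]],
    draw: Callable[[int, int], Any],
    inplace: bool = False,
) -> Table:
    """Replace cells whose mask entry is True with draw(row, col).

    Toy stand-in for mask-based imputation; not the library's implementation.
    """
    same_shape = len(table) == len(mask) and all(
        len(row) == len(mask_row) for row, mask_row in zip(table, mask)
    )
    if not same_shape:
        raise ValueError("mask must have the same shape as the table")

    # With inplace=False the caller's data is left untouched.
    target = table if inplace else copy.deepcopy(table)
    for i, mask_row in enumerate(mask):
        for j, flagged in enumerate(mask_row):
            if flagged:
                target[i][j] = draw(i, j)
    return target
```

With a real DataImputer, the mask would be a boolean pd.DataFrame with the same shape as df, and the replacement values would come from the trained synthesizer rather than from a user-supplied function.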

impute_nans(df, inplace=False, progress_callback=None)

Impute NaN values within a dataframe using the underlying synthesizer.

Parameters
  • df (pd.DataFrame) – The data in which to impute values.

  • inplace (bool, optional) – If True, modifies the given dataframe in place. Defaults to False.

  • progress_callback (Optional[Callable[[int], None]]) – Optional callback that receives an integer progress value. Defaults to None.

Return type

DataFrame

Returns

The DataFrame with NaN values imputed.
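A minimal usage sketch, assuming `imputer` is a DataImputer built from an already trained synthesizer. The helper name `impute_nans_with_progress` is hypothetical. Because this reference does not say what the integer passed to `progress_callback` represents, the callback here only records it.

```python
from typing import Any, List, Tuple


def impute_nans_with_progress(imputer: Any, df: Any) -> Tuple[Any, List[int]]:
    """Call imputer.impute_nans on a copy of df and collect progress values."""
    seen: List[int] = []

    def on_progress(value: int) -> None:
        # The meaning of the integer is not documented here, so just record it.
        seen.append(value)

    # inplace=False leaves the caller's dataframe unchanged.
    result = imputer.impute_nans(df, inplace=False, progress_callback=on_progress)
    return result, seen
```

The helper only relies on the `impute_nans(df, inplace, progress_callback)` signature shown above, so it works with any object exposing that method.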

impute_outliers(df, outliers_percentile=0.05, inplace=False, progress_callback=None)

Impute outlier values in a DataFrame that are determined by a percentile threshold.

Parameters
  • df (pd.DataFrame) – The data in which to impute values.

  • outliers_percentile (float, optional) – The percentile threshold for classifying outliers. All values outside of these percentiles are considered outliers and will be imputed. Defaults to 0.05.

  • inplace (bool, optional) – If True, modifies the given dataframe in place. Defaults to False.

  • progress_callback (Optional[Callable[[int], None]]) – Optional callback that receives an integer progress value. Defaults to None.

Return type

DataFrame

Returns

The DataFrame with outliers imputed.
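This reference does not spell out how the percentile threshold is applied. As a rough illustration only, the sketch below assumes a symmetric two-sided cut: with `outliers_percentile=0.05`, values below the 5th percentile or above the 95th percentile are flagged. `percentile_bounds` and `outlier_flags` are hypothetical helpers, not part of the library, and the library's actual rule may differ.

```python
from typing import List, Sequence, Tuple


def percentile_bounds(
    values: Sequence[float], outliers_percentile: float = 0.05
) -> Tuple[float, float]:
    """Return (low, high) cut-offs using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)

    def quantile(q: float) -> float:
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    # Assumption: the same fraction is cut from both tails.
    return quantile(outliers_percentile), quantile(1 - outliers_percentile)


def outlier_flags(
    values: Sequence[float], outliers_percentile: float = 0.05
) -> List[bool]:
    """Flag values that fall outside the two cut-offs."""
    low, high = percentile_bounds(values, outliers_percentile)
    return [v < low or v > high for v in values]
```

Flags built this way have the same True/False meaning as the mask accepted by impute_mask: True marks a value to be imputed.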

learn(df_train, num_iterations, callback=None, callback_freq=0)

Train the generative model for the given iterations.

Repeated calls continue training the model, possibly on different data.

Parameters
  • df_train (DataFrame) – The training data.

  • num_iterations (Optional[int]) – The number of training iterations (not epochs).

  • callback (Optional[Callable[[object, int, dict], bool]]) – A callback function, e.g. for logging purposes. Takes the synthesizer instance, the iteration number, and a dictionary of values (usually the losses) as arguments. Aborts training if the return value is True.

  • callback_freq (int) – How often the callback is invoked. Defaults to 0.

Return type

None
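The callback contract described above (synthesizer instance, iteration number, dictionary of values, return True to abort) can be sketched as a small factory. The name `make_early_stop_callback` is hypothetical, and the key under which a loss appears in the dictionary is not documented here, so it is passed in as a parameter rather than assumed.

```python
from typing import Callable


def make_early_stop_callback(
    loss_key: str,
    threshold: float,
    log: Callable[[str], None] = print,
) -> Callable[[object, int, dict], bool]:
    """Build a callback that logs one value and aborts once it drops below threshold."""

    def callback(synth: object, iteration: int, values: dict) -> bool:
        loss = values.get(loss_key)
        log(f"iteration {iteration}: {loss_key}={loss}")
        # Returning True asks the training loop to stop; a missing key never stops it.
        return loss is not None and float(loss) < threshold

    return callback
```

Such a callback could then be passed as `learn(df_train, num_iterations, callback=cb, callback_freq=100)`, on the assumption that `callback_freq` counts iterations between calls; the reference does not confirm the unit.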