Data Imputation

Given a dataset and a HighDimSynthesizer trained on that dataset, DataImputer can generate replacement values for selected cells. This is especially useful when the dataset contains missing values or outliers, although it can be used to impute any value.

For example, given a dataset with the following structure:

"Age"  "MonthlyIncome"  "Delinquent"
23     1500             True
58     3800             NaN
NaN    2600             False
47     NaN              True
72     3600             False

A DataImputer object can use the information learned by the HighDimSynthesizer, such as marginal and joint probability distributions, to fill missing values with realistic new samples:

"Age"  "MonthlyIncome"  "Delinquent"
23     1500             True
58     3800             False
36     2600             False
47     3100             True
72     3600             False

The output DataFrame still contains the original data for non-missing values; the DataImputer only generates synthesized data for missing values.
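
The property described above can be checked with plain pandas. The following is a minimal sketch that rebuilds the two example tables by hand and compares them; originals_preserved is a hypothetical helper written for this illustration and is not part of the DataImputer API.

```python
import numpy as np
import pandas as pd

# The example table before imputation (NaN marks a missing value).
df_before = pd.DataFrame({
    "Age": [23, 58, np.nan, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, np.nan, 3600],
    "Delinquent": [True, np.nan, False, True, False],
})

# The example table after imputation, copied from the text above.
df_after = pd.DataFrame({
    "Age": [23, 58, 36, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, 3100, 3600],
    "Delinquent": [True, False, False, True, False],
})


def originals_preserved(before: pd.DataFrame, after: pd.DataFrame) -> bool:
    """Hypothetical helper: True if every non-missing cell of `before`
    has the same value in `after`."""
    known = before.notna()      # cells that held real data originally
    same = before.eq(after)     # element-wise comparison by value
    # A cell passes if it is unchanged or if it was missing to begin with.
    return bool((same | ~known).all().all())


preserved = originals_preserved(df_before, df_after)
still_missing = int(df_after.isna().sum().sum())
```

Running the same check on the output of an actual imputation is a cheap way to confirm that only the missing cells were touched.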

Imputing Missing Values

Data imputation is performed with the DataImputer class, which requires a HighDimSynthesizer instance that has already been trained on the desired dataset.

from synthesized import DataImputer
data_imputer = DataImputer(synthesizer)  # synthesizer is a trained HighDimSynthesizer instance

Once the DataImputer has been initialized, the user can impute the missing values of a given DataFrame df (a pd.DataFrame) with the following command:

df_nans_imputed = data_imputer.impute_nans(df, inplace=False)

The DataImputer will find all NaN values in df, fill them with new values, and return a new DataFrame, df_nans_imputed, without missing values.
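
Before imputing, it can help to see which cells count as missing. The sketch below uses only pandas on a hand-built copy of the example table from the introduction; it does not call the DataImputer, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table with missing values.
df = pd.DataFrame({
    "Age": [23, 58, np.nan, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, np.nan, 3600],
    "Delinquent": [True, np.nan, False, True, False],
})

# Boolean DataFrame with the same shape as df: True where a value is NaN.
nan_mask = df.isna()

# Number of missing values in each column.
missing_per_column = nan_mask.sum()

# Rows that contain at least one missing value.
rows_with_missing = df[nan_mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```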

With the inplace argument, the user can control whether the given DataFrame is modified or a copy of it is created, modified, and returned. After running data_imputer.impute_nans(df, inplace=True), df will not contain missing values.

For large datasets, inplace=True is recommended to reduce memory usage, since no copy of the DataFrame is created.
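
To judge whether the extra copy made with inplace=False is affordable, the size of the DataFrame can be estimated with pandas. This is a rough sketch on an illustrative frame; the column contents and row count are placeholders, and the real overhead of imputation may differ.

```python
import numpy as np
import pandas as pd

# Illustrative frame; in practice this would be the dataset to impute.
n_rows = 100_000
df = pd.DataFrame({
    "Age": np.zeros(n_rows),
    "MonthlyIncome": np.zeros(n_rows),
})

# Approximate number of bytes held by df. With inplace=False a modified
# copy is also created, so roughly this much additional memory is needed.
bytes_used = int(df.memory_usage(deep=True).sum())
print(f"{bytes_used / 1e6:.1f} MB")
```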

Imputing a Mask

If the user needs to replace any other values (e.g., wrong values or anomalies), they can do so by providing a boolean mask DataFrame with the same shape and columns as the original DataFrame. Cells marked True are replaced with values generated from the HighDimSynthesizer, and cells marked False are returned unchanged.

df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
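
Since the mask must line up with the data, a quick check before imputing can catch mistakes early. The sketch below is one possible pre-flight check in plain pandas; check_mask is a hypothetical helper, not part of the DataImputer API, and requiring identical column order is an assumption made here for simplicity.

```python
import pandas as pd


def check_mask(df: pd.DataFrame, mask: pd.DataFrame) -> None:
    """Hypothetical pre-flight check: raise ValueError if `mask` does not
    match `df` in shape and columns, or is not boolean."""
    if mask.shape != df.shape:
        raise ValueError(f"mask shape {mask.shape} != data shape {df.shape}")
    # Assumption: columns must appear in the same order as in df.
    if list(mask.columns) != list(df.columns):
        raise ValueError("mask columns differ from data columns")
    non_bool = [c for c in mask.columns if not pd.api.types.is_bool_dtype(mask[c])]
    if non_bool:
        raise ValueError(f"non-boolean mask columns: {non_bool}")


df = pd.DataFrame({"Age": [23, 58], "MonthlyIncome": [1500, 3800]})

# Start from an all-False mask and flag individual cells.
df_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
df_mask.loc[1, "MonthlyIncome"] = True

check_mask(df, df_mask)  # passes silently
```

Starting from an all-False mask and flagging individual cells keeps the mask aligned with the data by construction.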

For example, given the credit-anomaly.csv below:

"Age"  "MonthlyIncome"  "Delinquent"
23     1500             True
58     921817402182     False
36     2600             False
9816   3600             True

The user can impute values for the detected anomalies (MonthlyIncome=921817402182 and Age=9816) by creating the following mask and passing it to the data imputer:

import pandas as pd

df = pd.read_csv("credit-anomaly.csv")
df_mask = pd.DataFrame({
    "Age": [False, False, False, True],
    "MonthlyIncome": [False, True, False, False],
    "Delinquent": [False, False, False, False]
})
df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
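
Writing the mask by hand does not scale beyond a few rows. One possible alternative is to derive it from plausibility rules, as in the sketch below. The bounds in valid_ranges are assumptions chosen for this illustration, not values defined by the library, and the DataFrame is built inline so the example runs without the CSV file.

```python
import pandas as pd

# Same contents as the credit-anomaly.csv example, built inline.
df = pd.DataFrame({
    "Age": [23, 58, 36, 9816],
    "MonthlyIncome": [1500, 921817402182, 2600, 3600],
    "Delinquent": [True, False, False, True],
})

# Assumed plausibility bounds (inclusive) per column; adjust to the data.
valid_ranges = {
    "Age": (0, 120),
    "MonthlyIncome": (0, 1_000_000),
}

# Start with nothing flagged, then flag values outside their range.
df_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
for column, (low, high) in valid_ranges.items():
    # between() is False for NaN, so missing values would be flagged as well.
    df_mask[column] = ~df[column].between(low, high)

print(df_mask)
```

The resulting df_mask has the same shape and columns as df and can be passed as the mask argument of impute_mask.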