Data Imputation

Given a dataset and a HighDimSynthesizer trained on that dataset, DataImputer can generate replacement data for selected values. This is especially useful when the dataset contains missing values or outliers, although it can be used to impute any value.

For example, given a dataset with the following structure:

"Age"	"MonthlyIncome"	"Delinquent"
23	1500	`True`
58	3800	NaN
NaN	2600	`False`
47	NaN	`True`
72	3600	`False`

A DataImputer object uses the information learned by the HighDimSynthesizer, such as marginal and joint probability distributions, to fill missing values with realistic new samples:

"Age"	"MonthlyIncome"	"Delinquent"
23	1500	`True`
58	3800	`False`
36	2600	`False`
47	3100	`True`
72	3600	`False`

The output DataFrame still contains the original data for non-missing values; the DataImputer only generates synthesized data for missing values.
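The guarantee above can be checked with plain pandas. The following is a minimal sketch, not part of the library: `check_imputation` is a hypothetical helper, and the two frames are typed in by hand from the example tables rather than produced by a DataImputer.

```python
import numpy as np
import pandas as pd

def check_imputation(original: pd.DataFrame, imputed: pd.DataFrame) -> None:
    """Hypothetical helper: fail if observed cells changed or gaps remain."""
    observed = original.notna()
    # Every cell that was present in the original must be unchanged.
    same = (original == imputed) | ~observed
    assert same.all().all(), "an observed value was modified"
    # No missing values may remain after imputation.
    assert not imputed.isna().any().any(), "missing values remain"

# Input table from the example above (NaN marks a missing value).
original = pd.DataFrame({
    "Age": [23, 58, np.nan, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, np.nan, 3600],
    "Delinquent": [True, np.nan, False, True, False],
})
# Output table from the example above, copied by hand for illustration.
imputed = pd.DataFrame({
    "Age": [23, 58, 36, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, 3100, 3600],
    "Delinquent": [True, False, False, True, False],
})
check_imputation(original, imputed)  # passes silently
```

A check like this could be run on the real output of a DataImputer to confirm that only the missing cells were filled.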

Imputing Missing Values

Data imputation is performed with the DataImputer class. This requires a HighDimSynthesizer instance that has already been trained on the desired dataset.

from synthesized import DataImputer
data_imputer = DataImputer(synthesizer)  # synthesizer is a HighDimSynthesizer instance

Once the DataImputer has been initialized, the user can impute the missing values in a given df: pd.DataFrame with the following command:

df_nans_imputed = data_imputer.impute_nans(df, inplace=False)

The DataImputer will find all NaN values in df, fill them with new values, and return a new DataFrame, df_nans_imputed, without missing values.

With the inplace argument, the user can control whether the given DataFrame is modified or a copy of it is created, modified, and returned. After running data_imputer.impute_nans(df, inplace=True), df will not contain missing values.

For large datasets, inplace=True is recommended, since it avoids creating a copy and so reduces memory usage.

Imputing a Mask

To replace any other values (e.g., wrong values or anomalies), the user can provide a boolean mask DataFrame with the same size and columns as the original DataFrame. Cells marked True are replaced with values generated from the HighDimSynthesizer, and cells marked False are returned as they are.

df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
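Because the mask must line up with the data, it can help to check it before calling impute_mask. The sketch below uses only pandas. `validate_mask` is a hypothetical helper, not part of the library, and the index check is an added assumption beyond the "same size and columns" requirement stated above.

```python
import pandas as pd

def validate_mask(df: pd.DataFrame, mask: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if mask is not aligned with df."""
    if mask.shape != df.shape:
        raise ValueError(f"mask shape {mask.shape} != data shape {df.shape}")
    if list(mask.columns) != list(df.columns):
        raise ValueError("mask columns differ from data columns")
    # Assumption: the mask should also share the row index of the data.
    if not mask.index.equals(df.index):
        raise ValueError("mask index differs from data index")
    non_bool = [c for c in mask.columns if not pd.api.types.is_bool_dtype(mask[c])]
    if non_bool:
        raise ValueError(f"non-boolean mask columns: {non_bool}")

# Small illustrative frames; the values are made up for this example.
example_df = pd.DataFrame({"Age": [23, 9816], "MonthlyIncome": [1500, 3600]})
example_mask = pd.DataFrame({"Age": [False, True], "MonthlyIncome": [False, False]})
validate_mask(example_df, example_mask)  # passes silently
```

Failing early with a clear message is usually easier to debug than a shape or alignment error raised further downstream.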

For example, given the credit-anomaly.csv below:

"Age"	"MonthlyIncome"	"Delinquent"
23	1500	`True`
58	921817402182	`False`
36	2600	`False`
9816	3600	`True`

The user can impute values for the detected anomalies (MonthlyIncome=921817402182 and Age=9816) by creating the following mask and passing it to the data imputer:

import pandas as pd

df = pd.read_csv("credit-anomaly.csv")
df_mask = pd.DataFrame({
    "Age": [False, False, False, True],
    "MonthlyIncome": [False, True, False, False],
    "Delinquent": [False, False, False, False]
})
df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
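For larger datasets, writing the mask by hand is impractical, so it can be derived from rules instead. The sketch below uses only pandas and rebuilds the credit-anomaly.csv rows inline so that it runs on its own. The plausibility limits (Age between 0 and 120, MonthlyIncome between 0 and 1,000,000) are hypothetical values chosen for illustration, not thresholds defined by the library.

```python
import pandas as pd

# Same rows as credit-anomaly.csv, built inline to keep the sketch self-contained.
df = pd.DataFrame({
    "Age": [23, 58, 36, 9816],
    "MonthlyIncome": [1500, 921817402182, 2600, 3600],
    "Delinquent": [True, False, False, True],
})

# Start from an all-False mask so that unflagged cells are kept as they are.
df_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
# Hypothetical plausibility limits: flag every cell outside the allowed range.
df_mask["Age"] = ~df["Age"].between(0, 120)
df_mask["MonthlyIncome"] = ~df["MonthlyIncome"].between(0, 1_000_000)

print(df_mask)
```

With these limits, the resulting df_mask is identical to the hand-written mask above and can be passed to impute_mask in the same way.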