Data Imputation
Given a dataset and a HighDimSynthesizer trained on that dataset, DataImputer can generate replacement data for selected values. This is especially
useful when the dataset contains missing values or outliers, although it can be
used to impute any value.
For example, given a dataset with the following structure:
"Age" | "MonthlyIncome" | "Delinquent" |
---|---|---|
23 | 1500 | |
58 | 3800 | NaN |
NaN | 2600 | |
47 | NaN | |
72 | 3600 | |
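Before imputing, it can help to see where the gaps are. The following is a minimal sketch, using only pandas, that rebuilds the example table and locates its missing values. The "Delinquent" values are not shown in the table, so the booleans used here are hypothetical placeholders.

```python
import pandas as pd

# Example data from the table above. The "Delinquent" values are not shown
# in the table, so the booleans below are hypothetical placeholders.
df = pd.DataFrame({
    "Age": [23, 58, None, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, None, 3600],
    "Delinquent": [False, None, True, False, False],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# List the row labels that contain at least one missing value.
rows_with_missing = df.index[df.isna().any(axis=1)].tolist()

print(missing_per_column)
print(rows_with_missing)
```

These are the cells that a DataImputer would be expected to fill.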
A DataImputer object uses information learned by the HighDimSynthesizer, such as marginal and joint probability distributions,
to fill missing values with realistic new samples:
"Age" | "MonthlyIncome" | "Delinquent" |
---|---|---|
23 | 1500 | |
58 | 3800 | |
36 | 2600 | |
47 | 3100 | |
72 | 3600 | |
The output DataFrame will still contain the original data for non-missing values.
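The property described above can be checked after any imputation run. The helper below is a sketch, not part of the synthesized package: check_imputation is a hypothetical name, and the two frames are copied by hand from the "Age" and "MonthlyIncome" columns of the tables above so that the example runs with pandas alone.

```python
import pandas as pd


def check_imputation(original: pd.DataFrame, imputed: pd.DataFrame) -> bool:
    """Return True if `imputed` has no NaNs and keeps every non-missing original value."""
    if imputed.isna().any().any():
        return False
    known = original.notna()
    # A cell passes if it was missing in the original or if it is unchanged.
    same = (original == imputed) | ~known
    return bool(same.all().all())


# Stand-in data copied from the two tables above ("Delinquent" omitted).
original = pd.DataFrame({
    "Age": [23, 58, None, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, None, 3600],
})
imputed = pd.DataFrame({
    "Age": [23, 58, 36, 47, 72],
    "MonthlyIncome": [1500, 3800, 2600, 3100, 3600],
})

print(check_imputation(original, imputed))
```

In practice, the imputed argument would be the DataFrame returned by the DataImputer rather than a hand-written frame.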
Imputing Missing Values
Data imputation is performed with the DataImputer
class, which requires a HighDimSynthesizer
instance
that has already been trained on the desired dataset.
from synthesized import DataImputer
data_imputer = DataImputer(synthesizer) # synthesizer is a trained HighDimSynthesizer instance
Once the DataImputer
has been initialized, the user can impute missing values
in a given df: pd.DataFrame
with the following command:
df_nans_imputed = data_imputer.impute_nans(df, inplace=False)
The DataImputer
will find all NaN
values in df, fill them with new
values, and return a new DataFrame, df_nans_imputed,
without missing values.
With the It is recommended to use |
Imputing a Mask
If the user needs to replace any other values (e.g., wrong values, anomalies…),
they can do so by providing a boolean mask DataFrame with the same shape and
columns as the original DataFrame. Values marked True
will be generated
by the HighDimSynthesizer,
and values marked False
will be returned unchanged.
df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
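Because the mask must match the data in shape and columns, it can be worth validating it before calling the imputer. The following is a sketch of such a check using only pandas. validate_mask is a hypothetical helper, not a function of the synthesized package, and the sample frames are illustrative.

```python
import pandas as pd


def validate_mask(df: pd.DataFrame, mask: pd.DataFrame) -> None:
    """Raise ValueError if `mask` is not a boolean DataFrame matching `df`."""
    if mask.shape != df.shape:
        raise ValueError(f"mask shape {mask.shape} differs from data shape {df.shape}")
    if list(mask.columns) != list(df.columns):
        raise ValueError("mask columns differ from data columns")
    non_bool = [c for c in mask.columns if not pd.api.types.is_bool_dtype(mask[c])]
    if non_bool:
        raise ValueError(f"non-boolean mask columns: {non_bool}")


# Illustrative data and mask: only the second "Age" value is marked.
sample_df = pd.DataFrame({"Age": [23, 58], "MonthlyIncome": [1500, 3800]})
sample_mask = pd.DataFrame({"Age": [False, True], "MonthlyIncome": [False, False]})

validate_mask(sample_df, sample_mask)  # passes silently when the mask is valid
```

A mask that fails this check would raise a ValueError before any imputation is attempted.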
For example, given the credit-anomaly.csv below:
"Age" | "MonthlyIncome" | "Delinquent" |
---|---|---|
23 | 1500 | |
58 | 921817402182 | |
36 | 2600 | |
9816 | 3600 | |
The user can impute values for the detected anomalies
(MonthlyIncome=921817402182
and Age=9816)
by creating the following
mask and passing it to the data imputer:
import pandas as pd

df = pd.read_csv("credit-anomaly.csv")
df_mask = pd.DataFrame({
"Age": [False, False, False, True],
"MonthlyIncome": [False, True, False, False],
"Delinquent": [False, False, False, False]
})
df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
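For larger datasets, writing the mask by hand is impractical. One possible approach, sketched below with pandas only, is to derive the mask from plausibility bounds per column. The bounds and the mask_from_bounds helper are assumptions made for illustration, and the "Delinquent" values are hypothetical because the table does not show them.

```python
import pandas as pd

# Data from the credit-anomaly example; "Delinquent" values are hypothetical.
df = pd.DataFrame({
    "Age": [23, 58, 36, 9816],
    "MonthlyIncome": [1500, 921817402182, 2600, 3600],
    "Delinquent": [False, False, True, False],
})

# Hypothetical plausibility bounds; choose limits that suit your own data.
bounds = {"Age": (0, 120), "MonthlyIncome": (0, 1_000_000)}


def mask_from_bounds(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Build a boolean mask that is True where a value falls outside its bounds."""
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)
    for column, (low, high) in bounds.items():
        # between() is False for NaN, so missing values are marked as well.
        mask[column] = ~df[column].between(low, high)
    return mask


df_mask = mask_from_bounds(df, bounds)
print(df_mask)
```

The resulting df_mask marks the same two cells as the hand-written mask above and could be passed to impute_mask in the same way.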