Data Imputation#
Given a dataset and a HighDimSynthesizer trained on that dataset, a DataImputer is able to generate new values
for specific entries in that dataset. This is especially useful when the dataset contains missing values or
outliers, although it can be used to impute any value.
For example, given a dataset with the following structure:
| Age | MonthlyIncome | Delinquent |
|---|---|---|
| 23 | 1500 | |
| 58 | 3800 | NaN |
| NaN | 2600 | |
| 47 | NaN | |
| 72 | 3600 | |
A DataImputer object can use information provided by the HighDimSynthesizer, such as marginal and joint
probability distributions, to fill the missing values with realistic new samples:
| Age | MonthlyIncome | Delinquent |
|---|---|---|
| 23 | 1500 | |
| 58 | 3800 | |
| 36 | 2600 | |
| 47 | 3100 | |
| 72 | 3600 | |
Important
The output DataFrame will still contain the original data for non-missing values; the DataImputer will only
generate synthesized data for the missing values.
Imputing Missing Values#
In [1]: from synthesized import DataImputer
Data imputation is achieved using the DataImputer class. This requires a HighDimSynthesizer instance
that has already been trained on the desired dataset.
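The synthesizer used below is assumed to have been created beforehand. A rough sketch of that training step is shown here for reference; it is based on the usual MetaExtractor/HighDimSynthesizer workflow and an illustrative file path, so consult the HighDimSynthesizer documentation for the exact training API.

import pandas as pd
from synthesized import HighDimSynthesizer, MetaExtractor

# Load the original dataset (path is illustrative) and train a synthesizer on it.
df = pd.read_csv("credit.csv")
df_meta = MetaExtractor.extract(df)
synthesizer = HighDimSynthesizer(df_meta)
synthesizer.learn(df_train=df)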
In [2]: data_imputer = DataImputer(synthesizer)  # synthesizer is a HighDimSynthesizer instance
Once the DataImputer has been initialized, the user can impute missing values in a given df: pd.DataFrame
with the following command:
In [3]: df_nans_imputed = data_imputer.impute_nans(df, inplace=False)
The DataImputer will find all NaN values in df, fill them with new values, and return a new DataFrame
df_nans_imputed without missing values.
Note
With the inplace argument, the user can control whether the given DataFrame is modified in place or whether a copy
is created, modified, and returned. After running data_imputer.impute_nans(df, inplace=True), df will no longer
contain missing values, as in the sketch below.
It is recommended to use inplace=True for large datasets in order to reduce memory usage.
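A minimal sketch of the in-place pattern, reusing the df and data_imputer from the example above and checking the result with plain pandas:

# Modify df directly instead of allocating a copy (useful for large frames).
data_imputer.impute_nans(df, inplace=True)

# df itself should now contain no missing values.
assert not df.isna().any().any()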
Imputing Outliers#
Outliers in data can heavily degrade model performance if not treated carefully, as many loss functions (e.g. MSE)
are highly affected by heavy-tailed distributions. For these situations, the DataImputer can reduce the number of
outliers by automatically detecting and imputing them with the following command:
In [4]: df_outliers_imputed = data_imputer.impute_outliers(df, outliers_percentile=0.05, inplace=False)
The output DataFrame df_outliers_imputed will have the top 2.5% and bottom 2.5% of values in each continuous
column replaced with corresponding values learned by the HighDimSynthesizer.
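With outliers_percentile=0.05, this corresponds to values outside the central 95% of each continuous column. The equivalent bounds can be inspected directly with pandas; whether the detection uses exactly these quantiles internally is an assumption based on the description above.

# Illustrative check of the 2.5% / 97.5% bounds for one continuous column.
lower, upper = df["MonthlyIncome"].quantile([0.025, 0.975])
flagged = df[(df["MonthlyIncome"] < lower) | (df["MonthlyIncome"] > upper)]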
Note
For each column, the DataImputer
will use a percentile-based approach to detect outliers. If some other
approach is needed, it is recommended to create a boolean mask and use impute_mask()
as described below.
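For example, an interquartile-range rule could be used instead of percentiles. The sketch below is purely illustrative (the 1.5 * IQR threshold and the column choice are assumptions); it only builds the mask, which can then be passed to impute_mask() as shown in the next section.

# Hypothetical IQR-based outlier detection for a single continuous column.
q1, q3 = df["MonthlyIncome"].quantile([0.25, 0.75])
iqr = q3 - q1
income_outliers = (df["MonthlyIncome"] < q1 - 1.5 * iqr) | (df["MonthlyIncome"] > q3 + 1.5 * iqr)

# Assemble a full boolean mask (True = replace this cell, False = keep it).
outlier_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
outlier_mask["MonthlyIncome"] = income_outliers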
Imputing a Mask#
If the user needs to replace any other values (e.g. wrong values, anomalies, …), they can do so by providing
a boolean mask DataFrame with the same size and columns as the original DataFrame, where all True values will be
replaced with values generated by the HighDimSynthesizer and all False values will be returned unchanged.
In [5]: df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
For example, given the credit-anomaly.csv dataset below,
| Age | MonthlyIncome | Delinquent |
|---|---|---|
| 23 | 1500 | |
| 58 | 921817402182 | |
| 36 | 2600 | |
| 9816 | 3600 | |
the user can impute values for the detected anomalies (MonthlyIncome=921817402182 and Age=9816)
by creating the following mask and passing it to the data imputer:
In [6]: df = pd.read_csv("credit-anomaly.csv")
In [7]: df_mask = pd.DataFrame({
...: "Age": [False, False, False, True],
...: "MonthlyIncome": [False, True, False, False],
...: "Delinquent": [False, False, False, False]
...: })
...:
In [8]: df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)
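Instead of writing the mask out by hand, it can also be built from simple domain rules; the thresholds below are illustrative assumptions, not values suggested by the library.

# Flag implausible values with rule-based checks (illustrative thresholds).
df_mask = pd.DataFrame({
    "Age": (df["Age"] < 0) | (df["Age"] > 120),
    "MonthlyIncome": df["MonthlyIncome"] > 1_000_000,
    "Delinquent": pd.Series(False, index=df.index),
})

df_imputed = data_imputer.impute_mask(df, mask=df_mask, inplace=False)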