Overrides

Within the SDK, the MetaExtractor and HighDimSynthesizer automatically infer data types and the most appropriate method of modelling them. For example, consider the dataset shown below:
from synthesized import utils
df = utils.get_example_data()
print(df)
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio
1 1 0.766127 45 2.0 0.802982
2 0 0.957151 40 NaN 0.121876
3 0 0.658180 38 1.0 0.085113
4 0 0.233810 30 NaN 0.036050
5 0 0.907239 49 1.0 0.024926
.. ... ... ... ... ...
96 0 0.245353 37 0.0 0.288417
97 0 0.542243 48 2.0 10.000000
98 0 0.010531 57 0.0 0.280665
99 0 0.363200 32 0.0 0.480524
100 0 0.032618 75 0.0 0.006799
[100 rows x 5 columns]
Focussing specifically on the age column, the MetaExtractor would interpret it as an integer, and the HighDimSynthesizer would model it as a continuous variable rather than a categorical one, because of the high number of unique entries as a proportion of the total number of rows. These assumptions allow the HighDimSynthesizer to generate completely new values of age, all of integer type. However, in some cases we may wish to override this behaviour - for instance, if we wish to generate floats rather than integers, or if we don't want to generate any new age values and only want to sample from those already present, in which case the column can be modelled in a categorical rather than a continuous fashion.
Through the type_overrides argument, Synthesized offers the ability to override the default behaviour of both the MetaExtractor and the HighDimSynthesizer, so that each column is handled appropriately.
Determining default behaviour

Default Metas

To determine whether it is necessary to override the default behaviour of the SDK modules, it is often useful to first discover what that default behaviour is. Information regarding the inferred data types of columns within a dataset is contained within the metadata:
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(df)
print(df_meta.children)
>>> [<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>,
... <Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>,
... <Scale[i8]: Integer(name=age)>,
... <Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>,
... <Ring[f8]: Float(name=DebtRatio)>]
For the time being, we'll ignore the labels Scale and Ring and simply focus on the data types for each column. Focussing on just the inner set of labels and comparing with the raw dataset, the MetaExtractor has correctly determined that there is a mixture of Integer, Float and IntegerBool types within the data.
Default Models

Following this, we create a HighDimSynthesizer instance using df_meta. The HighDimSynthesizer object can then be inspected to determine how it will model each column:
from synthesized import HighDimSynthesizer
synth = HighDimSynthesizer(df_meta)
print(synth._df_model.children)
>>> [Histogram(meta=<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>),
... KernelDensityEstimate(meta=<Scale[i8]: Integer(name=age)>),
... Histogram(meta=<Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=DebtRatio)>)]
For each column, the HighDimSynthesizer has used the associated meta object to model the variable either continuously, using a KernelDensityEstimate, or categorically, using a Histogram. The synth object can now be trained in order to generate synthetic data. Columns modelled continuously may have completely new values generated during this process, while columns modelled categorically will only contain values already present in the original data. In both cases, the appropriate correlations between the various columns are learnt.
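The practical difference between the two model types can be illustrated with a small, self-contained sketch in plain NumPy (this mimics the sampling behaviour conceptually and is not the SDK's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
ages = np.array([45, 40, 38, 30, 49, 37, 48, 57, 32, 75])

# Categorical (Histogram-style) sampling: draw only values already observed.
categorical_sample = rng.choice(ages, size=1000)

# Continuous (Gaussian-KDE-style) sampling: draw an observed value, then
# perturb it with a smoothing kernel, so entirely new values can appear.
bandwidth = 2.0
continuous_sample = rng.choice(ages, size=1000) + rng.normal(0.0, bandwidth, size=1000)

print(sorted(set(categorical_sample)))  # a subset of the observed ages
print(continuous_sample[:5])            # previously unseen values
```

The categorical sample can never contain a value outside the original data, while the continuous sample almost always does, which is exactly the trade-off the type_overrides mechanism lets you control.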
Having followed this procedure to identify the default behaviour of the MetaExtractor and HighDimSynthesizer, the type_overrides parameter can be used to modify that behaviour.

For more information on the Meta and Model modules, see the Reference API docs.
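As a sketch of how such an override might look: the import path synthesized.model.models and the exact type_overrides signature below are assumptions, not confirmed API, so consult the Reference API docs for the definitive interface. The Histogram and KernelDensityEstimate class names themselves appear in the model output above.

```python
# Hypothetical sketch: force the age column to be modelled categorically by
# passing a Histogram model for it via type_overrides. The import path and
# the call signature are ASSUMPTIONS; verify them against the Reference API.
try:
    from synthesized import HighDimSynthesizer, MetaExtractor, utils
    from synthesized.model.models import Histogram  # assumed import path

    df = utils.get_example_data()
    df_meta = MetaExtractor.extract(df)

    # A Histogram built from the age meta would ask the synthesizer to sample
    # age from observed values rather than generating new ones.
    synth = HighDimSynthesizer(df_meta, type_overrides=[Histogram(df_meta["age"])])
    print(synth._df_model.children)
    sdk_available = True
except ImportError:
    # The SDK is a commercial package and may not be installed locally.
    print("synthesized SDK not available")
    sdk_available = False
```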