Overrides

Within the SDK, the MetaExtractor and HighDimSynthesizer automatically infer data types and the most appropriate method of modelling them. For example, consider the dataset shown below:
from synthesized import utils
df = utils.get_example_data()
print(df)
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio
1 1 0.766127 45 2.0 0.802982
2 0 0.957151 40 NaN 0.121876
3 0 0.658180 38 1.0 0.085113
4 0 0.233810 30 NaN 0.036050
5 0 0.907239 49 1.0 0.024926
.. ... ... ... ... ...
96 0 0.245353 37 0.0 0.288417
97 0 0.542243 48 2.0 10.000000
98 0 0.010531 57 0.0 0.280665
99 0 0.363200 32 0.0 0.480524
100 0 0.032618 75 0.0 0.006799
[100 rows x 5 columns]
Focussing specifically on the age column, the MetaExtractor would interpret it as an integer, and the HighDimSynthesizer would model it as a continuous variable rather than a categorical one, because of the high number of unique entries as a proportion of the total number of rows. These assumptions allow the HighDimSynthesizer to generate completely new values of age, all of integer type. However, in some cases we may wish to override this behaviour - for instance, if we wish to generate floats rather than integers, or if we don't want to generate any new age values and only want to sample from those already present, in which case the column can be modelled in a categorical rather than a continuous fashion.
Through the type_overrides argument, Synthesized offers the ability to override the default behaviour of both the MetaExtractor and the HighDimSynthesizer, so that each column is handled appropriately.
Determining default behaviour

Default Metas

To determine whether it is necessary to override the default behaviour of the SDK modules, it is often useful to first discover what that default behaviour is. Information regarding the inferred data types of columns within a dataset is contained within the metadata:
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(df)
print(df_meta.children)
>>> [<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>,
... <Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>,
... <Scale[i8]: Integer(name=age)>,
... <Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>,
... <Ring[f8]: Float(name=DebtRatio)>]
For the time being, we'll ignore the labels Scale and Ring and simply focus on the data types for each column. Focussing on just the inner set of labels and comparing with the raw dataset, the MetaExtractor has correctly determined that there is a mixture of Integer, Float and IntegerBool types within the data.
Default Models

Following this, we create a HighDimSynthesizer instance using df_meta. The HighDimSynthesizer object can then be inspected to determine how it will model each column:
from synthesized import HighDimSynthesizer
synth = HighDimSynthesizer(df_meta)
print(synth._df_model.children)
>>> [Histogram(meta=<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>),
... KernelDensityEstimate(meta=<Scale[i8]: Integer(name=age)>),
... Histogram(meta=<Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=DebtRatio)>)]
For each column, the HighDimSynthesizer has used the associated meta object to model the variable either continuously, using a KernelDensityEstimate, or categorically, using a Histogram. The synth object can now be trained in order to generate synthetic data. Columns modelled continuously may have completely new values generated during this process, while columns modelled categorically will only contain values already present in the original data. In both cases, the appropriate correlations between the various columns are learnt.
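The practical difference between the two model types can be illustrated with a small, self-contained sketch in plain NumPy (this mimics the sampling behaviour conceptually and is not the SDK's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
ages = np.array([45, 40, 38, 30, 49, 37, 48, 57, 32, 75])

# Categorical (Histogram-style) sampling: draw only values already observed.
categorical_sample = rng.choice(ages, size=1000)

# Continuous (Gaussian-KDE-style) sampling: draw an observed value, then
# perturb it with a smoothing kernel, so entirely new values can appear.
bandwidth = 2.0
continuous_sample = rng.choice(ages, size=1000) + rng.normal(0.0, bandwidth, size=1000)

print(sorted(set(categorical_sample)))  # a subset of the observed ages
print(continuous_sample[:5])            # previously unseen values
```

The categorical sample can never contain a value outside the original data, while the continuous sample almost always does, which is exactly the trade-off the type_overrides mechanism lets you control.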
Having followed this procedure to identify the default behaviour of the MetaExtractor and HighDimSynthesizer, the type_overrides parameter can be used to modify that behaviour.

For more information on the Meta and Model modules, see the Reference API docs.
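As a sketch of how such an override might look: the import path synthesized.model.models and the exact type_overrides signature below are assumptions, not confirmed API, so consult the Reference API docs for the definitive interface. The Histogram and KernelDensityEstimate class names themselves appear in the model output above.

```python
# Hypothetical sketch: force the age column to be modelled categorically by
# passing a Histogram model for it via type_overrides. The import path and
# the call signature are ASSUMPTIONS; verify them against the Reference API.
try:
    from synthesized import HighDimSynthesizer, MetaExtractor, utils
    from synthesized.model.models import Histogram  # assumed import path

    df = utils.get_example_data()
    df_meta = MetaExtractor.extract(df)

    # A Histogram built from the age meta would ask the synthesizer to sample
    # age from observed values rather than generating new ones.
    synth = HighDimSynthesizer(df_meta, type_overrides=[Histogram(df_meta["age"])])
    print(synth._df_model.children)
    sdk_available = True
except ImportError:
    # The SDK is a commercial package and may not be installed locally.
    print("synthesized SDK not available")
    sdk_available = False
```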