Model Overrides
The HighDimSynthesizer module automatically determines how to model each column based on the metadata supplied to it, considering things such as the data type and the number of unique categories in the column. The model chosen for each column can be inspected by creating a HighDimSynthesizer object and accessing its _df_model property. Using the example from Overrides:
synth = HighDimSynthesizer(df_meta)
print(synth._df_model.children)
>>> [Histogram(meta=<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>),
... KernelDensityEstimate(meta=<Scale[i8]: Integer(name=age)>),
... Histogram(meta=<Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=DebtRatio)>)]
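For wider datasets it can be easier to read this list as a mapping from column name to model class. The following is only a minimal sketch: the helper summarize_models is hypothetical, and it assumes each model exposes a meta attribute with a name, which the printed output suggests but the source does not confirm. Stand-in classes are used so the snippet runs without the SDK installed.

```python
from types import SimpleNamespace


def summarize_models(models):
    """Map each column name to the class name of the model assigned to it.

    Assumption: every model has a `meta` attribute whose `name` is the
    column name. This mirrors the printed repr, not a documented API.
    """
    return {model.meta.name: type(model).__name__ for model in models}


# Stand-in classes (NOT the SDK classes) so the sketch is self-contained.
class Histogram(SimpleNamespace):
    pass


class KernelDensityEstimate(SimpleNamespace):
    pass


stand_ins = [
    Histogram(meta=SimpleNamespace(name="SeriousDlqin2yrs")),
    KernelDensityEstimate(meta=SimpleNamespace(name="age")),
]

summary = summarize_models(stand_ins)
print(summary)
```

With a real synthesizer, the same helper could be tried on synth._df_model.children, provided the attribute assumption holds.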
Each model is linked with the metadata associated with a particular column of the raw dataset. There are three types of models available for columns within the SDK:
- KernelDensityEstimate: Used to model numeric data types, including datetimes, in a continuous manner. New values not present in the original dataset can be synthesized using this model.
- Histogram: Used to model discrete/categorical variables, of any data type. Synthesizing data using this model will only generate values that appear in the raw dataset.
- Enumeration: Used to generate data from a minimum value up to a maximum in discrete, constant steps.
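The practical difference between the three model types can be illustrated with a toy example in plain Python. This is a conceptual sketch only and does not reflect how the SDK implements these models: the function names, the Gaussian kernel, and the bandwidth value are all assumptions chosen for illustration.

```python
import random

observed = [21, 35, 35, 48, 62]  # toy column values
rng = random.Random(0)           # seeded for repeatability


def sample_histogram(values, n, rng):
    """Resample observed values in proportion to their frequency."""
    return [rng.choice(values) for _ in range(n)]


def sample_kde(values, n, rng, bandwidth=2.0):
    """Sample a Gaussian kernel density estimate.

    Picking an observed value at random and adding Gaussian noise is
    equivalent to drawing from a KDE with that kernel and bandwidth.
    """
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(n)]


def enumerate_range(minimum, maximum, step):
    """Values from minimum up to maximum in constant steps."""
    count = int((maximum - minimum) // step) + 1
    return [minimum + i * step for i in range(count)]


hist_samples = sample_histogram(observed, 100, rng)
kde_samples = sample_kde(observed, 100, rng)
steps = enumerate_range(0, 10, 2)

# Histogram-style sampling never leaves the observed set of values.
print(set(hist_samples) <= set(observed))
# KDE-style sampling produces values that were not in the original data.
print(any(value not in observed for value in kde_samples))
print(steps)
```

The histogram-style sampler can only repeat values it has seen, whereas the kernel-density sampler fills in the gaps between them, which is why the choice of model matters for a column such as age.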
To override the default model selection for a particular column, a new model object should be created using the metadata associated with that column. For example, rather than using KernelDensityEstimate, age could be modelled with Histogram in order to treat it as categorical:
from synthesized.model.models import Histogram
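# df_meta.children[2] is the metadata entry for the "age" column
age_histogram = Histogram(df_meta.children[2])
# Pass df_meta as before, plus the list of overriding models
synth = HighDimSynthesizer(df_meta, type_overrides=[age_histogram])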
print(synth._df_model.children)
>>> [Histogram(meta=<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>),
... Histogram(meta=<Scale[i8]: Integer(name=age)>),
... Histogram(meta=<Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=DebtRatio)>)]
Note that the metadata entry corresponding to the age column was passed in as an argument to Histogram, and the type_overrides specified in HighDimSynthesizer were given as a list.
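Selecting the metadata entry by position (children[2]) breaks if the column order changes. One possible alternative is to look the entry up by column name. The sketch below is hypothetical: find_meta is not part of the SDK, and it assumes each metadata entry has a name attribute, as the printed output suggests. Stand-in objects replace df_meta.children so that the snippet runs on its own.

```python
from types import SimpleNamespace


def find_meta(children, column_name):
    """Return the first metadata entry whose `name` equals column_name.

    Assumption: metadata entries expose the column name as `.name`.
    """
    for child in children:
        if getattr(child, "name", None) == column_name:
            return child
    raise KeyError(f"no metadata entry named {column_name!r}")


# Stand-ins for df_meta.children (NOT real SDK metadata objects).
children = [
    SimpleNamespace(name=column)
    for column in (
        "SeriousDlqin2yrs",
        "RevolvingUtilizationOfUnsecuredLines",
        "age",
    )
]

age_meta = find_meta(children, "age")
print(age_meta.name)
```

If the assumption holds for real metadata objects, the result of find_meta(df_meta.children, "age") could be passed to Histogram in place of df_meta.children[2].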