Model Overrides

The HighDimSynthesizer module automatically determines how to model each column based on the metadata supplied to it, considering things such as the data type and the number of unique categories in the column. The model chosen for each column can be inspected through the _df_model property of an instantiated HighDimSynthesizer object. Using the example from Overrides:

synth = HighDimSynthesizer(df_meta)
print(synth._df_model.children)

>>> [Histogram(meta=<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>),
... KernelDensityEstimate(meta=<Scale[i8]: Integer(name=age)>),
... Histogram(meta=<Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=DebtRatio)>)]
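To make the idea of automatic model selection concrete, here is a minimal sketch of a rule that picks a model from a column's data type and its number of unique values. It is not the SDK's actual logic: ColumnSummary, choose_model, the MAX_CATEGORIES cut-off, and the unique-value counts in the usage lines are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical cut-off, chosen only for illustration; not an SDK value.
MAX_CATEGORIES = 20


@dataclass
class ColumnSummary:
    """Hypothetical stand-in for the metadata of one column."""
    name: str
    dtype: str        # one of "int", "float", "datetime", "str", "bool"
    num_unique: int   # number of distinct values seen in the column


def choose_model(col: ColumnSummary) -> str:
    """Return the name of a model type for the given column summary."""
    can_be_continuous = col.dtype in {"int", "float", "datetime"}
    if can_be_continuous and col.num_unique > MAX_CATEGORIES:
        # Many distinct numeric values: treat the column as continuous.
        return "KernelDensityEstimate"
    # Few distinct values, or a non-numeric type: treat it as categorical.
    return "Histogram"


# Made-up unique-value counts for two integer columns.
age_model = choose_model(ColumnSummary("age", "int", 80))
late_model = choose_model(
    ColumnSummary("NumberOfTime30-59DaysPastDueNotWorse", "int", 12)
)
```

Under these assumed counts, the two integer columns end up with different models, which mirrors the pattern in the output above: the data type alone does not decide the model.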

Each model is linked with the metadata associated with a particular column of the raw dataset. There are three types of models available for columns within the SDK:

  • KernelDensityEstimate: Used to model numeric data types, including datetimes, in a continuous manner. New values not present in the original dataset can be synthesized using this model.

  • Histogram: Used to model discrete/categorical variables, of any data type. Synthesizing data using this model will only generate values that appear in the raw dataset.

  • Enumeration: Used to generate data from a minimum value up to a maximum value in discrete, constant steps.
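The practical difference between the three model types can be illustrated with a small standard-library sketch. It is a conceptual illustration only and does not reflect how the SDK implements its models: the function names, the fixed bandwidth, and the sample data are all assumptions.

```python
import random
from collections import Counter


def sample_histogram(values, n, rng):
    """Draw n values with probability proportional to observed frequency."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def sample_kde(values, n, rng, bandwidth=1.0):
    """Sample from a Gaussian kernel density estimate.

    Picking an observed point uniformly at random and adding Gaussian noise
    with standard deviation `bandwidth` is equivalent to sampling from the
    Gaussian KDE built on `values`.
    """
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(n)]


def enumerate_values(start, stop, step):
    """Return values from start up to stop (inclusive) in constant steps."""
    count = int((stop - start) // step) + 1
    return [start + i * step for i in range(count)]


rng = random.Random(0)          # seeded for reproducibility
ages = [23, 35, 35, 41, 58]     # made-up sample column

hist_samples = sample_histogram(ages, 1000, rng)        # only observed values
kde_samples = sample_kde(ages, 1000, rng, bandwidth=2.0)  # includes new values
ids = enumerate_values(100, 120, 5)                     # 100, 105, ..., 120
```

The histogram sampler can only return values that were in the input, the kernel density sampler returns values near the observed ones but generally not equal to them, and the enumeration ignores the observed frequencies entirely.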

To override the default model selection for a particular column, a new model object should be created using the metadata associated with that column. For example, rather than using KernelDensityEstimate, age could be modelled with Histogram in order to treat it as categorical:

from synthesized.model.models import Histogram
# df_meta.children[2] is the metadata entry for the age column
age_histogram = Histogram(df_meta.children[2])
synth = HighDimSynthesizer(df_meta, type_overrides=[age_histogram])
print(synth._df_model.children)

>>> [Histogram(meta=<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>),
... Histogram(meta=<Scale[i8]: Integer(name=age)>),
... Histogram(meta=<Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>),
... KernelDensityEstimate(meta=<Ring[f8]: Float(name=DebtRatio)>)]

Note that the metadata entry for the age column is passed as the argument to Histogram, and that type_overrides is given to HighDimSynthesizer as a list.
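One way to picture what an override list does is as a replacement of default models by column name. The sketch below is a simplified illustration under that assumption and is not the SDK's implementation: the Model class and the apply_overrides function are hypothetical, and only three of the five columns are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for a model attached to one column."""
    kind: str      # e.g. "Histogram" or "KernelDensityEstimate"
    column: str    # name of the column the model describes


def apply_overrides(defaults, overrides):
    """Replace each default model whose column matches an override.

    Column order is preserved, and an override for a column that has no
    default model is reported as an error rather than silently ignored.
    """
    by_column = {m.column: m for m in overrides}
    unknown = set(by_column) - {m.column for m in defaults}
    if unknown:
        raise ValueError(f"overrides for unknown columns: {sorted(unknown)}")
    return [by_column.get(m.column, m) for m in defaults]


defaults = [
    Model("Histogram", "SeriousDlqin2yrs"),
    Model("KernelDensityEstimate", "age"),
    Model("KernelDensityEstimate", "DebtRatio"),
]
resolved = apply_overrides(defaults, [Model("Histogram", "age")])
```

Here only the age entry changes; columns without an override keep their default model, as in the output above.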