Meta Overrides

The previous section showed that a meta object, df_meta, stores the inferred data types that will be used during model training in its children attribute. Using the example dataset from Overrides, calling df_meta.children gives:

>>> [<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>,
... <Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>,
... <Scale[i8]: Integer(name=age)>,
... <Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>,
... <Ring[f8]: Float(name=DebtRatio)>]

For each entry in df_meta.children there are three important features:

  • Name: The column name associated with the given df_meta entry.

  • Abstract data type: In the above example, Ring and Scale are both abstract data types. Every concrete data type implements one of these abstract types, which describe generic properties of the data and the kinds of operations that can be performed on its values. The SDK implements five abstract data types, arranged in a hierarchy:

    • Nominal: Categorical data

    • Ordinal: Categorical data that can be sorted

    • Affine: Continuous data where there is a notion of distance between two points

    • Scale: Continuous data where, in addition to subtraction, addition is defined

    • Ring: Continuous data where multiplication and division are defined

  • Concrete data type: These are the concrete implementations of the above abstract types, such as IntegerBool, Float and Integer. A full list of concrete data types is given below, together with the abstract type each one belongs to:

Abstract    Concrete

Nominal     String, FormattedString, GroupedString, JSON
Ordinal     OrderedString, Bool
Affine      DateTime
Scale       TimeDelta, TimeDeltaDay, Integer
Ring        Float, IntegerBool
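
The table above can be mirrored in a small lookup, which may help when deciding which override to use. This is a minimal plain-Python sketch, not part of the SDK: the names ABSTRACT_ORDER, CONCRETE_TO_ABSTRACT, abstract_type_of and is_at_least are hypothetical, and it assumes the hierarchy runs from Nominal to Ring in the order listed, with each level supporting the operations of the levels before it.

```python
# Illustrative only: a plain-Python mirror of the table above, not the SDK's API.
# Assumption: the hierarchy is ordered Nominal -> Ring, as listed in the text.
ABSTRACT_ORDER = ["Nominal", "Ordinal", "Affine", "Scale", "Ring"]

CONCRETE_TO_ABSTRACT = {
    "String": "Nominal",
    "FormattedString": "Nominal",
    "GroupedString": "Nominal",
    "JSON": "Nominal",
    "OrderedString": "Ordinal",
    "Bool": "Ordinal",
    "DateTime": "Affine",
    "TimeDelta": "Scale",
    "TimeDeltaDay": "Scale",
    "Integer": "Scale",
    "Float": "Ring",
    "IntegerBool": "Ring",
}


def abstract_type_of(concrete):
    """Return the abstract type name for a concrete type name."""
    return CONCRETE_TO_ABSTRACT[concrete]


def is_at_least(concrete, abstract):
    """True if the concrete type sits at or beyond `abstract` in the hierarchy."""
    level = ABSTRACT_ORDER.index(abstract_type_of(concrete))
    return level >= ABSTRACT_ORDER.index(abstract)


print(abstract_type_of("Integer"))       # Scale
print(is_at_least("Float", "Affine"))    # True
print(is_at_least("DateTime", "Scale"))  # False
```

Under these assumptions, a Float column supports everything an Affine column does, whereas a DateTime column does not reach the Scale level.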

Type overrides

To override the default behaviour, type_overrides can be passed to the MetaExtractor.extract() method. Each override should be a meta value object of one of the concrete types listed in the table above. For instance, using the example dataset from Overrides, to have age interpreted as a float rather than an integer:

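from synthesized import MetaExtractor
from synthesized.metadata.value import Float

# df is the example DataFrame loaded in the previous section
age_float = Float('age')  # treat the age column as a float
df_meta = MetaExtractor.extract(df, type_overrides=[age_float])
print(df_meta.children)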

>>> [<Ring[i8]: IntegerBool(name=SeriousDlqin2yrs)>,
... <Ring[f8]: Float(name=RevolvingUtilizationOfUnsecuredLines)>,
... <Ring[f8]: Float(name=age)>,
... <Scale[i8]: Integer(name=NumberOfTime30-59DaysPastDueNotWorse)>,
... <Ring[f8]: Float(name=DebtRatio)>]

Note that type_overrides is provided as a list, since multiple overrides can be specified at once.
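
When several columns need the same override, the list can be built programmatically. The following is a minimal sketch: build_overrides is a hypothetical helper, and FloatStandIn is a hypothetical stand-in for synthesized.metadata.value.Float, used only so the example runs without the SDK installed. It assumes, as in the example above, that a meta value object is constructed from the column name alone.

```python
from dataclasses import dataclass


@dataclass
class FloatStandIn:
    """Hypothetical stand-in for synthesized.metadata.value.Float (not the SDK class)."""
    name: str


def build_overrides(columns, value_cls):
    """Create one override object of type `value_cls` per column name."""
    return [value_cls(column) for column in columns]


columns = ["age", "NumberOfTime30-59DaysPastDueNotWorse"]
overrides = build_overrides(columns, FloatStandIn)
print([override.name for override in overrides])

# With the SDK available, the same list would be built from the real class
# and passed on (assumed usage, following the example above):
#   overrides = build_overrides(columns, Float)
#   df_meta = MetaExtractor.extract(df, type_overrides=overrides)
```

Overrides of different concrete types can be mixed in the same list by constructing each object individually instead.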