Models

A Model describes how data with a given Meta type is modelled and generated.

Overriding Default Models

The HighDimSynthesizer automatically determines how each column is modelled. This default behaviour can be overridden by passing explicit models through type_overrides.

from synthesized import HighDimSynthesizer
from synthesized.model.models import Histogram

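# df_meta is assumed to be the meta object already extracted for the dataset.
# Build the model for one column, then pass it in to replace the default choice.
categorical_model = Histogram(meta=df_meta["categorical_col"])
synth = HighDimSynthesizer(df_meta, type_overrides=[categorical_model])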
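
The override semantics can be pictured with a toy, SDK-free sketch. Everything here is hypothetical: ToyModel, default_model, resolve_models and the "numeric columns get a KDE" rule are illustrative assumptions, not the SDK's internals. It only shows the idea that each column gets a default model unless an explicit model is supplied for it.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for a model; not an SDK class."""
    column: str
    kind: str  # e.g. "Histogram" or "KernelDensityEstimate"


def default_model(column: str, dtype: str) -> ToyModel:
    # Assumed default rule: numeric columns get a KDE, everything else a histogram.
    kind = "KernelDensityEstimate" if dtype in ("int", "float") else "Histogram"
    return ToyModel(column, kind)


def resolve_models(dtypes: dict, overrides: list) -> dict:
    # Start from the defaults, then let each override replace the
    # default for the column it was built for.
    models = {col: default_model(col, dtype) for col, dtype in dtypes.items()}
    for model in overrides:
        models[model.column] = model
    return models


dtypes = {"age": "int", "rating": "int", "city": "str"}
resolved = resolve_models(dtypes, overrides=[ToyModel("rating", "Histogram")])
for column, model in resolved.items():
    print(column, "->", model.kind)
```

In this sketch the integer column "rating" would default to a continuous model, and the override treats it as categorical instead, which mirrors the Histogram override in the SDK example above.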

The usage of each Model implementation in the SDK is listed below:

Enumeration

Used to generate data from a minimum value up to a maximum value in discrete, constant steps.

Python

from synthesized.model.models import Enumeration

enum_model = Enumeration(
    meta=df_meta["colA"],
    start=1,
    step=1
)

Properties

  • meta: The meta of the column that is being modelled.

  • start (optional): Value to start the enumeration from. If not provided, the minimum will be inferred from the meta.

  • step (optional): Step size of the enumeration. If not provided, the step will be inferred from the meta.

YAML

enumeration:
  - name: "colA"
    start: 1
    step: 1

Properties

  • name: The name of the meta that is being modelled.

  • start (optional): Value to start the enumeration from. If not provided, the minimum will be inferred from the meta.

  • step (optional): Step size of the enumeration. If not provided, the step will be inferred from the meta.
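
To make the start and step properties concrete, here is a minimal standard-library sketch of the values an enumeration produces. The helper enumerate_values and its n argument are hypothetical and not part of the SDK; how the SDK treats the maximum is not shown here.

```python
from itertools import count, islice


def enumerate_values(start=1, step=1, n=5):
    """Return the first n values of an enumeration: start, start + step, ..."""
    return list(islice(count(start, step), n))


print(enumerate_values(start=1, step=1, n=5))   # [1, 2, 3, 4, 5]
print(enumerate_values(start=10, step=5, n=4))  # [10, 15, 20, 25]
```

This kind of model suits columns such as sequential identifiers, where each generated row follows the previous one by a fixed increment.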

KernelDensityEstimate

Used to model numeric data types, including datetimes, in a continuous manner.

Python

from synthesized.model.models import KernelDensityEstimate

kde_model = KernelDensityEstimate(
    meta=df_meta["colA"],
)

Properties

  • meta: The meta of the column that is being modelled.

YAML

kernel_density_estimate:
  - name: "colA"

Properties

  • name: The name of the meta that is being modelled.
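
For readers unfamiliar with kernel density estimation, the following is a minimal standard-library sketch of the general technique, not the SDK's implementation. It assumes a Gaussian kernel and the normal-reference bandwidth h = 1.06·σ·n^(-1/5); the function names and sample data are hypothetical. Datetime columns could be handled the same way after conversion to numeric timestamps, which is also an assumption here.

```python
import math
import random
import statistics


def fit_bandwidth(data):
    # Normal-reference rule of thumb for a Gaussian kernel (an assumed choice).
    return 1.06 * statistics.stdev(data) * len(data) ** (-1 / 5)


def kde_density(x, data, bandwidth):
    # Estimated density at x: the average of Gaussian bumps centred on each observation.
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm


def kde_sample(data, bandwidth, n, rng):
    # Sampling from a Gaussian KDE: pick an observation, then add Gaussian noise.
    return [rng.choice(data) + rng.gauss(0.0, bandwidth) for _ in range(n)]


ages = [23.0, 25.0, 31.0, 35.0, 38.0, 41.0, 47.0, 52.0]
h = fit_bandwidth(ages)
rng = random.Random(0)  # seeded so repeated runs give the same samples
synthetic = kde_sample(ages, h, n=5, rng=rng)
print(round(h, 2), [round(v, 1) for v in synthetic])
```

Because the sampled values are observations plus continuous noise, the output is not restricted to values seen in the original data, which is what modelling "in a continuous manner" means in practice.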

Histogram

Used to model discrete or categorical variables of any data type.

Python

from synthesized.model.models import Histogram

hist_model = Histogram(
    meta=df_meta["colA"],
)

Properties

  • meta: The meta of the column that is being modelled.

  • probabilities (optional): Probability distribution of categories. Empty dict until fit is called.

YAML

histogram:
  - name: "colA"

Properties

  • name: The name of the meta that is being modelled.

  • probabilities (optional): Probability distribution of categories. Empty dict until fit is called.
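
The probabilities property can be pictured as a mapping from each category to its relative frequency. The sketch below uses only the standard library and hypothetical helper names (fit_probabilities, sample_categories); it illustrates the general idea of a histogram model and makes no claim about how the SDK computes or stores these values.

```python
import random
from collections import Counter


def fit_probabilities(values):
    # Relative frequency of each category in the observed data.
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def sample_categories(probabilities, n, rng):
    # Draw n categories with replacement, weighted by their fitted probability.
    categories = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(categories, weights=weights, k=n)


colours = ["red", "blue", "red", "green", "red", "blue", "red", "red"]
probabilities = fit_probabilities(colours)
print(probabilities)  # {'red': 0.625, 'blue': 0.25, 'green': 0.125}

rng = random.Random(0)  # seeded so repeated runs give the same samples
print(sample_categories(probabilities, n=10, rng=rng))
```

Before fitting there is no data to count, which is consistent with the property being an empty dict until fit is called.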