Metas

Metas provide information about the data type of each column of a table as well as what that column represents.

Overriding Default Metas

from synthesized import MetaExtractor

df_meta = MetaExtractor.extract(df)

Whilst the SDK can automatically infer the meta information of each column, there are circumstances where the inferred defaults must be overridden. In that case, pass the replacement metas to MetaExtractor.extract through type_overrides:

from synthesized import MetaExtractor
from synthesized.metadata.value import Float

# Describe "float_col" as a Float instead of relying on the inferred type
float_meta = Float(name="float_col")
df_meta = MetaExtractor.extract(df, type_overrides=[float_meta])

The usage of each Meta implementation in the SDK is listed below:

String

  • Python

from synthesized.metadata.value import String

str_meta = String(
    name="colA"
)

  • YAML

string:
  - name: colA

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): The unique list of strings for the column. If not provided, the categories will be inferred from the data.
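As a rough illustration of what "inferred from the data" means for the optional properties above, the sketch below computes num_rows, nan_freq and categories for a single column in plain Python. The helper names is_missing and describe_column are hypothetical and are not part of the SDK; how the SDK performs its own inference is not described here.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def describe_column(values: List[Any]) -> Dict[str, Any]:
    """Compute num_rows, nan_freq and categories for one column (illustrative only)."""
    num_rows = len(values)
    present = [v for v in values if not is_missing(v)]
    # Fraction of rows that are missing; 0.0 for an empty column
    nan_freq = (num_rows - len(present)) / num_rows if num_rows else 0.0
    # Unique non-missing values, sorted for a stable order
    categories = sorted(set(present))
    return {"num_rows": num_rows, "nan_freq": nan_freq, "categories": categories}


summary = describe_column(["a", "b", None, "a"])
print(summary)
```

Supplying any of these properties explicitly would replace the corresponding inferred value.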

Bool

  • Python

from synthesized.metadata.value import Bool

bool_meta = Bool(
    name="colA"
)

  • YAML

bool:
  - name: colA

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): A list containing one or both of [True, False]. If not provided, the categories will be inferred from the data.

Datetime

  • Python

from synthesized.metadata.value import Datetime

dt_meta = Datetime(
    name="colA",
    date_format="%Y-%m-%d"
)

  • YAML

date_time:
  - name: colA
    date_format: "%Y-%m-%d"

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): Unique list of the dates in the column. If not provided, the categories will be inferred from the data.

  • date_format (optional): String representation of date format. If not provided, the date format will be inferred from the data.
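The date_format string in the example uses the same directive syntax as Python's standard datetime module. The sketch below, which uses only the standard library and assumes the SDK interprets the format in the same way, shows what "%Y-%m-%d" accepts and rejects. The sample date strings are made up for illustration.

```python
from datetime import datetime

date_format = "%Y-%m-%d"

# A string that matches the format parses into a datetime
parsed = datetime.strptime("2021-03-15", date_format)

# Formatting with the same string reproduces the original text
round_trip = parsed.strftime(date_format)

# A string in a different layout raises ValueError
try:
    datetime.strptime("15/03/2021", date_format)
    matches = True
except ValueError:
    matches = False

print(parsed, round_trip, matches)
```

Stating the format explicitly can be useful when the column mixes layouts that are ambiguous, such as day-first and month-first dates.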

Timedelta

  • Python

from synthesized.metadata.value import Timedelta

td_meta = Timedelta(
    name="colA"
)

  • YAML

time_delta:
  - name: "colA"

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): Unique list of timedeltas in the data. If not provided, the categories will be inferred from the data.

Integer

  • Python

from synthesized.metadata.value import Integer

int_meta = Integer(
    name="colA"
)

  • YAML

integer:
  - name: colA

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): Unique list of the integers in the data. If not provided, the categories will be inferred from the data.

Float

  • Python

from synthesized.metadata.value import Float

float_meta = Float(
    name="colA",
)

  • YAML

float:
  - name: colA

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): Unique list of the floats in the data. If not provided, the categories will be inferred from the data.

IntegerBool

  • Python

from synthesized.metadata.value import IntegerBool

int_bool_meta = IntegerBool(
    name="colA",
)

  • YAML

integer_bool:
  - name: colA

Properties

  • name: The name of the column that is described by the meta.

  • num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.

  • nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.

  • categories (optional): A list containing one or both of [0, 1]. If not provided, the categories will be inferred from the data.
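To make the distinction between Integer and IntegerBool concrete, the sketch below checks whether a column contains only the integers 0 and 1, which is the shape of data the categories property above describes. The function looks_like_integer_bool is a hypothetical helper written for this illustration; it is not an SDK function, and the SDK's own inference rule may differ.

```python
from typing import Any, Iterable


def looks_like_integer_bool(values: Iterable[Any]) -> bool:
    """Return True if every non-missing value is the integer 0 or 1."""
    present = [v for v in values if v is not None]
    # type(v) is int excludes Python bools, which belong to a Bool column instead
    return bool(present) and all(type(v) is int and v in (0, 1) for v in present)


print(looks_like_integer_bool([0, 1, 1, None]))
print(looks_like_integer_bool([0, 2, 1]))
```

A column that fails a check like this would be better described by Integer, or by Bool when it holds True and False.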