Metas
Metas provide information about the data type of each column of a table as well as what that column represents.
Overriding Default Metas
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(df)
Whilst the SDK can automatically infer the meta information of each column, there are circumstances where the default values must be overridden.
from synthesized import MetaExtractor
from synthesized.metadata.value import Float
float_meta = Float(name="float_col")
df_meta = MetaExtractor.extract(df, type_overrides=[float_meta])
The usage of each Meta implementation in the SDK is listed below:
String
Python
from synthesized.metadata.value import String
str_meta = String(
    name="colA"
)
YAML
string:
  - name: colA
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): The unique list of strings for the column. If not provided, the categories will be inferred from the data.
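To make the three optional properties concrete, here is a minimal sketch of how they could be derived from a column of values. It is not the SDK's inference code: `infer_properties` is a hypothetical helper, and it assumes missing values are represented as `None`.

```python
from typing import Any, Dict, List, Optional


def infer_properties(column: List[Optional[Any]]) -> Dict[str, Any]:
    """Hypothetical helper (not part of the SDK): derive num_rows, nan_freq, categories."""
    num_rows = len(column)
    # Assumption: a missing entry is represented as None.
    present = [value for value in column if value is not None]
    # Fraction of missing entries; defined here as 0.0 for an empty column.
    nan_freq = (num_rows - len(present)) / num_rows if num_rows else 0.0
    # Unique non-missing values, kept in first-seen order.
    categories = list(dict.fromkeys(present))
    return {"num_rows": num_rows, "nan_freq": nan_freq, "categories": categories}


props = infer_properties(["a", "b", None, "a"])
```

For this four-row column with one missing entry, `props` holds `num_rows` of 4, `nan_freq` of 0.25 and `categories` of `["a", "b"]`.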
Bool
Python
from synthesized.metadata.value import Bool
bool_meta = Bool(
    name="colA"
)
YAML
bool:
  - name: colA
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): A list containing one or both of True and False. If not provided, the categories will be inferred from the data.
Datetime
Python
from synthesized.metadata.value import Datetime
dt_meta = Datetime(
    name="colA",
    date_format="%Y-%m-%d"
)
YAML
date_time:
  - name: colA
    date_format: "%Y-%m-%d"
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): Unique list of the dates in the column. If not provided, the categories will be inferred from the data.
- date_format (optional): String representation of the date format. If not provided, the date format will be inferred from the data.
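The format string in the example, "%Y-%m-%d", looks like a Python strftime/strptime pattern. Assuming the SDK follows those directives, the sketch below uses only the standard library to show what such a pattern accepts; `matches_format` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime

date_format = "%Y-%m-%d"  # four-digit year, two-digit month, two-digit day

# Parse a string with the format, then render it back with the same format.
parsed = datetime.strptime("2021-03-15", date_format)
round_trip = parsed.strftime(date_format)


def matches_format(value: str, fmt: str) -> bool:
    """Hypothetical helper: True if `value` can be parsed with `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


iso_ok = matches_format("2021-03-15", date_format)
slash_ok = matches_format("15/03/2021", date_format)
```

A check like this can help when choosing a `date_format` by hand: a value written as day/month/year fails against "%Y-%m-%d" and would need a pattern such as "%d/%m/%Y".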
Timedelta
Python
from synthesized.metadata.value import Timedelta
td_meta = Timedelta(
    name="colA"
)
YAML
time_delta:
  - name: "colA"
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): Unique list of timedeltas in the data. If not provided, the categories will be inferred from the data.
Integer
Python
from synthesized.metadata.value import Integer
int_meta = Integer(
    name="colA"
)
YAML
integer:
  - name: colA
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): Unique list of the integers in the data. If not provided, the categories will be inferred from the data.
Float
Python
from synthesized.metadata.value import Float
float_meta = Float(
    name="colA",
)
YAML
float:
  - name: colA
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): Unique list of the floats in the data. If not provided, the categories will be inferred from the data.
IntegerBool
Python
from synthesized.metadata.value import IntegerBool
int_bool_meta = IntegerBool(
    name="colA",
)
YAML
integer_bool:
  - name: colA
Properties
- name: The name of the column that is described by the meta.
- num_rows (optional): The number of rows of data that the meta describes. If not provided, this will be inferred from the data.
- nan_freq (optional): The fraction of data that is missing or null. If not provided, this will be inferred from the data.
- categories (optional): A list containing one or both of 0 and 1. If not provided, the categories will be inferred from the data.
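Since the categories of an IntegerBool are limited to 0 and 1, it can be useful to confirm that a column qualifies before overriding its meta. The sketch below is one way to do that in plain Python. `is_integer_bool` is a hypothetical helper, not an SDK function, and it assumes missing values are represented as `None`.

```python
from typing import Any, List, Optional


def is_integer_bool(column: List[Optional[Any]]) -> bool:
    """Hypothetical check: every non-missing value is the integer 0 or 1."""
    # Assumption: a missing entry is represented as None and is ignored.
    present = [value for value in column if value is not None]
    # `type(value) is int` excludes Python bools, which would suit a Bool meta instead.
    return bool(present) and all(type(value) is int and value in (0, 1) for value in present)


flags_ok = is_integer_bool([0, 1, 1, None, 0])
counts_ok = is_integer_bool([0, 1, 2])
bools_ok = is_integer_bool([True, False])
```

Here `flags_ok` is True, while `counts_ok` and `bools_ok` are False: the second column contains a value outside 0 and 1, and the third holds Python booleans rather than integers.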