Rules

The classes in synthesized.metadata.rules allow the user to constrain the synthetic dataset, ensuring it conforms to pre-defined business logic or a custom scenario. They can be used with the ConditionalSampler to generate custom synthetic data:

from synthesized import ConditionalSampler, MetaExtractor, HighDimSynthesizer

df = ...
df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)

synth.df_meta.associations.append(...)  # (1)
sampler = ConditionalSampler(synthesizer=synth)
df_synth = sampler.synthesize(
    num_rows=...,
    expression_rules=[...],  # (2)
    generic_rules=[...]  # (3)
)
1 Strict associations between categorical columns can be enforced.
2 Expressions can be used to describe one column as a mathematical transformation of other columns.
3 Prescribing generic rules and relationships for one or more columns is also possible.

Associations

Associating Column Values

Association rules can only be declared between columns containing categorical variables.

The Synthesized SDK is designed to automatically detect and manage relationships between columns and values. When columns are related, the synthesized data will aim to follow that pattern and produce results that match the input data. As most data sets do not include every possible permutation of data, the SDK fuzzes data to allow other permutations to appear with lower likelihood. While this is sufficient for the vast majority of cases, some datasets contain columns with important relationships that can’t be broken.

Table 1. fruit-original.csv
Fruit         Color     Total
"Strawberry"  "Red"     372013
"Apple"       "Red"     10342
"Apple"       "Green"   39753
"Lime"        "Green"   87421
"Banana"      "Yellow"  632

In the above dataset, "Fruit" has an association with "Color". In other words, certain categories in "Fruit" only appear with certain categories in "Color".

The HighDimSynthesizer captures highly detailed dataset-wide information. As it attempts to generalize specific row-level information, a case such as "Yellow" always appearing with "Banana" isn’t strictly followed. A possible output of the synthesizer could be:

Table 2. fruit-synthetic.csv
Fruit         Color    Total
"Lime"        "Red"    67862
"Apple"       "Green"  36382
"Strawberry"  "Red"    401877

In this example, the HighDimSynthesizer has generated a row with a Red Lime, which is an unrealistic combination. If capturing strict column associations such as this is important, the synthesizer can be configured to do so by defining an Association rule.
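As a rough illustration of what "unrealistic combination" means here, the following sketch uses plain pandas (not the Synthesized SDK) to list synthetic rows whose (Fruit, Color) pair never occurs in the original table. The variable names are chosen for this example only.

```python
import pandas as pd

# Toy copies of the two tables shown above.
original = pd.DataFrame({
    "Fruit": ["Strawberry", "Apple", "Apple", "Lime", "Banana"],
    "Color": ["Red", "Red", "Green", "Green", "Yellow"],
    "Total": [372013, 10342, 39753, 87421, 632],
})
synthetic = pd.DataFrame({
    "Fruit": ["Lime", "Apple", "Strawberry"],
    "Color": ["Red", "Green", "Red"],
    "Total": [67862, 36382, 401877],
})

# Every (Fruit, Color) pair that occurs at least once in the original data.
allowed = original[["Fruit", "Color"]].drop_duplicates()

# A left merge with an indicator marks synthetic rows whose pair never occurred.
checked = synthetic.merge(allowed, on=["Fruit", "Color"], how="left", indicator=True)
unseen = checked.loc[checked["_merge"] == "left_only", ["Fruit", "Color"]]

print(unseen.to_string(index=False))
```

A check like this can be run on any synthesized output to decide whether a strict association is needed.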

from synthesized import HighDimSynthesizer, MetaExtractor
from synthesized.metadata.rules import Association

df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)

rule = Association(associations=["car_manufacturer", "car_model"])
rule.extract(df, df_meta=df_meta)
synth.df_meta.associations.append(rule)

synth.synthesize(10)

If you create the association class prior to creating the DataFrameMeta object, a convenient way to extract all the needed information and include it in the DataFrameMeta object is to use the associations argument of the MetaExtractor.extract method:

rule = Association(associations=..., nan_associations=..., allocated_memory=...)
df_meta = MetaExtractor.extract(df, associations=[rule])  # (1)
1 This will automatically extract the associations.

The association object must call the method .extract(df, df_meta) in order to learn which values in one column appear with which values in another column. If particular values of one column never coexist with particular values of another column, the association rule will force the Synthesizer never to output those values together.

The method .extract(df, df_meta) estimates the amount of memory needed by the association and raises a memory error if this exceeds the parameter allocated_memory, which should be a string comprising a number and one of the units 'b', 'kb', 'mb', 'gb', 'tb'.
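The format of such a memory string can be illustrated with a small parser. This is a sketch only: parse_memory and check_budget are hypothetical helpers, not SDK functions, and the binary multipliers (1 kb = 1024 b) are an assumption, since the SDK may use different factors.

```python
import re

# Assumed multipliers: binary units (1 kb = 1024 b). The SDK may differ.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_memory(text: str) -> int:
    """Convert a string such as '1gb' or '512 mb' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb|tb)\s*", text.lower())
    if match is None:
        raise ValueError(f"expected a number followed by a unit, got {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def check_budget(estimated_bytes: int, allocated_memory: str) -> None:
    """Raise MemoryError when the estimate exceeds the allowed budget."""
    if estimated_bytes > parse_memory(allocated_memory):
        raise MemoryError(
            f"association needs about {estimated_bytes} bytes, "
            f"limit is {allocated_memory}"
        )
```

For example, check_budget(2048, "1kb") raises MemoryError under these assumptions, while check_budget(512, "1kb") returns silently.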

There are some constraints on what rules you can define: the HighDimSynthesizer requires that every column appears in no more than one association object. In addition, for each association object, a column cannot appear in both the associations and nan_associations arguments.
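These two constraints can be checked before any rule is built. The sketch below works on plain dictionaries that stand in for Association objects; validate_rules is a hypothetical helper written for this example and is not part of the SDK.

```python
from collections import Counter


def validate_rules(rules):
    """Check the two constraints described above for a list of rule specs.

    Each rule is a plain dict with optional 'associations' and
    'nan_associations' lists; it stands in for an Association object.
    """
    problems = []
    usage = Counter()
    for index, rule in enumerate(rules):
        assoc = set(rule.get("associations", []))
        nan_assoc = set(rule.get("nan_associations", []))
        # Constraint 2: a column cannot be in both arguments of one rule.
        overlap = assoc & nan_assoc
        if overlap:
            problems.append(f"rule {index}: {sorted(overlap)} in both arguments")
        usage.update(assoc | nan_assoc)
    # Constraint 1: a column may appear in at most one rule.
    repeated = sorted(col for col, count in usage.items() if count > 1)
    if repeated:
        problems.append(f"columns used by more than one rule: {repeated}")
    return problems
```

An empty list means the rule set satisfies both constraints as described in the text.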

Missing Data

An empty or null value can represent either missing data or a valid option. By default, the Synthesized SDK follows the pandas approach that NA values are missing values. That is, there should be a value, but it is unknown. As such, the SDK assumes such values could be any other value and ignores them when creating associations.

Dependent Missing Data

In some cases, empty values are valid input. In addition, empty values are sometimes correlated with other columns, e.g., if one column specifies the number of children in a family, we would expect the names of children who don’t exist to be empty.

The Synthesized SDK handles this using Association objects that contain nan_associations. A nan_association links the nan-associated column with the associations, so that the SDK learns when the nan_association should be NA or not. For example:

rule = Association(associations=["NumberOfChildren"], nan_associations=["Child1Name", "Child2Name", ...], allocated_memory='1gb')
rule.extract(df, df_meta=synth.df_meta)

Importantly, a nan_association only learns when NA is valid; it does not associate other values. In the above example, particular names are not linked to how many children are present.
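The kind of pattern a nan_association is meant to capture can be inspected with plain pandas. In this sketch the family data is invented for illustration, and the calculation only shows the relationship in the data; it is not the SDK's internal method.

```python
import pandas as pd

# Hypothetical family data; None marks a child that does not exist.
families = pd.DataFrame({
    "NumberOfChildren": [0, 1, 2, 1, 0, 2],
    "Child1Name": [None, "Ana", "Ben", "Cleo", None, "Dev"],
    "Child2Name": [None, None, "Eli", None, None, "Fay"],
})

# For each child count, the share of rows in which each name column is NA.
na_share = (
    families[["Child1Name", "Child2Name"]]
    .isna()
    .groupby(families["NumberOfChildren"])
    .mean()
)
print(na_share)
```

A share of exactly 0 or 1 for every child count indicates that emptiness is fully determined by NumberOfChildren, while the names themselves remain unconstrained.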

Categorical Missing Data

In some cases, empty values are not only valid input, but present as part of an association.

Table 3. projects.csv

Name       Client  Team
"Alice"            "QA"
"Bob"      "Foo"   "QA"
"Charlie"  "Foo"   "QA"
"David"    "Bar"   "Engineering"
"Eva"      "Bar"
"Frank"    "Bar"   "Engineering"

In this example certain clients have certain teams. Some people are generalists and aren’t assigned to a particular team, and some people are assigned to a client but not a team.

If there are business reasons why these relationships need to be maintained exactly, use the NanAsValidCategory context manager. This feature allows you to declare empty values as valid results across the input, or for specific columns.

NanAsValidCategory is a context manager that wraps around your existing code calling the MetaExtractor and HighDimSynthesizer.

with NanAsValidCategory():
  df_meta = MetaExtractor.extract(df, associations=[Association(associations=["client", "team"])])
  synth = HighDimSynthesizer(df_meta)
  synth.learn(df)

By using the NanAsValidCategory context manager, you ensure that the associations inside it will include empty values as valid categories.
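The difference between treating NA as missing and treating it as a category can be seen by listing the (Client, Team) combinations under each interpretation. The sketch below uses plain pandas rather than the SDK, and the placement of the empty cells in Table 3 is an assumption inferred from the values ("QA" is taken to be Alice's team and "Bar" to be Eva's client).

```python
import pandas as pd

# Table 3 as a DataFrame; the position of the empty cells is an assumption.
projects = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Client": [None, "Foo", "Foo", "Bar", "Bar", "Bar"],
    "Team": ["QA", "QA", "QA", "Engineering", None, "Engineering"],
})

pairs = projects[["Client", "Team"]]

# NA treated as missing: rows with an empty cell contribute no combination.
pairs_na_missing = set(pairs.dropna().itertuples(index=False, name=None))

# NA treated as a category of its own: the empty cell becomes a label.
pairs_na_valid = set(pairs.fillna("<empty>").itertuples(index=False, name=None))

print(sorted(pairs_na_missing))
print(sorted(pairs_na_valid))
```

Under the second interpretation, "no client with QA" and "Bar with no team" become combinations in their own right, which is the behaviour the text attributes to NanAsValidCategory.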

If you only need certain associations to allow empty values, supply those associations to NanAsValidCategory.

associations=[Association(associations=["client", "team"])]
with NanAsValidCategory(associations):
  df_meta = MetaExtractor.extract(df, associations=associations)
  synth = HighDimSynthesizer(df_meta)
  synth.learn(df)

A nan_association cannot be used inside an unrestricted NanAsValidCategory. With NanAsValidCategory active, all NA values are treated as valid options, while a nan_association treats them as exceptions. To use them together, specify which associations should have NanAsValidCategory.

Columns and Values

When using rules, we sometimes want to refer to specific columns in a table or to a specific string or numerical value. To make this clear to the ConditionalSampler, we use the classes Column and Value.

from synthesized.metadata.rules import Column, Value
column = Column("A")  # (1)
value = Value(10)  # (2)
1 Refers to column A.
2 Refers to the value 10.

These can then be used in expression rules and generic rules.

Expressions

When it is known a priori that a field in a dataset is related to others through a mathematical transformation, this can be enforced with an Expression rule. This takes a string expression that can be parsed by pandas.eval:

from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.metadata.rules import Column, Expression

df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)
sampler = ConditionalSampler(synth)

column = Column("A")
rule = Expression(column=column, expr="a+b+c")
rule.set_meta(df_meta)

sampler.synthesize(num_rows=10, expression_rules=[rule])
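To see what such an expression string computes, it can be evaluated directly with pandas on a small frame. This sketch uses DataFrame.eval, which is built on pandas.eval, and involves no SDK classes; the data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# DataFrame.eval resolves the names a, b and c against the column labels.
df["A"] = df.eval("a + b + c")

print(df)
```

Testing an expression this way on the original data is a quick check that the string parses and refers to existing columns before it is passed to a rule.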

Generic

A GenericRule is a special type of rule that the ConditionalSampler enforces through conditional sampling.

As these rules are enforced by iterative conditional sampling, it may not be possible to generate the desired number of rows if the rules cannot be fulfilled or are satisfied by only a very small proportion of the original data. In this case, the sampler returns the data it was able to generate. Increasing the max_trials parameter may resolve this issue.
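The effect of a trial limit can be illustrated with a plain rejection-sampling loop. This is a simplification and not necessarily how ConditionalSampler works internally: sample_with_rule and generate are hypothetical names, and the random generator stands in for a trained synthesizer.

```python
import numpy as np
import pandas as pd


def sample_with_rule(generate, rule, num_rows, max_trials=10):
    """Collect rows that satisfy `rule`, giving up after `max_trials` batches."""
    kept = []
    collected = 0
    for _ in range(max_trials):
        if collected >= num_rows:
            break
        batch = generate()
        accepted = batch[rule(batch)]
        if len(accepted):
            kept.append(accepted)
            collected += len(accepted)
    if not kept:
        # Nothing satisfied the rule: return an empty frame with the right columns.
        return generate().iloc[0:0]
    return pd.concat(kept, ignore_index=True).head(num_rows)


rng = np.random.default_rng(0)


def generate():
    # Stand-in for a synthesizer: 100 rows with x drawn uniformly from 0..99.
    return pd.DataFrame({"x": rng.integers(0, 100, size=100)})


common = sample_with_rule(generate, lambda d: d["x"] < 50, num_rows=20)
impossible = sample_with_rule(generate, lambda d: d["x"] > 1000, num_rows=20)
print(len(common), len(impossible))
```

A rule that is often satisfied fills the request quickly, whereas a rule that can never be satisfied exhausts the trials and yields fewer rows than requested, which mirrors the behaviour described above.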

Equals

Equals enforces a field of the dataset to be strictly equal to a specified value. Unlike the Expression rule, Equals can refer to either numeric or categorical fields.

from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.metadata.rules import Column, Equals

df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)
sampler = ConditionalSampler(synth)

column_x = Column("x")
column_A = Column("A")
rule = Equals(column_x, column_A)
rule.set_meta(df_meta)

sampler.synthesize(num_rows=10, generic_rules=[rule])

IsIn

IsIn is similar to Equals, but specifies a list of allowed values.

from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.metadata.rules import Column, IsIn, Value

df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)
sampler = ConditionalSampler(synth)

column = Column("x")
values = [Value("A"), Value("B")]
rule = IsIn(column, values)
rule.set_meta(df_meta)

sampler.synthesize(num_rows=10, generic_rules=[rule])
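Whether an output satisfies a rule of this kind can be verified afterwards with plain pandas. In this sketch df_synth is a hand-written stand-in for a synthesized frame, not real SDK output.

```python
import pandas as pd

# Stand-in for a synthesized frame; the last row breaks the rule on purpose.
df_synth = pd.DataFrame({"x": ["A", "B", "A", "C"]})
allowed = ["A", "B"]

# Rows whose value of x is outside the allowed list.
violations = df_synth[~df_synth["x"].isin(allowed)]
print(violations)
```

An empty violations frame means every row respects the list of allowed values.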

ValueRange

ValueRange can be used to constrain synthesized data to a user-defined range, either to improve the quality of the synthetic data or to generate custom scenarios. The upper and lower bounds of the range can be numeric, e.g., "0 < x < 10".

from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.metadata.rules import Column, Value, ValueRange

df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)
sampler = ConditionalSampler(synth)

column = Column("x")
values = [Value(0), Value(10)]
rule = ValueRange(v1=column, v2=values)
rule.set_meta(df_meta)

sampler.synthesize(num_rows=10, generic_rules=[rule])

The bounds can also be defined by other fields of the dataset, e.g., "z < x < y":

from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.metadata.rules import Column, ValueRange

df_meta = MetaExtractor.extract(df)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)
sampler = ConditionalSampler(synth)

column_x = Column("x")
bounds = [Column("z"), Column("y")]
rule = ValueRange(v1=column_x, v2=bounds)
rule.set_meta(df_meta)

sampler.synthesize(num_rows=10, generic_rules=[rule])
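Both forms of range can be expressed as boolean masks in pandas, which is a convenient way to check a generated frame. The sketch assumes strict inequalities, following the "0 < x < 10" and "z < x < y" notation used in the text (whether the SDK treats the bounds as inclusive is not stated here), and df_synth is an invented stand-in for synthesized output.

```python
import pandas as pd

# Stand-in for a synthesized frame with a target column and two bound columns.
df_synth = pd.DataFrame({
    "x": [1, 5, 9, 12],
    "z": [0, 6, 2, 3],
    "y": [3, 8, 10, 20],
})

# Constant bounds: 0 < x < 10, strict on both sides as written in the text.
in_constant_range = (df_synth["x"] > 0) & (df_synth["x"] < 10)

# Column bounds: z < x < y, compared row by row.
in_column_range = (df_synth["z"] < df_synth["x"]) & (df_synth["x"] < df_synth["y"])

print(in_constant_range.tolist())
print(in_column_range.tolist())
```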