Rules
The classes in synthesized.metadata.rules allow the user to constrain the
synthetic dataset, ensuring it conforms to pre-defined business logic or a
custom scenario. They can be used with the ConditionalSampler to generate
custom synthetic data:
from synthesized import ConditionalSampler, MetaExtractor, HighDimSynthesizer
df_meta = MetaExtractor.extract(df, associations=...)
synth = HighDimSynthesizer(df_meta=df_meta)
synth.learn(df)
sampler = ConditionalSampler(synthesizer=synth)
df_synth = sampler.synthesize(num_rows=..., generic_rules=..., expression_rules=...)
Associations
Association rules can only be declared between columns containing categorical variables.
Often, datasets may contain two columns with an important well-defined relationship. For example:
Make | Model | Total |
---|---|---|
"Ford" | "Fiesta" | 372013 |
"BMW" | "M3" | 10342 |
"BMW" | "X5" | 39753 |
"Volkswagen" | "Polo" | 87421 |
"Ferrari" | "California" | 632 |
In the above dataset, "Make" has a one-to-many association with "Model". In
other words, certain categories in "Model" only appear with certain categories
in "Make". The HighDimSynthesizer captures highly detailed dataset-wide
information, but because it also attempts to generalize specific row-level
information, a case such as "Polo" always appearing with "Volkswagen" isn't
strictly followed. A possible output of the synthesizer could be:
Make | Model | Total |
---|---|---|
"BMW" | "X6" | 36382 |
"Ford" | "Fiesta" | 401877 |
"BMW" | "Polo" | 67862 |
In this example, the HighDimSynthesizer has generated a row with a "BMW
Polo", which is an unrealistic combination. If capturing strict column
associations such as this is important, the synthesizer can be configured to do
so by defining an Association rule:
from synthesized.metadata.rules import Association
rule = Association(associations=["car_manufacturer", "car_model"])
rule.extract(df, df_meta=synth.df_meta)
synth.df_meta.associations.append(rule)
In addition, empty values are sometimes correlated. For example, if one column specifies the number of children in a family, we would expect the columns holding the names of children who don't exist to be empty:
rule = Association(associations=["NumberOfChildren"], nan_associations=["Child1Name", "Child2Name", ...])
rule.extract(df, df_meta=synth.df_meta)
The Association class contains a method, Association.extract(df, df_meta),
that automatically learns which values in one column appear with which values in another column.
If particular values of one column never coexist with particular values of another column, the rule
forces the Synthesizer to never output those values together.
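To make the idea of learning co-occurring values concrete, here is a minimal sketch in plain pandas. It is not the library's implementation of Association.extract; it only illustrates the concept of recording which category pairs appear together and flagging rows that break the pattern. The data and the names allowed_pairs, df_synth and violations are illustrative assumptions.

```python
import pandas as pd

# Toy training data mirroring the Make/Model example (illustrative only).
df = pd.DataFrame({
    "Make": ["Ford", "BMW", "BMW", "Volkswagen", "Ferrari"],
    "Model": ["Fiesta", "M3", "X5", "Polo", "California"],
})

# Every (Make, Model) pair that occurs at least once in the training data.
allowed_pairs = set(zip(df["Make"], df["Model"]))

# A hypothetical unconstrained synthetic sample that contains a "BMW Polo".
df_synth = pd.DataFrame({
    "Make": ["BMW", "Ford", "BMW"],
    "Model": ["X5", "Fiesta", "Polo"],
})

# Flag rows whose combination never appeared in the training data.
is_allowed = pd.Series(
    [pair in allowed_pairs for pair in zip(df_synth["Make"], df_synth["Model"])],
    index=df_synth.index,
)
violations = df_synth[~is_allowed]
print(violations)
```

An association rule goes further than this check: rather than filtering afterwards, it prevents such combinations from being generated in the first place.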
If you create the Association object before creating the DataFrameMeta object, a convenient way to
extract all the needed information and include it in the DataFrameMeta object is to use the associations
argument of the MetaExtractor.extract method:
rule = Association(associations=..., nan_associations=...)
df_meta = MetaExtractor.extract(df, associations=[rule]) # this will automatically extract the associations
There are some constraints on what rules you can define: the Synthesizer
only allows a column to appear in one association, and a column cannot appear in
both the …
Columns and Values
When using rules, we sometimes want to refer to specific columns in a table or to a specific string or numerical value.
To make this clear to the ConditionalSampler, we use the classes Column and Value.
from synthesized.metadata.rules import Column, Value
column = Column("A") # refers to column A
value = Value(10) # refers to the value 10
These can then be used in expression rules and generic rules.
Expressions
When it is known a priori that a field in a dataset is related to others
through a mathematical transformation, this can be enforced with an
Expression rule. This takes a string expression that can be parsed by
pandas.eval:
from synthesized.metadata.rules import Expression, Column
column = Column("A")
rule = Expression(column=column, expr="a+b+c")
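As an illustration of what such a string expression computes, the sketch below evaluates "a+b+c" against a toy DataFrame using pandas' own DataFrame.eval. It does not use the Expression class itself; the column names and values are assumptions chosen to match the snippet above.

```python
import pandas as pd

# Toy data with the three source columns referenced by the expression.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [10, 20, 30],
    "c": [100, 200, 300],
})

# Evaluate the string expression row by row and store it in column "A",
# the column the rule above is attached to.
df["A"] = df.eval("a+b+c")
print(df)
```

Because the expression is resolved against column names, every name used in the string needs to exist as a column in the data.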
Generic
A GenericRule is a special type of rule that can be enforced by conditional
sampling with the ConditionalSampler.
As these rules are enforced by iterative conditional sampling, it may not be
possible to generate the full desired number of rows if the rules cannot be
fulfilled, or if they represent a very small proportion of the original data. In this
case, the sampler returns the data it was able to generate. Increasing the …
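The behaviour described above, returning fewer rows when a rule is rarely or never satisfied, can be sketched with a simple sample-and-filter loop. This is a conceptual illustration under stated assumptions, not the ConditionalSampler's actual algorithm: sample_until_satisfied, draw_batch, rule and max_iterations are all hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def sample_until_satisfied(draw_batch, rule, num_rows, max_iterations=10):
    """Collect rows that satisfy `rule` by repeatedly sampling and filtering.

    Stops once `num_rows` valid rows are collected or after `max_iterations`
    batches, whichever comes first, and returns whatever was gathered.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    kept = []
    n_kept = 0
    for _ in range(max_iterations):
        batch = draw_batch()
        valid = batch[rule(batch)]  # keep only rows where the rule holds
        kept.append(valid)
        n_kept += len(valid)
        if n_kept >= num_rows:
            break
    return pd.concat(kept, ignore_index=True).head(num_rows)


rng = np.random.default_rng(0)


def draw_batch(size=1000):
    # Stand-in for an unconstrained synthesizer: standard normal values.
    return pd.DataFrame({"x": rng.normal(loc=0.0, scale=1.0, size=size)})


# A rule that holds for roughly half of the samples is filled easily.
common = sample_until_satisfied(draw_batch, lambda d: d["x"] > 0, num_rows=500)

# A rule that effectively never holds yields fewer rows than requested.
impossible = sample_until_satisfied(draw_batch, lambda d: d["x"] > 1e9, num_rows=500)

print(len(common), len(impossible))
```

The cap on iterations is what keeps the loop from running forever on an unsatisfiable rule, at the cost of sometimes returning a short result.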
ValueRange
ValueRange can be used to constrain synthesized data to a user-defined range,
either to improve the quality of the synthetic data or to generate custom
scenarios. The upper and lower bounds of the range can be numeric, e.g., '0 < x < 10':
from synthesized.metadata.rules import ValueRange, Column, Value
column = Column("x")
values = [Value(0), Value(10)]
rule = ValueRange(v1=column, v2=values)
Alternatively, the bounds can be defined by other fields of the dataset, e.g., 'z < x < y':
from synthesized.metadata.rules import ValueRange, Column
column_x = Column("x")
bounds = [Column("z"), Column("y")]
rule = ValueRange(v1=column_x, v2=bounds)
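To check whether a generated dataset respects such ranges, the two conditions can be expressed directly in pandas. This sketch is independent of the ValueRange class; it assumes strict inequalities, following the '0 < x < 10' and 'z < x < y' notation used above, and the data is made up for illustration.

```python
import pandas as pd

# Toy data: x is the constrained column, z and y act as per-row bounds.
df = pd.DataFrame({
    "z": [0, 5, 2],
    "x": [3, 4, 7],
    "y": [10, 9, 6],
})

# Fixed numeric bounds: 0 < x < 10.
in_fixed_range = (df["x"] > 0) & (df["x"] < 10)

# Column-defined bounds: z < x < y, compared row by row.
in_column_range = (df["z"] < df["x"]) & (df["x"] < df["y"])

print(df.assign(fixed=in_fixed_range, by_columns=in_column_range))
```

Only the first row satisfies the column-defined range here, while all three rows fall within the fixed numeric bounds.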