Rules#
The classes in synthesized.common.rules allow the user to constrain the synthetic dataset so that it conforms to
pre-defined business logic or a custom scenario. They can be used with the ConditionalSampler
to generate custom synthetic data:
In [1]: sampler = ConditionalSampler(synthesizer=...)
In [2]: df_synth = sampler.synthesize(num_rows=..., association_rules=..., generic_rules=..., expression_rules=...)
Associations#
Often, datasets may contain two columns with an important well-defined relationship. For example:
Make | Model | Total |
---|---|---|
Ford | Fiesta | 372013 |
BMW | M3 | 10342 |
BMW | X5 | 39753 |
Volkswagen | Polo | 87421 |
Ferrari | California | 632 |
In the above dataset, “Make” has a one-to-many association with “Model”. In other words, certain categories in “Model”
only appear with certain categories in “Make”. The HighDimSynthesizer captures highly detailed dataset-wide
information, but because it also attempts to generalize specific row-level information, a constraint such as
“Polo” always appearing with “Volkswagen” is not strictly enforced. A possible output of the synthesizer could be:
Make | Model | Total |
---|---|---|
BMW | X6 | 36382 |
Ford | Fiesta | 401877 |
BMW | Polo | 67862 |
In this example, the HighDimSynthesizer has generated a row with a “BMW Polo”, which is an
unrealistic combination. If capturing strict column associations such as this is important, the synthesizer can be
configured to do so by defining an Association rule:
In [3]: rule = Association.detect_association(df, df_meta, associations=["car_manufacturer", "car_model"])
In addition, empty values are sometimes correlated. For example, if one column specifies the number of children in a family, we would expect the names of children who don’t exist to be empty:
In [4]: rule = Association.detect_association(df, df_meta, associations=["NumberOfChildren"], nan_associations=["Child1Name", "Child2Name", ...])
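Before relying on an empty-value association, it can help to check whether the original data actually satisfies it. The following is a minimal pandas sketch and not part of the synthesized API; the sample data and the consistency rule used here (a row never lists more child names than it has children) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): the last row is deliberately inconsistent.
df = pd.DataFrame({
    "NumberOfChildren": [0, 1, 2, 0],
    "Child1Name": [np.nan, "Ana", "Ben", "Dee"],
    "Child2Name": [np.nan, np.nan, "Cy", np.nan],
})

child_cols = ["Child1Name", "Child2Name"]

# Count the non-empty child-name cells in each row.
names_present = df[child_cols].notna().sum(axis=1)

# A row is consistent if it lists no more names than it has children.
consistent = names_present <= df["NumberOfChildren"]
violations = df[~consistent]
```

Here `violations` holds the rows whose name columns are filled in even though the child count says they should be empty.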
The Association class provides a class method, detect_association(),
that automatically detects these rules between the columns:
if some category of one column never appears with a category of another, the Synthesizer can be forced never to output those values together.
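The idea of detecting which category pairs co-occur can be illustrated with plain pandas. This is only a sketch of the concept, not how detect_association() is implemented: the `allowed` table, the `violates_association` helper, and the `df_synth` sample are all hypothetical names and data introduced for this example.

```python
import pandas as pd

# Original data, taken from the Make/Model example table.
df = pd.DataFrame({
    "Make": ["Ford", "BMW", "BMW", "Volkswagen", "Ferrari"],
    "Model": ["Fiesta", "M3", "X5", "Polo", "California"],
})

# Boolean co-occurrence table: True where a (Make, Model) pair was observed.
allowed = pd.crosstab(df["Make"], df["Model"]) > 0


def violates_association(row: pd.Series) -> bool:
    """Return True if the row's (Make, Model) pair never occurs in df."""
    make, model = row["Make"], row["Model"]
    if make not in allowed.index or model not in allowed.columns:
        return True
    return not bool(allowed.loc[make, model])


# Hypothetical synthetic output containing one unrealistic combination.
df_synth = pd.DataFrame({
    "Make": ["BMW", "Ford", "BMW"],
    "Model": ["X5", "Fiesta", "Polo"],
})

# Rows whose combination was never seen in the original data.
bad_rows = df_synth[df_synth.apply(violates_association, axis=1)]
```

In this sketch `bad_rows` contains only the “BMW Polo” row, which is the kind of output an Association rule is meant to prevent.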
However, if a specific rule is required that isn’t present in the data, the Association can be initialized directly:
In [5]: rule = Association(binding_mask=binding_mask, associations=..., nan_association=...)
Here the binding mask specifies the possible outputs of the Synthesizer. Constructing it by hand is not currently user-friendly, as there has been little demand for this use case.
There are some constraints on the rules you can define: the Synthesizer only allows a column to appear in one association,
and a column cannot appear in both the association and nan_association arguments.
Some of these constraints may be relaxed in the future.
Columns and Values#
When using rules, we sometimes want to refer to specific columns in a table or to a specific string or numerical value.
To make this clear to the ConditionalSampler, we use the classes Column and Value.
In [6]: column = Column("A") # refers to column A
In [7]: value = Value(10) # refers to the value 10
These can then be used in expression rules and generic rules.
Expressions#
When it is known a priori that a field in a dataset is related to others through a mathematical transformation, this can
be enforced with an Expression rule. This takes a string expression that can be parsed by pandas.eval:
In [8]: column = Column("A")
In [9]: rule = Expression(column=column, expr="a+b+c")
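To see what such a string expression evaluates to, the same string can be passed to pandas directly. The sketch below uses DataFrame.eval on a small made-up frame; it illustrates how pandas resolves column names in the expression and makes no claim about how the Expression rule applies it internally.

```python
import pandas as pd

# Made-up input columns referenced by the expression.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

# DataFrame.eval resolves the names a, b and c against the frame's columns.
df["A"] = df.eval("a+b+c")
```

Column names are case-sensitive, so the derived column "A" is distinct from the input column "a".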
Generic#
A GenericRule is a special type of rule that is enforced through conditional sampling by the ConditionalSampler.
Warning
As these rules are enforced by iterative conditional sampling, it may not be possible to generate the desired
number of rows if the rules cannot be fulfilled, or if the matching rows represent a very small proportion of the
original data. In this case, the sampler returns the data it was able to generate. Increasing the max_trials
parameter may resolve this issue.
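The behaviour described in this warning can be illustrated with a generic rejection-sampling loop. This is a sketch under stated assumptions, not the ConditionalSampler implementation: `sample_with_rule`, `draw`, and `batch_size` are hypothetical, and the way `max_trials` bounds the loop here is only one plausible reading of that parameter.

```python
import random


def sample_with_rule(draw, rule, num_rows, max_trials=50, batch_size=100):
    """Collect values that satisfy `rule` by repeated sampling.

    May return fewer than `num_rows` values if the rule is rarely
    (or never) satisfied within `max_trials` batches.
    """
    accepted = []
    for _ in range(max_trials):
        batch = [draw() for _ in range(batch_size)]
        accepted.extend(value for value in batch if rule(value))
        if len(accepted) >= num_rows:
            break
    return accepted[:num_rows]


rng = random.Random(0)


def draw():
    # Stand-in for drawing one synthetic value from a generator.
    return rng.uniform(0, 100)


# A rule satisfied by about half of the draws fills the request.
common = sample_with_rule(draw, lambda x: x < 50, num_rows=200)

# A rule that can never be satisfied returns whatever was collected: nothing.
rare = sample_with_rule(draw, lambda x: x < 0, num_rows=200, max_trials=5)
```

Raising `max_trials` in this sketch gives a rare rule more batches in which to find matching values, at the cost of more sampling work.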
ValueRange#
ValueRange can be used to constrain synthesized data to a user-defined range, either to improve the quality of the
synthetic data or to generate custom scenarios. The upper and lower bounds of the range can be numeric, e.g. 0 < x < 10:
In [10]: column = Column("x")
In [11]: values = [Value(0), Value(10)]
In [12]: rule = ValueRange(v1=column, v2=values)
or they can be defined by other fields of the dataset, e.g. z < x < y:
In [13]: column_x = Column("x")
In [14]: bounds = [Column("z"), Column("y")]
In [15]: rule = ValueRange(v1=column_x, v2=bounds)
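The two kinds of bounds correspond to simple row filters, which can be handy for checking that generated data respects a range. The pandas sketch below uses made-up data and assumes strict inequalities, as written in the examples above; whether ValueRange itself treats its bounds as inclusive is not stated here.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "x": [5, 12, 3, 7],
    "y": [9, 20, 4, 6],
    "z": [1, 10, 3, 2],
})

# Numeric bounds: 0 < x < 10.
in_numeric_range = df[(df["x"] > 0) & (df["x"] < 10)]

# Bounds taken from other columns, row by row: z < x < y.
in_column_range = df[(df["z"] < df["x"]) & (df["x"] < df["y"])]
```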
Equals#
Equals enforces a field of the dataset to be strictly equal to a specified value, either numeric or categorical:
In [16]: rule = ValueEquals(name="x", value='A')
IsIn#
IsIn is similar to Equals, but specifies a list of allowed values:
In [17]: rule = ValueEquals(name="x", values=['A', 'B'])
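The difference between the two constraints can be shown with equivalent pandas filters. This is a sketch on made-up data that mirrors the semantics described above; it does not use the rule classes themselves.

```python
import pandas as pd

# Made-up categorical column.
df = pd.DataFrame({"x": ["A", "B", "C", "A"]})

# Equality with a single value.
equals_a = df[df["x"] == "A"]

# Membership in a list of allowed values.
is_in_ab = df[df["x"].isin(["A", "B"])]
```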