YAML Configuration
YAML (YAML Ain’t Markup Language) is a human-readable data-serialization language. It can be used to configure the SDK
from the command line with a user-defined configuration by passing the -c option:
    synthesize -c config.yaml raw_input.csv
where config.yaml is the configuration file and raw_input.csv is the input data. The YAML configuration file can be used
to specify the following options and functionality: privacy masking, meta, annotations, model, HighDimSynthesizer config,
learn, synthesis, rebalancing and rules.
If a YAML configuration is specified during command line synthesis, it is automatically validated against a default schema, ensuring that the appropriate keywords and value dtypes have been provided.
Privacy Masking
Privacy masks are specified using the masking property. To apply a mask to a given column or set of columns, the
YAML key of the mask is given as a key under masking. The value associated with this key is a list of dictionaries,
each specifying the name of the column to be masked and any additional arguments required by the relevant mask.
The values of these additional arguments should be given as strings. The masks available in command line synthesis
and their associated YAML keys are given in the table below:
Mask | YAML key |
---|---|
|
|
|
|
|
|
In the example below, the NanMask is applied to the column column_to_nan, while the RoundingMask is applied to
column_to_round_0 to create three bins and to column_to_round_1 to bin the data into a default number of bins.
Finally, the FormatPreservingMask is applied to the column string_column with a regex pattern given as the value of
the pattern key. While the bins property of the RoundingMask is optional, pattern must be specified for each column
masked using the FormatPreservingMask.
    masking:
      nan:
        - name: column_to_nan
      rounding:
        - name: column_to_round_0
          bins: '3'
        - name: column_to_round_1
          bins: null
      format_preserving:
        - name: string_column
          pattern: '\d{3}'
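To get a feel for what the pattern in the example describes, the sketch below uses Python's standard re module to check which strings fully match \d{3}. It assumes the pattern follows ordinary regular-expression syntax. It does not call the SDK or reproduce how the FormatPreservingMask generates values, and the helper name matches_pattern is hypothetical.

```python
import re

# The same pattern string as in the YAML example.
PATTERN = r"\d{3}"


def matches_pattern(value: str, pattern: str = PATTERN) -> bool:
    """Return True if the whole string conforms to the pattern.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    # Exactly three digits match; shorter, longer or non-digit strings do not.
    for candidate in ["123", "007", "12", "1234", "abc"]:
        print(candidate, matches_pattern(candidate))
```

A check like this can be useful for confirming that a pattern describes the intended format before it is placed in the configuration file.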
Meta
The meta keyword can be used to control and override the inferred data types (internally referred to as "metas") used
during training of the model and subsequent synthesis. Using a YAML configuration in command line synthesis, it is possible
to specify the meta of any column in the input dataset. The table below details the possible meta types and their associated
keys that can be used in the YAML configuration:
Meta | YAML key |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To specify the meta of a particular column or columns, the desired meta type is given as a key whose value is a list
of the columns to be interpreted using this meta type. In the example below, the columns float_column_0 and
float_column_1 are given meta type Float, int_column_0 and int_column_1 are interpreted as meta type Integer, and
string_column is listed under the string key.
    meta:
      float:
        - float_column_0
        - float_column_1
      integer:
        - int_column_0
        - int_column_1
      string:
        - string_column
For more information on metas and overriding the default behaviour of the SDK see Meta Overrides.
Annotations
Entity Annotation can be used to group together a set of columns so that they are treated as a single entity
during training of the model and subsequent synthesis. To annotate a set of columns in this way, the annotations
keyword is used. The desired annotation type is given as a key whose value is a list of dictionaries, one per entity.
Each dictionary consists of a name property, the name assigned to the entity, and a labels property, which details
the columns that are grouped to form the entity. Refer to Entity Annotation for details regarding the labels for each
specific annotation type. Note that each entity should have a unique name.
The table below details the possible annotations that can be used in command line synthesis and their associated keywords:
Annotation | YAML key |
---|---|
|
|
|
|
|
|
|
|
The example below demonstrates how columns within a dataset may be treated as instances of a Person annotation. The
columns first_name_0 and last_name_0 specify the first and last name, respectively, of the entity named person_0,
while the columns first_name_1 and last_name_1 are used for person_1.
    annotations:
      person:
        - name: person_0
          labels:
            firstname: first_name_0
            lastname: last_name_0
        - name: person_1
          labels:
            firstname: first_name_1
            lastname: last_name_1
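Since each entity should have a unique name, it can help to check a configuration for repeated names before running synthesis. The sketch below writes the annotations section as the plain Python structure a YAML parser would typically produce and looks for duplicates. It assumes names must be unique across all annotation types, which the text above does not spell out. The helper duplicate_entity_names is hypothetical and is not an SDK function.

```python
from collections import Counter

# The annotations section of the example, written as the nested
# dict/list structure a YAML parser would typically produce.
annotations = {
    "person": [
        {
            "name": "person_0",
            "labels": {"firstname": "first_name_0", "lastname": "last_name_0"},
        },
        {
            "name": "person_1",
            "labels": {"firstname": "first_name_1", "lastname": "last_name_1"},
        },
    ]
}


def duplicate_entity_names(annotations: dict) -> list:
    """Return the entity names that are used more than once, sorted.

    Hypothetical helper; assumes uniqueness is required across all
    annotation types, not only within one type.
    """
    counts = Counter(
        entity["name"]
        for entities in annotations.values()
        for entity in entities
    )
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # An empty list means every entity name is unique.
    print(duplicate_entity_names(annotations))
```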
Model
The model keyword can be used to override the default method the HighDimSynthesizer uses to model any column in an
input dataset. To use a particular model on a column or set of columns, the model type is given as a key whose value
is a list of dictionaries specifying the columns to be modelled. The column is specified as the value of the name
key, while any optional arguments accepted when creating a given model can be specified as additional key-value pairs.
The table below details the possible model types and their associated YAML keys:
Model | YAML key |
---|---|
|
|
|
|
|
|
In the example below, the column enumeration_column is modelled using Enumeration with a starting value of 200 and
a step of 1 between values. A set of categorical columns is modelled using Histogram and a set of continuous columns
is modelled with KernelDensityEstimate.
    model:
      enumeration:
        - name: enumeration_column
          start: 200
          step: 1
      histogram:
        - name: categorical_column_0
        - name: categorical_column_1
        - name: categorical_column_2
      kernel_density_estimate:
        - name: continuous_column_0
        - name: continuous_column_1
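As a rough illustration of what start and step mean for the enumeration entry, the sketch below generates an arithmetic sequence from those two values. This assumes that an enumerated column takes the values start, start + step, start + 2 * step and so on. That is an interpretation of the parameter names rather than a description of the SDK's internals, and the function enumeration_values is hypothetical.

```python
from itertools import count, islice


def enumeration_values(start: int, step: int, num_rows: int) -> list:
    """Return the first num_rows values of the sequence start, start + step, ...

    Hypothetical helper; it only illustrates the assumed meaning of the
    start and step arguments and does not use the SDK.
    """
    return list(islice(count(start, step), num_rows))


if __name__ == "__main__":
    # With start: 200 and step: 1, as in the YAML example.
    print(enumeration_values(200, 1, 5))  # [200, 201, 202, 203, 204]
```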
For more information on models and overriding the default behaviour of the SDK see Model Overrides.
HighDimSynthesizer config
Any values set using the HighDimConfig can also be tuned from the command line by using the highdim property in the
YAML configuration. For instance, the batch_size and latent_size can be specified as shown below:
    highdim:
      batch_size: 128
      latent_size: 16
Note that setting the parameters from the command line is only possible with licences where CONFIGURATION is enabled.
Learn
The number of training steps for the HighDimSynthesizer can be configured using the learn property in the YAML configuration.
    learn:
      num_steps: 1000
Synthesis
The number of rows to synthesize and whether to synthesize NaNs can be configured using the synthesize property in the YAML configuration.
    synthesize:
      num_rows: 1000
      produce_nans: True
Rebalancing
Data Rebalancing can be configured in the YAML configuration with the rebalance keyword. The columns to rebalance are
specified as a list of dictionaries. In each dictionary, the name keyword gives the name of the column to be rebalanced
and the marginals keyword gives the desired marginal distribution of the values within that column.
The value of marginals is a dictionary whose keys are the values present in the column and whose values are the
proportions in which they should appear in the synthetic data.
The example below demonstrates how the values of two columns, fraud_column and sex_column, can be rebalanced. For
fraud_column, the values False and True should appear in equal measure. For sex_column, 70% of the values should be
female and 30% male.
    rebalance:
      - name: fraud_column
        marginals:
          false: 0.5
          true: 0.5
      - name: sex_column
        marginals:
          male: 0.3
          female: 0.7
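A simple sanity check on a rebalance section is that the proportions listed for each column add up to 1. The sketch below performs that check on the example, written as the Python structure a YAML parser would typically produce. The requirement that marginals sum to 1 is an assumption made for this illustration and is not stated above. The helper marginals_sum_to_one is hypothetical. Note that most YAML parsers resolve the unquoted keys false and true to booleans, which is why they appear as False and True here.

```python
import math

# The rebalance section of the example as parsed data. Unquoted
# false/true keys in YAML are typically loaded as booleans.
rebalance = [
    {"name": "fraud_column", "marginals": {False: 0.5, True: 0.5}},
    {"name": "sex_column", "marginals": {"male": 0.3, "female": 0.7}},
]


def marginals_sum_to_one(entry: dict, tol: float = 1e-9) -> bool:
    """Return True if the proportions of one rebalance entry sum to 1.

    Hypothetical helper; the sum-to-one requirement is an assumption.
    A tolerance is used because sums such as 0.3 + 0.7 are not exact
    in floating point.
    """
    total = sum(entry["marginals"].values())
    return math.isclose(total, 1.0, abs_tol=tol)


if __name__ == "__main__":
    for entry in rebalance:
        print(entry["name"], marginals_sum_to_one(entry))
```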
Rules
To specify Rules for synthesizing data, the rules keyword is used. Of the rules supported by the SDK, only Associations
are currently supported by the YAML configuration. To specify an association, the association keyword is used, followed
by a list of lists specifying the groups of columns to be associated.
In the example below, the columns car_brand and car_model form one association, while the columns country and city
form another.
    rules:
      association:
        - - car_brand
          - car_model
        - - country
          - city
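The "- -" notation in the example is YAML's compact way of writing a list whose items are themselves lists. The sketch below shows the equivalent nested structure in Python and runs two illustrative sanity checks on it: that every group contains at least two columns, and that no column is placed in more than one group. Both checks are assumptions made for this example rather than documented SDK requirements, and the helper check_associations is hypothetical.

```python
# The rules section of the example as the nested structure a YAML
# parser would typically produce: a list of column groups.
rules = {
    "association": [
        ["car_brand", "car_model"],
        ["country", "city"],
    ]
}


def check_associations(groups: list) -> list:
    """Return a list of problems found in the association groups.

    Hypothetical helper. The two checks (minimum group size and no
    column shared between groups) are assumptions for illustration.
    """
    problems = []
    seen = {}
    for index, group in enumerate(groups):
        if len(group) < 2:
            problems.append(f"group {index} has fewer than two columns")
        for column in group:
            if column in seen:
                problems.append(
                    f"column {column!r} appears in groups {seen[column]} and {index}"
                )
            else:
                seen[column] = index
    return problems


if __name__ == "__main__":
    # An empty list means neither check found a problem.
    print(check_associations(rules["association"]))
```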