YAML Configuration
YAML (YAML Ain't Markup Language) is a human-readable data serialization language. It can be used to configure the SDK
from the Command Line Interface by passing a user-defined configuration with the -c option:
    synthesize -c config.yaml raw_input.csv
where config.yaml is the configuration file and raw_input.csv is the input data.
The YAML configuration can be used to specify nearly all of the publicly available functionalities of the SDK for tabular data accessible through the Python API. These functionalities can be roughly divided into three categories:
- Single Column specifications, i.e. configuring actions that concern a single column
- Multiple Column specifications, i.e. configuring actions that concern multiple columns
- Model Configuration for training and synthesis
Each of the listed functionalities is specified with an associated keyword, as listed in the table below:
| Function | YAML key |
|---|---|
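| Privacy masking | masking |
| Meta | meta |
| Annotations | annotations |
| Model | model |
| Rules | rules |
| Learn config | learn |
| Synthesis | synthesize |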
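The three categories can be sketched as a simple lookup over the top-level keys of a parsed configuration. This is only an illustration: the grouping follows the section layout of this page, and `CATEGORIES` and `categorize` are hypothetical names, not part of the SDK.

```python
# Top-level YAML keys grouped by the three categories described above.
# The grouping mirrors the sections of this page; it is not an SDK API.
CATEGORIES = {
    "single_column": {"masking", "meta", "annotations", "model"},
    "multiple_column": {"rules"},
    "model_configuration": {"learn", "synthesize"},
}


def categorize(config: dict) -> dict:
    """Map each top-level key of a parsed config to its category."""
    result = {}
    for key in config:
        for category, keys in CATEGORIES.items():
            if key in keys:
                result[key] = category
                break
        else:
            # No category matched, e.g. a misspelled keyword.
            result[key] = "unknown"
    return result


example = {"masking": {}, "rules": {}, "learn": {}, "typo_key": {}}
print(categorize(example))
```

A check like this can catch a misspelled top-level keyword before the file is handed to the command line tool.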
If a YAML configuration is specified during command line synthesis, it is automatically validated against a default schema, ensuring that the appropriate keywords and value dtypes have been provided.
It’s also possible to generate a default config for any given dataset using the
A config file, in this case |
Single Column
Single column actions all follow the same structure. The action or functionality (e.g. masking) is specified
as a top-level keyword in the config YAML file. The transformations to apply are then specified as nested
keys beneath it. Beneath each of these keys is a list of dictionaries, each containing the name of a column to be acted upon
by that transformation together with any required or optional arguments. The name of the column is always specified
using the name key.
Below are examples of this for each single column action.
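The nesting described above (action, then transformation, then a list of column dictionaries) can be sketched as a plain Python dictionary, which is roughly what a YAML parser would produce. The helper `columns_by_transformation` is a hypothetical name used only for illustration.

```python
def columns_by_transformation(section: dict) -> dict:
    """Return {transformation: [column names]} for one single-column section."""
    return {
        transformation: [entry["name"] for entry in entries]
        for transformation, entries in section.items()
    }


# action -> transformation -> list of column dictionaries
config = {
    "masking": {
        "rounding": [
            {"name": "column_to_round_0", "bins": "3"},  # optional argument
            {"name": "column_to_round_1"},
        ],
    },
    "meta": {
        "float": [{"name": "float_column_0"}],
    },
}

for action, section in config.items():
    print(action, columns_by_transformation(section))
```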
Privacy Masking
Masks are specified using the masking property. In order to apply a mask to a given column
or set of columns, the mask in question is provided as a key. The masks available for use in command line synthesis and
their associated keys used in the YAML configuration are given in the below table:
| Mask | YAML key |
|---|---|
| NaN mask | nan |
| RoundingMask | rounding |
| FormatPreservingMask | format_preserving |
| HashingMask | hashing |
The syntax for applying the four masks is given in the example below. For each column to be masked, a dictionary is
supplied where the name key specifies the name of the column. Additional keyword arguments can also be specified in each
column dictionary. While the bins property for the RoundingMask and the key property for the HashingMask are
optional, pattern must be specified for each column to be masked using the FormatPreservingMask.
    masking:
      nan:
        - name: column_to_nan
      rounding:
        - name: column_to_round_0
          bins: '3'
        - name: column_to_round_1
      format_preserving:
        - name: string_column
          pattern: '\d{3}'
      hashing:
        - name: id_column
          seed: secret123
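Because pattern is required for every format_preserving entry, it can be useful to check a parsed masking section before running synthesis. The sketch below is an illustration only: `check_format_preserving` is a hypothetical helper, and treating the pattern as a Python regular expression is an assumption, since the SDK may use a different pattern dialect.

```python
import re


def check_format_preserving(masking: dict) -> list:
    """Return a list of problems found in the format_preserving entries."""
    problems = []
    for entry in masking.get("format_preserving", []):
        name = entry.get("name", "<unnamed>")
        pattern = entry.get("pattern")
        if pattern is None:
            problems.append(f"{name}: missing required 'pattern'")
            continue
        try:
            # Assumption: the pattern is compatible with Python's re syntax.
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{name}: invalid pattern ({exc})")
    return problems


masking = {
    "nan": [{"name": "column_to_nan"}],
    "format_preserving": [{"name": "string_column", "pattern": r"\d{3}"}],
}
print(check_format_preserving(masking))
```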
Meta
The meta keyword can be used to control and override the inferred data types
(internally referred to as "metas") used during training of the model and subsequent synthesis. With a YAML
configuration in command line synthesis, it is possible to specify the meta of any column in the input dataset. The table
below details the possible meta types and the associated keys that can be used in the YAML configuration:
| Meta | YAML key |
|---|---|
| Float | float |
| Integer | integer |
| String | string |
| ... | ... |
The example below demonstrates how to specify meta overrides for the named columns. Meta types are nested keys,
and the columns are listed beneath these keys. If a column in the dataframe
is not specified in the meta section of the YAML configuration, it will still be used to train the
HighDimSynthesizer, but its meta will be determined by automatic inference.
    meta:
      float:
        - name: float_column_0
        - name: float_column_1
      integer:
        - name: int_column_0
        - name: int_column_1
      string:
        - name: string_column
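To see which columns are overridden and which are left to automatic inference, the meta section can be compared against the dataset's column names. This is a minimal sketch with a hypothetical helper, `split_columns`; it only inspects the configuration and does not reproduce the SDK's inference.

```python
def split_columns(meta: dict, columns: list) -> tuple:
    """Return ({column: meta key} for overrides, [columns left to inference])."""
    overrides = {}
    for meta_key, entries in meta.items():
        for entry in entries:
            overrides[entry["name"]] = meta_key
    inferred = [column for column in columns if column not in overrides]
    return overrides, inferred


meta = {
    "float": [{"name": "float_column_0"}, {"name": "float_column_1"}],
    "integer": [{"name": "int_column_0"}],
}
# Hypothetical column names of the input dataset.
columns = ["float_column_0", "float_column_1", "int_column_0", "other_column"]

overrides, inferred = split_columns(meta, columns)
print(overrides)
print(inferred)
```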
Annotations
Entity Annotation can be used to group a set of columns so that they are treated as a single entity
during training of the model and subsequent synthesis. This is done with the annotations keyword. To identify a group of columns as a single entity, the desired annotation
is given as a key whose value is a list of dictionaries, one per unique entity. Each dictionary
consists of a name property, the name assigned to the entity, and a labels property, which
maps the annotation's labels to the columns that form the entity. Refer to Entity Annotation for details
regarding the labels for each specific annotation type. Note that each entity should have a unique name.
The table below details the possible annotations that can be used in command line synthesis and their associated keywords:
| Annotation | YAML key |
|---|---|
| Person | person |
| Company | company |
| ... | ... |
The example below demonstrates how columns within a dataset may be treated as an instance of a Person annotation. The
columns first_name_0 and last_name_0 are used to specify the first and last name, respectively, of the entity known as person_0,
while the columns first_name_1 and last_name_1 are used for person_1. An additional Company annotation
is also specified, describing the columns company_name and country as a single entity.
    annotations:
      person:
        - name: person_0
          labels:
            firstname: first_name_0
            lastname: last_name_0
        - name: person_1
          labels:
            firstname: first_name_1
            lastname: last_name_1
      company:
        - name: company_0
          labels:
            full_name: company_name
            country: country
          locales:
            - en_GB
            - fr_FR
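Since each entity should have a unique name, a parsed annotations section can be checked for duplicates before use. The helper below is hypothetical, and it assumes names must be unique across all annotation types, which is a stricter reading than the text strictly requires.

```python
from collections import Counter


def duplicate_entity_names(annotations: dict) -> list:
    """Return the entity names that occur more than once, sorted."""
    counts = Counter(
        entity["name"]
        for entities in annotations.values()
        for entity in entities
    )
    return sorted(name for name, count in counts.items() if count > 1)


annotations = {
    "person": [
        {"name": "person_0",
         "labels": {"firstname": "first_name_0", "lastname": "last_name_0"}},
        {"name": "person_1",
         "labels": {"firstname": "first_name_1", "lastname": "last_name_1"}},
    ],
    "company": [
        {"name": "company_0",
         "labels": {"full_name": "company_name", "country": "country"}},
    ],
}
print(duplicate_entity_names(annotations))
```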
Model
The model keyword can be used to override the default method the HighDimSynthesizer uses to model any column in an
input dataset. To use a particular model on a column or set of columns, the model type in question is given as a key
whose value is a list of dictionaries specifying the columns to be modelled. The column is specified
as the value of the name key, while any optional arguments accepted when creating the model can be
given as additional key-value pairs.
The table below details the possible model types and their associated YAML key:
| Model | YAML key |
|---|---|
| Enumeration | enumeration |
| Histogram | histogram |
| Kernel Density Estimate | kernel_density_estimate |
The example below demonstrates the syntax required to specify columns as each of the three model types. The start and
step keywords must be specified for Enumeration models.
    model:
      enumeration:
        - name: enumeration_column
          start: 200
          step: 1
      histogram:
        - name: categorical_column_0
        - name: categorical_column_1
        - name: categorical_column_2
      kernel_density_estimate:
        - name: continuous_column_0
        - name: continuous_column_1
For more information on models and overriding the default behaviour of the SDK see Overrides.
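A parsed model section can be inverted into a column-to-model lookup, which makes it easy to spot a column that has accidentally been listed under two model types. The helper `model_by_column` is hypothetical, and treating a doubly listed column as an error is an assumption rather than documented SDK behaviour.

```python
def model_by_column(model: dict) -> dict:
    """Return {column name: model key}, rejecting columns listed twice."""
    lookup = {}
    for model_key, entries in model.items():
        for entry in entries:
            name = entry["name"]
            if name in lookup:
                # Assumption: a column should have at most one model override.
                raise ValueError(
                    f"{name!r} is listed under both "
                    f"{lookup[name]!r} and {model_key!r}"
                )
            lookup[name] = model_key
    return lookup


model = {
    "enumeration": [{"name": "enumeration_column", "start": 200, "step": 1}],
    "histogram": [{"name": "categorical_column_0"}],
    "kernel_density_estimate": [{"name": "continuous_column_0"}],
}
print(model_by_column(model))
```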
Multiple Column
Currently, Rules are the only multi-column transformation that can be applied through YAML configuration. Additional
multi-column transformations will be added.
Rules
To specify Rules for synthesizing data, the rules keyword
is used. Of the rules supported by the SDK, only Associations are currently
supported by YAML configuration. To specify an association, the association keyword is used, followed by a list of lists specifying
the groups of columns to be associated. An optional string, allocated_memory, gives the (rough)
amount of memory the association is allowed to use.
In the example below, the columns car_brand and car_model are to form one association, while the columns country
and city are to form another. The maximum memory allowed to the association is (roughly) 2gb.
    rules:
      association:
        - - car_brand
          - car_model
        - - country
          - city
      allocated_memory: 2gb
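The rules section can also be assembled programmatically before being written out as YAML. The sketch below makes two assumptions that are not confirmed by the text: allocated_memory sits beside association under rules, and each association group needs at least two columns. `build_rules` is a hypothetical helper.

```python
def build_rules(groups, allocated_memory=None) -> dict:
    """Build a rules mapping from groups of associated column names."""
    for group in groups:
        # Assumption: an association only makes sense for two or more columns.
        if len(group) < 2:
            raise ValueError(f"association group {list(group)!r} needs at least two columns")
    rules = {"association": [list(group) for group in groups]}
    if allocated_memory is not None:
        # Assumption: allocated_memory is a sibling of the association key.
        rules["allocated_memory"] = allocated_memory
    return rules


rules = build_rules(
    [("car_brand", "car_model"), ("country", "city")],
    allocated_memory="2gb",
)
print({"rules": rules})
```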
Model Configuration
Model configuration concerns the configuration of the HighDimSynthesizer during training, the number of training steps
to use and the synthetic data to be output from a trained model.
Learn Config
Any values set using the HighDimConfig can also be tuned from the command line by using the learn property in the
YAML configuration. For instance, the batch_size and latent_size can be specified as shown below:
    learn:
      batch_size: 128
      latent_size: 16
Note that setting the parameters from the command line is only possible with licences where |
The number of training steps for the HighDimSynthesizer can also be configured using the learn property in the YAML configuration.
    learn:
      num_steps: 1000
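A quick sanity check of the learn section might look like the following. It assumes that batch_size, latent_size and num_steps can share a single learn mapping and that each should be a positive integer; both points are assumptions, and `check_learn` is a hypothetical helper rather than the SDK's schema validation.

```python
def check_learn(learn: dict) -> list:
    """Return a list of problems found in the known learn keys."""
    problems = []
    for key in ("batch_size", "latent_size", "num_steps"):
        if key not in learn:
            continue
        value = learn[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key}: expected a positive integer, got {value!r}")
    return problems


learn = {"batch_size": 128, "latent_size": 16, "num_steps": 1000}
print(check_learn(learn))
```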
Synthesis
The number of rows to synthesize and whether to synthesize NaNs can be configured using the synthesize property in the YAML configuration.
    synthesize:
      num_rows: 1000
      produce_nans: True
Data Rebalancing can be configured with the rebalance keyword
in the synthesize section. The columns to rebalance are specified as a list of dictionaries, where the name
key of each dictionary gives the name of the column to be rebalanced.
The marginals key then specifies the desired marginal distribution of the values within that column.
Its value is a dictionary whose keys are the values present in the column and whose values
are the proportions in which they should appear in the synthetic data.
The below example demonstrates how to rebalance two columns, fraud_column and sex_column.
    synthesize:
      rebalance:
        - name: fraud_column
          marginals:
            false: 0.5
            true: 0.5
        - name: sex_column
          marginals:
            male: 0.3
            female: 0.7
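Since the marginals are proportions, one plausible sanity check is that they sum to 1 for each column. The sketch below assumes that requirement, which the text does not state explicitly, and uses a hypothetical helper, `check_marginals`. Note that a YAML 1.1 loader such as PyYAML reads unquoted true and false keys as booleans, so they appear as `True` and `False` in the parsed dictionary.

```python
import math


def check_marginals(rebalance: list, tol: float = 1e-9) -> list:
    """Return a message for every column whose marginals do not sum to 1."""
    problems = []
    for entry in rebalance:
        total = sum(entry["marginals"].values())
        # Assumption: the proportions for a column should add up to 1.
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"{entry['name']}: marginals sum to {total}, expected 1.0")
    return problems


rebalance = [
    # Boolean keys, as produced by a YAML 1.1 loader for true/false.
    {"name": "fraud_column", "marginals": {False: 0.5, True: 0.5}},
    {"name": "sex_column", "marginals": {"male": 0.3, "female": 0.7}},
]
print(check_marginals(rebalance))
```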