YAML Configuration

YAML (YAML Ain't Markup Language) is a human-readable data serialization language. It can be used to configure the SDK from the command line by passing a user-defined configuration file with the -c option:

synthesize -c config.yaml raw_input.csv

where config.yaml is the configuration file and raw_input.csv is the input data. The options and functionality that can be specified in the YAML configuration file are described in the sections below.

If a YAML configuration is specified during command line synthesis, it is automatically validated against a default schema, ensuring that the appropriate keywords and value dtypes have been provided.
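
As a rough illustration of what such a check can catch, the sketch below flags top-level keywords that this document does not describe. It is not the SDK's schema or validator: the function name unknown_top_level_keys and the key set are assumptions assembled from the section keywords on this page, and the config dictionary stands in for whatever a YAML parser would load.

```python
# Top-level keywords described in this document. The real schema shipped with
# the SDK is not reproduced here; this set is only an illustration.
KNOWN_TOP_LEVEL_KEYS = {
    "masking", "meta", "annotations", "model", "highdim",
    "learn", "synthesize", "rebalance", "rules",
}


def unknown_top_level_keys(config: dict) -> list:
    """Return the top-level keys of `config` that this document does not describe."""
    return sorted(key for key in config if key not in KNOWN_TOP_LEVEL_KEYS)


# `config` stands in for the dictionary a YAML parser would produce.
config = {"learn": {"num_steps": 1000}, "synthesise": {"num_rows": 1000}}
print(unknown_top_level_keys(config))  # ['synthesise'] (a misspelt keyword)
```

A check like this can be run before invoking the command line tool to catch misspelt keywords early.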

Privacy Masking

Privacy Masks are specified using the masking property. To apply a mask to a given column or set of columns, the mask in question is provided as a key. The value associated with this key is a list of dictionaries, each specifying the name of the column to be masked and any additional arguments required by the relevant mask. The values of these additional arguments should be given as strings. The masks available for use in command line synthesis and their associated keys in the YAML configuration are given in the table below:

Mask                  YAML key
NanMask               nan
RoundingMask          rounding
FormatPreservingMask  format_preserving

In the example below, the NanMask is applied to the column column_to_nan, while the RoundingMask is applied to column_to_round_0 to create three bins and to column_to_round_1 to bin the data into a default number of bins. Finally, the FormatPreservingMask is applied to the column string_column with a regex pattern given as the value of the pattern key. While the bins property of the RoundingMask is optional, pattern must be specified for each column to be masked using the FormatPreservingMask.

masking:
  nan:
  - name: column_to_nan
  rounding:
  - name: column_to_round_0
    bins: '3'
  - name: column_to_round_1
    bins: null
  format_preserving:
  - name: string_column
    pattern: '\d{3}'
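
The constraints described above (pattern is required for format_preserving, additional arguments are given as strings) can be checked before running synthesis. The following is a minimal sketch under stated assumptions, not the SDK's own validation: check_masking is a hypothetical helper, and the masking dictionary mirrors what a YAML parser would load from the example.

```python
import re


def check_masking(masking: dict) -> list:
    """Return a list of human-readable problems found in a `masking` section."""
    problems = []
    for mask_key, entries in masking.items():
        for entry in entries:
            column = entry.get("name")
            if column is None:
                problems.append(f"{mask_key}: entry without a name")
                continue
            if mask_key == "format_preserving":
                pattern = entry.get("pattern")
                if not isinstance(pattern, str):
                    problems.append(f"{column}: pattern is required")
                else:
                    try:
                        re.compile(pattern)  # only checks that the regex is well formed
                    except re.error as exc:
                        problems.append(f"{column}: invalid pattern ({exc})")
            if mask_key == "rounding":
                bins = entry.get("bins")
                # Additional arguments are given as strings; null (None) means "use the default".
                if bins is not None and not (isinstance(bins, str) and bins.isdigit()):
                    problems.append(f"{column}: bins should be a string such as '3'")
    return problems


# Python equivalent of the YAML example above.
masking = {
    "nan": [{"name": "column_to_nan"}],
    "rounding": [
        {"name": "column_to_round_0", "bins": "3"},
        {"name": "column_to_round_1", "bins": None},
    ],
    "format_preserving": [{"name": "string_column", "pattern": r"\d{3}"}],
}
print(check_masking(masking))  # []
```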

Meta

The meta keyword can be used to control and override the inferred data types (internally referred to as "metas") used during training of the model and subsequent synthesis. With a YAML configuration in command line synthesis, it is possible to specify the meta of any column in the input dataset. The table below details the possible meta types and the associated keys that can be used in the YAML configuration:

Meta          YAML key
String        string
Bool          bool
DateTime      date_time
TimeDelta     time_delta
TimeDeltaDay  time_delta_day
Integer       integer
IntegerBool   integer_bool
Float         float

To specify the meta of a particular column or columns, the desired meta type is given as a key whose value is a list of the columns to be interpreted using this meta type. In the example below, the columns float_column_0 and float_column_1 are to have meta type Float, int_column_0 and int_column_1 are to be interpreted as meta type Integer, and string_column is to be interpreted as meta type String.

meta:
  float:
  - float_column_0
  - float_column_1
  integer:
  - int_column_0
  - int_column_1
  string:
  - string_column
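
Because the meta section maps each meta type to a list of columns, it can be convenient to invert it into a column-to-meta lookup. The sketch below does this and also rejects a column listed under two meta types; whether the SDK itself rejects such a configuration is not stated here, so treat that check as an assumption. The helper name columns_to_meta is hypothetical.

```python
def columns_to_meta(meta: dict) -> dict:
    """Invert a `meta` section into a column -> meta key mapping.

    Raises ValueError if a column is listed under more than one meta type.
    """
    assigned = {}
    for meta_key, columns in meta.items():
        for column in columns:
            if column in assigned:
                raise ValueError(
                    f"{column!r} is listed under both {assigned[column]!r} and {meta_key!r}"
                )
            assigned[column] = meta_key
    return assigned


# Python equivalent of the YAML example above.
meta = {
    "float": ["float_column_0", "float_column_1"],
    "integer": ["int_column_0", "int_column_1"],
    "string": ["string_column"],
}
print(columns_to_meta(meta)["int_column_1"])  # integer
```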

For more information on metas and overriding the default behaviour of the SDK see Meta Overrides.

Annotations

Entity Annotation can be used to group together a set of columns so that they are treated as a single entity during training of the model and subsequent synthesis. To annotate a set of columns in this way, the annotations keyword is used. The desired annotation type is given as a key whose value is a list of dictionaries, one per unique entity. Each dictionary consists of a name property, giving the name that will be assigned to the entity, and a labels property, giving details of the columns that are grouped to form the entity. Refer to Entity Annotation for details regarding the labels for each specific annotation type. Note that each entity should have a unique name.

The table below details the possible annotations that can be used in command line synthesis and their associated keywords:

Annotation       YAML key
Address          address
Bank             bank
FormattedString  formatted_string
Person           person

The example below demonstrates how columns within a dataset may be treated as an instance of a Person annotation. The columns first_name_0 and last_name_0 are used to specify the first and last name, respectively, of the entity known as person_0, while the columns first_name_1 and last_name_1 are used for person_1.

annotations:
  person:
  - name: person_0
    labels:
      firstname: first_name_0
      lastname: last_name_0
  - name: person_1
    labels:
      firstname: first_name_1
      lastname: last_name_1
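
Since each entity should have a unique name, a quick check for duplicates can save a failed run. The sketch below is illustrative only: duplicate_entity_names is a hypothetical helper, and it assumes that names must be unique across all annotation types rather than only within one type, which this document does not spell out.

```python
from collections import Counter


def duplicate_entity_names(annotations: dict) -> list:
    """Return entity names that occur more than once in an `annotations` section."""
    names = Counter(
        entity["name"]
        for entities in annotations.values()  # one list of entities per annotation type
        for entity in entities
    )
    return sorted(name for name, occurrences in names.items() if occurrences > 1)


# Python equivalent of the YAML example above.
annotations = {
    "person": [
        {"name": "person_0",
         "labels": {"firstname": "first_name_0", "lastname": "last_name_0"}},
        {"name": "person_1",
         "labels": {"firstname": "first_name_1", "lastname": "last_name_1"}},
    ]
}
print(duplicate_entity_names(annotations))  # []
```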

Model

The model keyword can be used to override the default method the HighDimSynthesizer uses to model any column in an input dataset. To use a particular model on a column or set of columns, the model type in question is given as a key whose value is a list of dictionaries specifying the details of each column to be modelled. The column is specified as the value of the name key, while any optional arguments that can be used when creating a given model are specified as additional key-value pairs.

The table below details the possible model types and their associated YAML keys:

Model                  YAML key
Enumeration            enumeration
Histogram              histogram
KernelDensityEstimate  kernel_density_estimate

In the example below, the column enumeration_column is modelled using Enumeration with a starting value of 200 and step between values of one. A set of categorical columns are modelled using Histogram and a set of continuous columns are modelled with KernelDensityEstimate.

model:
  enumeration:
  - name: enumeration_column
    start: 200
    step: 1
  histogram:
  - name: categorical_column_0
  - name: categorical_column_1
  - name: categorical_column_2
  kernel_density_estimate:
  - name: continuous_column_0
  - name: continuous_column_1
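
To make the start and step arguments concrete, the sketch below generates the values such a column would contain if an enumeration is read as a simple arithmetic sequence. This reading is an assumption based on the description above, and enumeration_values is a hypothetical helper rather than part of the SDK.

```python
from itertools import count, islice


def enumeration_values(start: int, step: int, num_rows: int) -> list:
    """Return the first `num_rows` values of the sequence start, start + step, ..."""
    return list(islice(count(start, step), num_rows))


# Values for enumeration_column in the example above, for five rows.
print(enumeration_values(200, 1, 5))  # [200, 201, 202, 203, 204]
```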

For more information on models and overriding the default behaviour of the SDK see Model Overrides.

HighDimSynthesizer config

Any values set using the HighDimConfig can also be tuned from the command line by using the highdim property in the YAML configuration. For instance, the batch_size and latent_size can be specified as shown below:

highdim:
  batch_size: 128
  latent_size: 16

Note that setting the parameters from the command line is only possible with licences where CONFIGURATION is enabled.

Learn

The number of training steps for training the HighDimSynthesizer can be configured using the learn property in the YAML configuration.

learn:
  num_steps: 1000

Synthesis

The number of rows to synthesize and whether to synthesize NaNs can be configured using the synthesize property in the YAML configuration.

synthesize:
  num_rows: 1000
  produce_nans: True

Rebalancing

Data Rebalancing can be configured in the YAML configuration with the rebalance keyword. The columns to rebalance are specified as a list of dictionaries. In each dictionary, the name keyword gives the name of the column to be rebalanced, and the marginals keyword gives the desired marginal distribution of the values within that column. The value of marginals is a dictionary whose keys are the values present in the column and whose values are the proportions in which they should appear in the synthetic data.

The example below demonstrates how the values of two columns, fraud_column and sex_column, can be rebalanced. In the case of fraud_column, the values False and True should appear in equal measure. In the case of sex_column, 70% of the values should be female and 30% male.

rebalance:
- name: fraud_column
  marginals:
    false: 0.5
    true: 0.5
- name: sex_column
  marginals:
    male: 0.3
    female: 0.7
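
Because marginals are proportions, it is reasonable to check that they add up to one for each column. The sketch below performs that check; marginals_sum_to_one is a hypothetical helper, and whether the SDK requires the proportions to sum exactly to one is an assumption. Note that a YAML 1.1 parser such as PyYAML loads the unquoted keys false and true as booleans, which is why the dictionary below uses False and True.

```python
import math


def marginals_sum_to_one(rebalance: list) -> dict:
    """Map each rebalanced column to whether its marginals sum to 1 (within float tolerance)."""
    return {
        entry["name"]: math.isclose(sum(entry["marginals"].values()), 1.0)
        for entry in rebalance
    }


# Python equivalent of the YAML example above.
rebalance = [
    {"name": "fraud_column", "marginals": {False: 0.5, True: 0.5}},
    {"name": "sex_column", "marginals": {"male": 0.3, "female": 0.7}},
]
print(marginals_sum_to_one(rebalance))
```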

Rules

To specify Rules for synthesizing data, the rules keyword is used. Of the rules supported by the SDK, only Associations are currently supported by the YAML configuration. To specify associations, the association keyword is used, followed by a list of lists, each inner list naming one group of columns to be associated.

In the example below, the columns car_brand and car_model are to form one association, while the columns country and city are to form another.

rules:
  association:
  - - car_brand
    - car_model
  - - country
    - city
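
The nested dash syntax above loads as a list of lists, one inner list per association. The sketch below shows that structure in Python and flags any column that appears in more than one group. This is only a sanity check one might run on a configuration: columns_in_multiple_groups is a hypothetical helper, and this document does not say whether the SDK allows a column to belong to several associations.

```python
from collections import Counter


def columns_in_multiple_groups(associations: list) -> list:
    """Return columns that appear in more than one association group."""
    counts = Counter(
        column
        for group in associations
        for column in set(group)  # count each column at most once per group
    )
    return sorted(column for column, groups in counts.items() if groups > 1)


# Python equivalent of the `association` value in the YAML example above.
associations = [["car_brand", "car_model"], ["country", "city"]]
print(columns_in_multiple_groups(associations))  # []
```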