Configuration File#

To execute a workflow, the user can provide a YAML configuration file to fine-tune parameters of the transformations. This section describes the form of this configuration file.

The YAML configuration file has the following structure:

schema_creation_mode: CREATE_IF_NOT_EXISTS | DO_NOT_CREATE | CREATE | DROP_AND_CREATE  # default: CREATE_IF_NOT_EXISTS
table_truncation_mode: DO_NOT_TRUNCATE | TRUNCATE | IGNORE  # default: DO_NOT_TRUNCATE
cycle_resolution_strategy: FAIL | DELETE_NOT_REQUIRED  # default: FAIL
default_config:
  mode: KEEP | MASKING | GENERATION  # default: KEEP
  target_ratio: {float greater than or equal to 0}  # default: 1.0
  insert_batch_size: {integer greater than or equal to 1}  # default: 30
user_table_configs: {list of user_table_config}
global_seed: {32-bit integer, positive or negative} # default: 0

Schema Creation Mode#

There are four schema creation modes:

  • CREATE_IF_NOT_EXISTS: (default) if this mode is selected, the DDL schema will be copied from the source database to the target one if it does not exist there. Otherwise, the existing schema will be used.

  • DO_NOT_CREATE: if this mode is selected, the existing schema will be used without any validations. Please use this mode carefully: run-time errors may occur if the input and output schema do not match.

  • CREATE: if this mode is selected, DDL schema will be copied from the source database to the target one. The target database should be empty.

  • DROP_AND_CREATE: if this mode is selected, DDL schema will be copied from the source database to the target one. Existing schema in the target database will be dropped. Please use this mode carefully.

Note: If the CREATE_IF_NOT_EXISTS or DO_NOT_CREATE mode is used, the existing target schema should match the source schema.

Cycle Resolution Strategy#

There are two cycle resolution strategies:

  • FAIL: (default) if this mode is selected, cycle_breaker_references should be provided in the configuration file. If they are not provided, execution will fail when a circular reference is detected.

  • DELETE_NOT_REQUIRED: if this mode is selected, cyclic references will be resolved automatically by removing the last nullable reference leading to the cycle.

Example for a cycle breaker reference:

schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: FAIL
table_truncation_mode: TRUNCATE
default_config:
  mode: GENERATION
  target_ratio: 1.0
user_table_configs:
  - table_name_with_schema: "employees"
    cycle_breaker_references: ["employees"]

In this example, the employees table contains a cyclic reference.
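For comparison, here is a minimal sketch of a configuration that lets the engine resolve cycles automatically instead of listing cycle_breaker_references. It uses only keys described in this section. The values are illustrative, not a recommended setup.

```yaml
# Cycles are resolved automatically by removing the last nullable
# reference leading to the cycle, so no cycle_breaker_references are listed.
cycle_resolution_strategy: DELETE_NOT_REQUIRED
default_config:
  mode: GENERATION
  target_ratio: 1.0
```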

Global Seed#

A 32-bit integer value between -2147483648 and 2147483647, used as a seed for random number generators. The result of generation is the same each time the generation is run with the same seed and workflow configuration. By default, global_seed is 0.

Example:

default_config:
  mode: "MASKING"
  target_ratio: 1.0
global_seed: 42

Table Truncation Mode#

There are three table truncation modes:

  • DO_NOT_TRUNCATE: (default) if this mode is selected, tables in the target database won’t be truncated. An empty target database is required.

  • TRUNCATE: if this mode is selected, tables in the target database will be truncated.

  • IGNORE: if this mode is selected, the status of the target database is ignored.
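As an illustration, the fragment below sketches a configuration that might suit repeated runs against a target database that already holds data from an earlier run. It assumes that reusing the existing schema and truncating the tables is the desired behaviour. Only keys documented in this section are used.

```yaml
# Reuse the target schema if it already exists, otherwise copy it from the source.
schema_creation_mode: CREATE_IF_NOT_EXISTS
# Empty the target tables before writing, so the target does not need to be empty.
table_truncation_mode: TRUNCATE
default_config:
  mode: MASKING
```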

Default Table Configuration#

The default table configuration applies to every table unless it is overridden. Two parameters can be modified in the default configuration:

  • mode: Defines how tables are processed by default (see Table Modes).

  • target_ratio: The relative size of the output database with respect to the input. The number of rows of each output table is computed by multiplying this parameter by the input table size. If not provided, target_ratio defaults to 1, resulting in the same size for the input and output databases.

The default configuration supports the following structure:

default_config:
  mode: KEEP | MASKING | GENERATION
  target_ratio: {float greater than or equal to 0}

User Table Configuration#

The parameters defined in the default configuration are applied to all tables in the database, so there’s no need to configure each table individually. If needed, the user can override the default configuration for any specific table present in the database.

For each table, the user can create a user_table_config and add it to the list user_table_configs. Each user_table_config contains the following parameters:

  • table_name_with_schema (required): The name of the table affected by this user_table_config. Must be in format $schema.$table, and the table must exist in the database.

  • mode (optional): The mode of this table, see Table Modes.

  • target_ratio (optional): The target ratio of this table, see Target Ratio.

  • column_params (optional): To override the default generators and their parameters applied to each column.

Where each column_params entry contains:

  • columns: List of columns that are affected by this generator

  • params: Parameters of the generator. All parameters have a type key with the type name of the transformation, plus other parameters that are transformation-specific. The full list of available transformations is in Transformations List.

The structure for user_table_configs looks as follows:

user_table_configs:
# First table in the list
- table_name_with_schema: {string with format "$schema.$table"}
  mode: KEEP | MASKING | GENERATION
  target_ratio: {float greater than or equal to 0}
  column_params:
  - columns: {list of strings with column names}
    params:
      type: {string with type name of transformation}
      {other key value pairs with transformation-specific parameters}
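A filled-in sketch of the structure above may help. The schema, table, and column names (public.customer, customer_id) are hypothetical, and the type name "passthrough" is assumed from the default transformation mentioned under Table Modes. Check the Transformations List for the exact type names and their parameters.

```yaml
user_table_configs:
  # Hypothetical table: overrides the default mode and ratio for this table only
  - table_name_with_schema: "public.customer"
    mode: MASKING
    target_ratio: 0.5
    column_params:
      # Hypothetical column: copied unchanged instead of being masked
      - columns: ["customer_id"]
        params:
          type: "passthrough"  # assumed type name
```

Tables that are not listed in user_table_configs continue to use default_config.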

Table Modes#

There are three table processing modes:

  • KEEP: if this mode is selected, the original data will be copied as it is. When this mode is selected, the output size cannot exceed the input size, i.e. target_ratio <= 1.

  • MASKING: if this mode is selected, masking transformations will be applied to the original data. When this mode is selected, the output size cannot exceed the input size, i.e. target_ratio <= 1.

  • GENERATION: if this mode is selected, the Synthesized engine will learn the original data and generate new synthetic data. For this mode, the output database can be bigger than the input, so target_ratio can be greater than 1.

Note

Both KEEP and MASKING modes apply a transformation to the original data. KEEP uses passthrough as the default transformation, while MASKING automatically assigns a privacy-preserving masking transformation to all columns. See Transformations List for more details.

For all modes, the user can override the default transformations.

Target Ratio#

The size of each output table is computed by multiplying the number of rows of the input table with the target_ratio parameter.

When the table mode is KEEP or MASKING, the output cannot be larger than the input, so the target ratio must not exceed one, i.e. target_ratio <= 1. Setting a target ratio smaller than one is equivalent to subsetting the database, and the Synthesized engine will take care of not breaking referential integrity.

For table mode GENERATION, the target ratio can be greater than 1 to generate an output that is larger than the input.
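To make the arithmetic concrete, the sketch below annotates two table-level ratios with the row counts they would target. The table names and the input size of 1,000 rows are assumptions chosen for illustration.

```yaml
user_table_configs:
  # Hypothetical table with 1,000 input rows: 1,000 * 2.0 = 2,000 target output rows
  - table_name_with_schema: "public.events"
    mode: GENERATION
    target_ratio: 2.0
  # Hypothetical table with 1,000 input rows: 1,000 * 0.25 = 250 target output rows
  - table_name_with_schema: "public.audit_log"
    mode: MASKING
    target_ratio: 0.25
```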

Note

When setting target_ratio at a table level, the result may end up being smaller than the given value due to relationships with a parent table.

For example, if a customer table is set to target_ratio = 0.5, and its child table transactions has target_ratio = 1.0, the output transactions table will also end up with about half its rows because it depends on the reduced customer table.
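The customer and transactions scenario could be written as the following fragment. The schema name public is an assumption. The transactions table is not listed, so it keeps the default ratio of 1.0, yet its output is still limited by the reduced customer table.

```yaml
default_config:
  mode: KEEP
  target_ratio: 1.0
user_table_configs:
  # Parent table reduced to half of its rows
  - table_name_with_schema: "public.customer"  # hypothetical schema name
    target_ratio: 0.5
# public.transactions uses the default target_ratio of 1.0, but only rows
# that reference a retained customer can be kept in the output.
```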