YAML configuration

User single database config

Object.
User configuration used when working with a single database

Properties

  • global_seed: Integer.
    A value used a seed for random number generators. The result of generation must be the same each time the generation is being run with the same seed and workflow configuration. By default global_seed is 0.

  • schemas: array of String.
    List of schemas to process. If not set or null, all available schemas will be processed.

User multiple databases config

Object.
User configuration used when working with multiple databases

Properties

Default configuration

Used in: default_config

optional Object.
Consists of the default table configuration parameters that are applied to all tables by default and a list of rules that are conditionally applied to the table.

Properties

  • target_ratio: optional Number (double).
    The relative size of each table of the output database with respect to the input. The number of rows of each output table will be computed by multiplying this parameter by the input table size. If not provided, this parameter will be target_ratio = 1, resulting on same size for input and output databases. Can be overridden by target_row_number

  • target_row_number: optional Integer (int64).
    The absolute size of each table of the output database in rows. This parameter is applicable only for GENERATION mode. If not provided, target_ratio will be used.

  • insert_batch_size: optional Integer.
    Indicates how many table rows are inserted to the database per each batch operation.

  • use_working_directory: Boolean.
    Whether to use the configured local working directory to perform certain I/O operations more efficiently. In some cases, using working directory can speed up table insertion speed by several times. However, the more tables use the working directory, the more space is needed for it. Note that this setting is only effective when the working directory is configured globally (see Working directory properties).

The rules used to select the appropriate transformation depending on the database metadata if one is not specified in the user’s configuration.

User-defined table configuration

Used in: tables

Object.
The parameters defined in the default configuration are applied to all tables in the database, so there’s no need to configure each table individually. But if needed, the user can override default configuration for any specific table present in the database. For each table, the user can create a UserTableConfig and add it to the list tables of YAML configuration.

Properties

  • table_name_with_schema: String.
    The name of the table affected by this UserTableConfig. Must be in format $schema.$table, and the table must exist in the database.

  • target_ratio: optional Number (double).
    The relative size of the output table with respect to the input. The number of rows of each output table will be computed by multiplying this parameter by the input table size. If not provided, this parameter will be target_ratio = 1, resulting on same size for input and output databases. Can be overridden by target_row_number

When setting target_ratio at a table level, the result may end up being smaller than the given value due to relationships with parent table. For example, if a customer table is set to target_ratio = 0.5, and its child table transactions has target_ratio = 1.0, the output transaction table will also end with half it’s samples due to its downstream dependency to the reduced table customer.
  • target_row_number: optional Integer (int64).
    The absolute size of the output table in rows. This parameter is applicable only for GENERATION mode. If not provided, target_ratio will be used.

  • cycle_breaker_references: array of String.
    When Cycle resolution strategy is FAIL, this list may contain a list of table names, references to which will be ignored during the data generation.

  • insert_batch_size: optional Integer.
    Indicates how many table rows are inserted to the database per each batch operation.

  • use_working_directory: Boolean.

  • filter: String.
    An expression to filter a source table data. Any SQL expression that can be used as the search condition in WHERE clause. For example:

org_id != 120 and custom_function() = 'foo'

Cycle resolution strategy

optional String.
Defines how to deal with cycles in table relations via foreign keys.

FAIL

if this mode is selected, cycle_breaker_references should be provided in the configuration file. Otherwise, execution will fail if it detects a circular reference.

DELETE_NOT_REQUIRED

if this mode is selected, cyclic references will be resolved automatically by removing the last nullable reference leading to the cycle.

Example for a cycle breaker reference:

schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: FAIL
table_truncation_mode: TRUNCATE
safety_mode: "RELAXED"
default_config:
  mode: GENERATION
  target_ratio: 1.0
tables:
  - table_name_with_schema: "public.employees"
    cycle_breaker_references: ["public.employees"]

Where the employees table contains a cycle reference.

Enum values
  • FAIL

  • DELETE_NOT_REQUIRED

Schema Creation Mode

optional String.
Defines the mode of schema creation.

CREATE_IF_NOT_EXISTS

if this mode is selected, DDL schema will be copied from the source database to the target one if it does not exist, existing schema will be used otherwise.

DO_NOT_CREATE

if this mode is selected, the existing schema will be used without any validations. Please use this mode carefully: run-time errors may occur if the input and output schema do not match.

CREATE

if this mode is selected, DDL schema will be copied from the source database to the target one. The target database should be empty.

DROP_AND_CREATE

if this mode is selected, DDL schema will be copied from the source database to the target one. Existing schema in the target database will be dropped. Please use this mode carefully.

Enum values
  • DO_NOT_CREATE

  • CREATE

  • CREATE_IF_NOT_EXISTS

  • DROP_AND_CREATE

Table truncation mode

optional String.
Defines the mode of table truncation.

DO_NOT_TRUNCATE

(default) if this mode is selected, tables in the target database won’t be truncated. An empty target database required.

TRUNCATE

if this mode is selected, tables in the target database will be truncated.

IGNORE

if this mode is selected, the status of the target database is ignored.

Enum values
  • DO_NOT_TRUNCATE

  • TRUNCATE

  • IGNORE

Safety Mode

Used in: safety_mode

optional String.
The mode of choosing transformations.

STRICT

if this mode is selected and no matching transformation found, then no passthrough, categorical_generator and null_generator transformations will be chosen by default. If no matching transformation found, an error occurs.

RELAXED

if this mode is selected and no matching transformation found, then the passthrough transformation will be applied for MASKING mode, and null_generator or categorical_generator for GENERATION mode.

This property has no meaning for KEEP mode. Default: STRICT.

Enum values
  • STRICT

  • RELAXED

TransformationConfigs

map of String keys to`User single database config`.

Map of named transformation configs. Keys of this map are transformation config ids and values are objects of type User single database config.

Mapping of data sources

map of String keys to`DataSourceMappingEntry`.

Mapping of data sources defined in the Inventory file.

Transformation mode

Used in: mode, mode

optional String.
Defines table processing mode.

KEEP

if this mode is selected, the original data will be copied as it is. When this mode is selected, the output size needs to be smaller than the input, i.e. target_ratio <= 1.

MASKING

if this mode is selected, masking transformations will be applied to the original data. When this mode is selected, the output size needs to be smaller than the input, i.e. target_ratio <= 1.

GENERATION

if this mode is selected, the synthesized engine will learn the original data and generate new synthetic data. For this mode, the output database can be bigger than the input, so target_ratio can be greater than 1.

Both KEEP and MASKING modes apply a transformation to original data. While KEEP uses passthrough as default transformation, while MASKING automatically assigns a privacy preserving masking transformation to all columns. See transformations list for more details. For all modes, the user can override default transformers.
Enum values
  • MASKING

  • GENERATION

  • KEEP

Default configuration item

Used in: items

Object.
The rule that applies to the tables by default, written in form "if given conditions are met, the given Transformation parameters are applied."

Properties

  • id: optional String.
    Optional identifier of the rule (generally needed for debugging).

The list of conditions that must be met in order for the transformation params to be applied.

Column transformation parameters

Used in: transformations

Object.
List of column names associated with Transformation parameters.

Properties

  • columns: array of String.
    List of columns that are affected by this generator.

DataSourceMappingEntry

Object.

Properties

  • source: String.

  • targets: array of String.

Condition

Used in: conditions

Object.
Condition on which the default parameters are being applied.

Depending on type property value, can be one of the following:

is_key

is_primary_key

is_foreign_key

is_ignored_foreign_key

is_unique

is_nullable

mode_in

true

data_type

unique_values

is_empty

parent_transformation

is_uuid

is_identity

single_distinct_value

distinct_values

std_dev

not

safety_mode

is_ignored_table

ValueMapping

Used in: mapping

Object.
The pair of SQL expressions which are applied while reading and writing values from the specified column. The expression should be a valid SQL that can contain the macro ?. This macro will be replaced with a column name on reading and transformed value on writing.

When using plain SQL expressions, you must guarantee syntax integrity. You may create the possibility of malicious SQL injection. Be sure to properly use this functionality.

Properties

  • read: String.
    The SQL expression which is applied while reading the column value from a source database. The value "?" can be used if the mapping is not required.

  • write: String.
    The SQL expression which is applied while writing the column value to a target database. The value "?" can be used if the mapping is not required.

IsKeyCondition

The column is a part of either primary or a foreign key.

Properties

  • type = is_key

IsPrimaryKeyCondition

The column is a part of a primary key.

Properties

  • type = is_primary_key

IsForeignKeyCondition

The column is a part of a foreign key.

Properties

  • type = is_foreign_key

IsIgnoredForeignKeyCondition

Applied to columns that are the part of FK which is ignored due to cycles

Properties

  • type = is_ignored_foreign_key

IsUniqueCondition

The column is either a part of primary key or UNIQUE constraint.

Properties

  • type = is_unique

IsNullableCondition

The column is nullable

Properties

  • type = is_nullable

ModeInCondition

The transformation mode is in a given array.

Properties

  • type = mode_in

TrueCondition

Always true, thus making the rule applicable to every column.

Properties

  • type = true

DataTypeCondition

The column has one of the given data types.

Properties

  • type = data_type

UniqueValuesCondition

Check whether the given field can be modelled as a format preserving hashing column instead of categorical.

Properties

  • type = unique_values

  • unique_ratio_threshold: Number (double).
    The fraction of unique values. The null values are not taken into account.

  • min_table_size_threshold: Integer.
    Minimum table size, as for small tables the unique_ratio_threshold can lead to false positives.

IsEmptyCondition

The column is empty.

Properties

  • type = is_empty

ParentTransformationParamsCondition

If the field refers to a FK, check whether the parent column is transformed by a specific transformer

Properties

  • type = parent_transformation

IsUuidCondition

The column is of UUID type.

Properties

  • type = is_uuid

IsIdentityCondition

The column is IDENTITY.

Properties

  • type = is_identity

SingleDistinctValueCondition

The column contains only one distinct value. The null values are not taken into account.

Properties

  • type = single_distinct_value

DistinctValuesCondition

Check whether the given field distinct value number is greater than the specified threshold. The null values are not taken into account.

Properties

  • type = distinct_values

  • minimum_threshold: Integer.
    Distinct value number (non-negative)

StandardDeviationCondition

Check whether the given field standard deviation is greater than the specified threshold.

Properties

  • type = std_dev

  • std_dev_threshold: Number (double).
    Standard deviation threshold (non-negative)

NotCondition

Negation of the specified condition

Properties

  • type = not

SafetyModeCondition

The given safety mode is used.

Properties

  • type = safety_mode

IsIgnoredTableCondition

True if the table is ignored, e.g. target_ratio is 0

Properties

  • type = is_ignored_table

Transformation data type

Used in: data_type

String.

Enum values
  • TEXT

  • NUMERIC

  • DATE

  • BOOLEAN

  • ANY