Configuration reference

For TDK configuration we use YAML, which is a very simple and very powerful language.

Thanks to YAML, TDK allows:

Human-readable config
Ease and simplicity of config editing (you can just use your favorite text editor)
Code reuse (using standard anchors and aliases)

YAML configuration

Can be one of

User single database config
User multiple databases config

User single database config

Object.
User configuration used when working with a single database

Properties

default_config: Default configuration.

tables: array of User-defined table configuration.

cycle_resolution_strategy: Cycle resolution strategy.

schema_creation_mode: Schema Creation Mode.

table_truncation_mode: Table truncation mode.

safety_mode: Safety Mode.

global_seed: String.
A value used a seed for random number generators. Should be string in base64 encoded format. Basic alphabet (please refer to https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Base64.html) The result of generation must be the same each time the generation is being run with the same seed and workflow configuration. By default, this is set to an empty string, which corresponds to an empty byte array.

schemas: array of String.
List of schemas to process. If not set or null, all available schemas will be processed.

metadata: Additional Metadata.

use_text_column_heuristics: Boolean.
Enables column and table name heuristics for selecting transformations for text fields. If enabled, the address_generator, person_generator and finance_generator can be chosen for text fields in GENERATION mode.

flags: array of Configuration Flag.

scripts: Pre and post execution scripts.

User multiple databases config

Object.
User configuration used when working with multiple databases

Properties

transformation_configs: TransformationConfigs.

synchronised_transformations: array of array of String.
data_source_mapping: Mapping of data sources.

Default configuration

Used in: default_config

optional Object.
Consists of the default table configuration parameters that are applied to all tables by default and a list of rules that are conditionally applied to the table.

Properties

mode: Transformation mode.

target_ratio: optional Number (double).
The relative size of each table of the output database with respect to the input. The number of rows of each output table will be computed by multiplying this parameter by the input table size. If not provided, this parameter will be target_ratio = 1, resulting on same size for input and output databases. Can be overridden by target_row_number

target_row_number: optional Integer (int64).
The absolute size of each table of the output database in rows. This parameter is applicable only for GENERATION mode. If not provided, target_ratio will be used.

insert_batch_size: optional Integer.
Indicates how many table rows are inserted to the database per each batch operation. The default value for this parameter is set at 1000 rows.

use_working_directory: Boolean.
Whether to use the configured local working directory to perform certain I/O operations more efficiently. In some cases, using working directory can speed up table insertion speed by several times. However, the more tables use the working directory, the more space is needed for it. Note that this setting is only effective when the working directory is configured globally (see Working directory properties).

locale: String.
Default locale for specific transformations, such as person_generator and address_generator. This value can be overridden at the column transformation level.

Supported locales:

ar

bg

ca

ca-CAT

cs

da-DK

de

de-AT

de-CH

en

en-AU

en-CA

en-GB

en-IND

en-MS

en-NEP

en-NG

en-NZ

en-PAK

en-SG

en-UG

en-US

en-ZA

en-PH

es

es-MX

fa

fi-FI

fr

he

hu

in-ID

it

ja

ko

nb-NO

nl

pl

pt

pt-BR

ru

sk

sv

sv-SE

tr

uk

vi

zh-CN

zh-TW

in_memory_filter_threshold: optional Integer.
Advanced performance tuning parameter. This parameter determines whether values the parent table are loaded into memory during filtering operations, depending on the table size. If the size of the table does not exceed this specified value, the values will be loaded into memory for filtering for child tables. The higher this threshold, the faster the filter can potentially run, provided that all values fit in memory. But if there is not enough memory, it may cause an error. The default value for this parameter is set at 100_000 rows.

items: array of Default configuration item.

The rules used to select the appropriate transformation depending on the database metadata if one is not specified in the user’s configuration.

User-defined table configuration

Used in: tables

Object.
The parameters defined in the default configuration are applied to all tables in the database, so there’s no need to configure each table individually. But if needed, the user can override default configuration for any specific table present in the database. For each table, the user can create a UserTableConfig and add it to the list tables of YAML configuration.

Properties

table_name_with_schema: String.
The name of the table affected by this UserTableConfig. This value must contain both the schema name and table name, separated by a dot. In other words, this value must follow the format $schema.$table. In most cases, this value is case-insensitive (except when a database object has been created with a double-quoted name). Here are some examples of valid values: HR.employees, SCOT.SALGRADE, sakila.film, public.Actor, db2inst1.SALES. In these examples, HR, SCOT, sakila, public and db2inst1 are schema names, while employees, SALGRADE, film, Actor and SALES are table names.

In MySQL, a schema is synonymous with a database, as defined in the MySQL Glossary. Therefore, in MySQL, you should always place the database name as the first component of this property.

mode: Transformation mode.

transformations: array of Column transformation parameters.

target_ratio: optional Number (double).
The relative size of the output table with respect to the input. The number of rows of each output table will be computed by multiplying this parameter by the input table size. If not provided, this parameter will be target_ratio = 1, resulting on same size for input and output databases. Can be overridden by target_row_number

When setting target_ratio at a table level, the result may end up being smaller than the given value due to relationships with parent table. For example, if a customer table is set to target_ratio = 0.5, and its child table transactions has target_ratio = 1.0, the output transaction table will also end with half it’s samples due to its downstream dependency to the reduced table customer.

target_row_number: optional Integer (int64).
The absolute size of the output table in rows. This parameter is applicable only for GENERATION mode. If not provided, target_ratio will be used.

cycle_breaker_references: array of String.
When Cycle resolution strategy is FAIL, this list may contain a list of table names, references to which will be ignored during the data generation.

insert_batch_size: optional Integer.
Indicates how many table rows are inserted to the database per each batch operation. The default value for this parameter is set at 1000 rows.

use_working_directory: optional Boolean.
filter: String.
An expression to filter a source table data. Any SQL expression that can be used as the search condition in WHERE clause. For example:

org_id != 120 and custom_function() = 'foo'

in_memory_filter_threshold: optional Integer.
Advanced performance tuning parameter. This parameter determines whether values the parent table are loaded into memory during filtering operations, depending on the table size. If the size of the table does not exceed this specified value, the values will be loaded into memory for filtering for child tables. The higher this threshold, the faster the filter can potentially run, provided that all values fit in memory. But if there is not enough memory, it may cause an error. The default value for this parameter is set at 100_000 rows.

Cycle resolution strategy

Used in: cycle_resolution_strategy

optional String.
Defines how to deal with cycles in table relations via foreign keys.

FAIL: if this mode is selected, cycle_breaker_references should be provided in the configuration file. Otherwise, execution will fail if a circular reference is detected in the schema.
DELETE_NOT_REQUIRED: if this mode is selected, cyclic references will be resolved automatically by removing the last nullable reference that leads to the cycle. As a result, the cyclic reference columns will be masked to nulls.
DEFER_FOREIGN_KEY: if this mode is selected, cyclic references will be automatically resolved by disabling references that lead to the cycles during insertion, and then re-enabled back. As a result, the columns with cyclic references will be masked like any other keys in the schema. This strategy can only be used with the MASKING mode without subsetting, and only for Oracle, MSSQL, and DB2.

Example for a cycle breaker reference:

schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: FAIL
table_truncation_mode: TRUNCATE
safety_mode: "RELAXED"
default_config:
  mode: GENERATION
  target_ratio: 1.0
tables:
  - table_name_with_schema: "public.employees"
    cycle_breaker_references: ["public.employees"]

Where the employees table contains a cycle reference.

Enum values

FAIL
DELETE_NOT_REQUIRED
DEFER_FOREIGN_KEY

Schema Creation Mode

Used in: schema_creation_mode

optional String.
Defines the mode of schema creation.

The schema creation feature only restores parts of the original schema required for Synthesized execution, such as tables, constraints, and indexes. Other database objects such as triggers, views, procedures, functions, and any vendor-specific properties are not supported.

CREATE_IF_NOT_EXISTS

if this mode is selected, DDL schema will be copied from the source database to the target one if it does not exist, existing schema will be used otherwise.

DO_NOT_CREATE

if this mode is selected, the existing schema will be used without any validations. Please use this mode carefully: run-time errors may occur if the input and output schema do not match.

CREATE

if this mode is selected, DDL schema will be copied from the source database to the target one. The target database should be empty.

DROP_AND_CREATE

if this mode is selected, DDL schema will be copied from the source database to the target one. Existing schema in the target database will be dropped. Please use this mode carefully.

Enum values

DO_NOT_CREATE
CREATE
CREATE_IF_NOT_EXISTS
DROP_AND_CREATE

Table truncation mode

Used in: table_truncation_mode

optional String.
Defines the mode of table truncation.

DO_NOT_TRUNCATE

(default) if this mode is selected, tables in the target database won’t be truncated. An empty target database required.

TRUNCATE

if this mode is selected, tables in the target database will be truncated.

IGNORE

if this mode is selected, the status of the target database is ignored.

Enum values

DO_NOT_TRUNCATE
TRUNCATE
IGNORE

Safety Mode

Used in: safety_mode

optional String.
The mode of choosing transformations.

STRICT: if this mode is selected and no matching transformation found, then no passthrough, categorical_generator, null_generator, constant_generator transformations will be chosen by default. If no matching transformation found, an error occurs. If DEFER_FOREIGN_KEY cycle resolution strategy fails to re-enable a foreign key after transformation, an error occurs.
RELAXED: if this mode is selected and no matching transformation found, then the passthrough transformation will be applied for MASKING mode, and null_generator or categorical_generator for GENERATION mode. Constant generators constant_numeric, constant_string, constant_date, constant_boolean will be chosen by default where the source column contains the same value in all rows. If DEFER_FOREIGN_KEY cycle resolution strategy fails to re-enable a foreign key after transformation, a warning is shown in logs and transformation continues.

This property has no meaning for KEEP mode. Default: STRICT.

Enum values

STRICT
RELAXED

Additional Metadata

Used in: metadata

optional Object.
Metadata that is not defined at the database level but must be taken into account when performing transformations.

Properties

tables: array of Table Additional Metadata.

Configuration Flag

Used in: flags

optional String.
Options for TDK. Available flags:

DISABLE_DB_INDEXES_ON_WRITE

If this option enabled, TDK will try to temporarily disable non-unique database indexes before inserting data, subsequently re-enabling them afterward. This should enhance performance during write operations. Presently, this feature is available only for SQL Server.

NOTE: Indexes that are already disabled will be also enabled after the processing is complete.

DISABLE_DB_TRIGGERS_ON_WRITE

If this option enabled, TDK will try to temporarily disable triggers before inserting data, subsequently re-enabling them afterward. This should enhance performance during write operations and prevent side-effects from triggers. Presently, this feature is available only for SQL Server.

NOTE: If TDK failed during execution part of disabled triggers might be not enable back.

USE_FILE_BATCH_INSERT

if this flag is selected, TDK will try to use file batch insert. Works only for SQL Server. Doesn’t work with direct subsetting (without a working directory). Value mapping doesn’t work when this flag is enabled. Might be high CPU efficient.

NOT_FALLBACK_TO_ONE_BY_ONE_INSERTS

By default, insertion is performed in batches of configurable sizes (see insert_batch_size in Default configuration). When an error is encountered during the insertion, an attempt may be made to fall back to inserting single records instead of batch. By default, for DB2 and Oracle, TDK falls back to inserting single records instead of batch. For SQL Server, Postgres, MySQL and others TDK falls back to one-by-one inserts only when the SQL error is INTEGRITY_CONSTRAINT_VIOLATION or DATA_EXCEPTION. The NOT_FALLBACK_TO_ONE_BY_ONE_INSERTS flag: Turn off fallback to one-by-one inserts. If one of the batches gets an error all TDK execution will fail. Conflicts with the SKIP_FAILED_BATCHES and FORCE_FALLBACK_TO_ONE_BY_ONE_INSERTS flag.

SKIP_FAILED_BATCHES

By default, insertion is performed in batches of configurable sizes (see insert_batch_size in Default configuration). When an error is encountered during the insertion, an attempt may be made to fall back to inserting single records instead of batch. By default, for DB2 and Oracle, TDK falls back to inserting single records instead of batch. For SQL Server, Postgres, MySQL and others TDK falls back to one-by-one inserts only when the SQL error is INTEGRITY_CONSTRAINT_VIOLATION or DATA_EXCEPTION. The SKIP_FAILED_BATCHES flag: Skip failed batches without resorting to one-by-one inserts. Conflicts with the NOT_FALLBACK_TO_ONE_BY_ONE_INSERTS and FORCE_FALLBACK_TO_ONE_BY_ONE_INSERTS flag.

FORCE_FALLBACK_TO_ONE_BY_ONE_INSERTS

By default, insertion is performed in batches of configurable sizes (see insert_batch_size in Default configuration). When an error is encountered during the insertion, an attempt may be made to fall back to inserting single records instead of batch. By default, for DB2 and Oracle, TDK falls back to inserting single records instead of batch. For SQL Server, Postgres, MySQL and others TDK falls back to one-by-one inserts only when the SQL error is INTEGRITY_CONSTRAINT_VIOLATION or DATA_EXCEPTION. The FORCE_FALLBACK_TO_ONE_BY_ONE_INSERTS flag: Force one-by-one inserts in case of failed batches and ignore the type of SQL error. Conflicts with the NOT_FALLBACK_TO_ONE_BY_ONE_INSERTS and SKIP_FAILED_BATCHES flag.

Enum values

DISABLE_DB_INDEXES_ON_WRITE
DISABLE_DB_TRIGGERS_ON_WRITE
USE_FILE_BATCH_INSERT
NOT_FALLBACK_TO_ONE_BY_ONE_INSERTS
SKIP_FAILED_BATCHES
FORCE_FALLBACK_TO_ONE_BY_ONE_INSERTS

Pre and post execution scripts

Used in: scripts

Object.
SQL scripts to be executed on the target database before or after the transformation process.

Example:

scripts:
  pre:
    source: INLINE
    script: |
      ALTER TABLE public.transaction DISABLE TRIGGER ALL;
  post:
    source: FILE
    path: post_script.sql

The post script will be executed after the transformation process in any case, including errors during the execution of the pre-script or the transformation process.

Properties

pre: Pre and post execution script.

post: Pre and post execution script.

TransformationConfigs

Used in: transformation_configs

map of String keys to`User single database config`.

Map of named transformation configs. Keys of this map are transformation config ids and values are objects of type User single database config.

Mapping of data sources

Used in: data_source_mapping

map of String keys to`DataSourceMappingEntry`.

Mapping of data sources defined in the Inventory file.

Transformation mode

Used in: mode, mode

optional String.
Defines table processing mode.

KEEP: if this mode is selected, the original data will be copied as it is. When this mode is selected, the output size needs to be smaller than the input, i.e. target_ratio <= 1.
MASKING: if this mode is selected, masking transformations will be applied to the original data. When this mode is selected, the output size needs to be smaller than the input, i.e. target_ratio <= 1.
GENERATION: if this mode is selected, the synthesized engine will learn the original data and generate new synthetic data. For this mode, the output database can be bigger than the input, so target_ratio can be greater than 1.

Both KEEP and MASKING modes apply a transformation to original data. While KEEP uses passthrough as default transformation, while MASKING automatically assigns a privacy preserving masking transformation to all columns. See transformations list for more details. For all modes, the user can override default transformers.

Enum values

MASKING
GENERATION
KEEP

Default configuration item

Used in: items

Object.
The rule that applies to the tables by default, written in form "if given conditions are met, the given Transformation parameters are applied."

Properties

id: optional String.
Optional identifier of the rule (generally needed for debugging).

conditions: array of Condition.

The list of conditions that must be met in order for the transformation params to be applied.

transformation: Transformation parameters.

Column transformation parameters

Used in: transformations

Object.
List of column names associated with Transformation parameters.

Properties

columns: array of String.
List of columns that are affected by this generator.

params: Transformation parameters.

id: optional String.
mapping: ValueMapping.

mode: Transformation mode.

Table Additional Metadata

Used in: tables

Object.

Properties

table_name_with_schema: String.
foreign_keys: Additional Foreign Keys.

Pre and post execution script

Used in: pre, post

optional Object.

Depending on source property value, can be one of the following:

INLINE

Inline pre and post execution script

FILE

File pre and post execution script

DataSourceMappingEntry

Used in: data_source_mapping

Object.

Properties

source: String.
targets: array of String.

Condition

Used in: conditions

Object.
Condition on which the default parameters are being applied.

Depending on type property value, can be one of the following:

is_key

IsKeyCondition

is_primary_key

IsPrimaryKeyCondition

is_foreign_key

IsForeignKeyCondition

is_ignored_foreign_key

IsIgnoredForeignKeyCondition

is_unique

IsUniqueCondition

is_nullable

IsNullableCondition

mode_in

ModeInCondition

true

TrueCondition

data_type

DataTypeCondition

unique_values

UniqueValuesCondition

is_empty

IsEmptyCondition

parent_transformation

ParentTransformationParamsCondition

no_parent_transformation

NoParentTransformationParamsCondition

is_uuid

IsUuidCondition

is_identity

IsIdentityCondition

is_autogenerated

IsAutogeneratedCondition

single_distinct_value

SingleDistinctValueCondition

distinct_values

DistinctValuesCondition

std_dev

StandardDeviationCondition

not

NotCondition

safety_mode

SafetyModeCondition

is_ignored_table

IsIgnoredTableCondition

text_column_heuristics

TextColumnHeuristicsEnabledCondition

column_table_name_regex

ColumnTableNameRegexCondition

is_computable_column_condition

IsComputableColumnCondition

ValueMapping

Used in: mapping

Object.
The pair of SQL expressions which are applied while reading and writing values from the specified column. The expression should be a valid SQL that can contain the macro ?. This macro will be replaced with a column name on reading and transformed value on writing.

When using plain SQL expressions, you must guarantee syntax integrity. You may create the possibility of malicious SQL injection. Be sure to properly use this functionality.

Properties

read: String.
The SQL expression which is applied while reading the column value from a source database. The value "?" can be used if the mapping is not required.

write: String.
The SQL expression which is applied while writing the column value to a target database. The value "?" can be used if the mapping is not required.

Additional Foreign Keys

Used in: foreign_keys

map of String keys to`Additional Foreign Key`.

Additional foreign keys that are not defined at the database level but must be taken into account when performing transformations.

Inline pre and post execution script

optional Pre and post execution script

The inline SQL script that will be executed before or after the transformation process.

Properties

source = INLINE

script: String.
SQL script to be executed.

File pre and post execution script

optional Pre and post execution script

The file with SQL script that will be executed before or after the transformation process.

Properties

source = FILE

path: String.
Script file location. The script can be located on local file system, AWS S3 and Google Storage. In the case of a local file system, the path can be absolute or relative to the application process’s working directory (not to be confused with working directory) To be able to load scripts from S3 the property TDK_AWS_ENABLED==true should be set. More details can be found here. The property TDK_GCP_ENABLED==true allows loading scripts from Google Storage. More details can be found here.

IsKeyCondition

Condition

The column is a part of either primary or a foreign key.

Properties

type = is_key

IsPrimaryKeyCondition

Condition

The column is a part of a primary key.

Properties

type = is_primary_key

IsForeignKeyCondition

Condition

The column is a part of a foreign key.

Properties

type = is_foreign_key

IsIgnoredForeignKeyCondition

Condition

Applied to columns that are the part of FK which is ignored due to cycles

Properties

type = is_ignored_foreign_key

IsUniqueCondition

Condition

The column is either a part of primary key or UNIQUE constraint.

Properties

type = is_unique

IsNullableCondition

Condition

The column is nullable

Properties

type = is_nullable

ModeInCondition

Condition

The transformation mode is in a given array.

Properties

type = mode_in

modes: array of Transformation mode.

TrueCondition

Condition

Always true, thus making the rule applicable to every column.

Properties

type = true

DataTypeCondition

Condition

The column has one of the given data types.

Properties

type = data_type

data_type: Transformation data type.

UniqueValuesCondition

Condition

Check whether the given field can be modelled as a format preserving hashing column instead of categorical.

Properties

type = unique_values

unique_ratio_threshold: Number (double).
The fraction of unique values. The null values are not taken into account.

min_table_size_threshold: Integer.
Minimum table size, as for small tables the unique_ratio_threshold can lead to false positives.

IsEmptyCondition

Condition

The column is empty.

Properties

type = is_empty

ParentTransformationParamsCondition

Condition

If the field refers to a FK, check whether the parent column is transformed by a specific transformer

Properties

type = parent_transformation

parent_transformation_params: Transformation parameters.

NoParentTransformationParamsCondition

Condition

If the field refers to a FK, check that the parent transformation for specific column is not defined

Properties

type = no_parent_transformation

IsUuidCondition

Condition

The column is of UUID type.

Properties

type = is_uuid

IsIdentityCondition

Condition

The column is IDENTITY.

Properties

type = is_identity

IsAutogeneratedCondition

Condition

The column is AUTOGENERATED.

Properties

type = is_autogenerated

SingleDistinctValueCondition

Condition

The column contains only one distinct value. The null values are not taken into account.

Properties

type = single_distinct_value

DistinctValuesCondition

Condition

Check whether the given field distinct value number is greater than the specified threshold. The null values are not taken into account.

Properties

type = distinct_values

minimum_threshold: Integer.
Distinct value number (non-negative)

maximum_threshold: optional Integer.
Distinct value number (non-negative)

StandardDeviationCondition

Condition

Check whether the given field standard deviation is greater than the specified threshold.

Properties

type = std_dev

std_dev_threshold: Number (double).
Standard deviation threshold (non-negative)

NotCondition

Condition

Negation of the specified condition

Properties

type = not

condition: Condition.

SafetyModeCondition

Condition

The given safety mode is used.

Properties

type = safety_mode

mode: Safety Mode.

IsIgnoredTableCondition

Condition

True if the table is ignored, e.g. target_ratio is 0

Properties

type = is_ignored_table

TextColumnHeuristicsEnabledCondition

Condition

Check if the property use_text_column_heuristics is enabled

Properties

type = text_column_heuristics

ColumnTableNameRegexCondition

Condition

Check if the column and table name matches the specified regular expression patterns. During the detection process, table and column names will be preprocessed: * they will be lower-cased * non-alpha characters will be removed

Properties

type = column_table_name_regex

column_regex: String.
Column selection pattern.

table_regex: String.
Table selection pattern. Any table matches by default.

IsComputableColumnCondition

Condition

Applied to columns that are auto-computable.

Properties

type = is_computable_column_condition

Additional Foreign Key

Used in: foreign_keys

Object.

Properties

referred_database: String.
referred_schema: String.
referred_table: String.
columns: array of Additional Foreign Key Column Mapping.

Transformation data type

Used in: data_type

String.

Enum values

TEXT
NUMERIC
DATE
BOOLEAN
ANY

Additional Foreign Key Column Mapping

Used in: columns

Object.

Properties

column: String.
referred_column: String.