Transformations

Categorical generator

Randomly samples values from a given mapping of categories to weights. Categories and weights can be provided from various data sources or learned from the data.

The following example demonstrates how to accomplish this:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            value_source: PROVIDED
            values:
              "sent": 0.5
              "received": 0.4
              "skipping_value": 0.0
              "null": 0.1

You can also specify a CSV file as the source of categories. The following example shows the complete configuration:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            value_source: CSV_FILE
            path: src/e2e/resources/data_with_header.csv
            null_values: ["null", ""]
            format:
              columns:
                column_accessor_type: NAME
                categories: "title"
                weights: "rank"
              encoding: "UTF-8"
              delimiter: ","
              trim: true

For simple scenarios, a minimal configuration that relies on the default values may be sufficient. The following example illustrates this approach:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            value_source: CSV_FILE
            path: src/e2e/resources/data.csv

This transformation can be applied to several columns at once. The following example shows how the generator handles multiple columns with provided tuples of categories that always appear together:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_PROVIDED
            null_values: ["nil"]
            category_values:
              - values:
                  productcode: "P1"
                  productname: "Product 1"
                weight: 0.5
              - values:
                  productcode: "P2"
                  productname: "Product 2"
                weight: 0.3
              - values:
                  productcode: "P3"
                  productname: "Product 3"
                weight: 0.2
              - values:
                  productcode: "nil"
                  productname: "nil"
                weight: 0.5

You can also specify a CSV file as the source of categories for multiple columns.

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_CSV_FILE
            path: src/e2e/resources/data_multi.csv

Example of advanced configuration with multiple columns and multiple categories:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_CSV_FILE
            path: src/e2e/resources/data_with_header_multi.csv
            null_values: ["null", ""]
            format:
              columns:
                column_accessor_type: NAME
                categories:
                  productcode: "code"
                  productname: "name"
                weights: "rank"
              encoding: "UTF-8"
              delimiter: ","
              trim: true

The generator normalizes the weights to the probability interval [0, 1], so the weights do not have to sum to 1. If the sum of the weights exceeds the range of the double format, use weights with a smaller scale factor.
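
The normalization can be pictured as dividing every weight by the total. The following Python sketch is illustrative only and is not the generator's implementation; the helper name normalize_weights is hypothetical.

```python
import random

def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1 (hypothetical helper)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {category: weight / total for category, weight in weights.items()}

# Weights from the MULTIPLE_PROVIDED example above; they sum to 1.5, not 1.
weights = {"P1": 0.5, "P2": 0.3, "P3": 0.2, "nil": 0.5}
probabilities = normalize_weights(weights)

# Draw categories according to the normalized probabilities.
rng = random.Random(42)
sample = rng.choices(list(probabilities), weights=list(probabilities.values()), k=5)
print(probabilities)
print(sample)
```

Under this reading, a category with weight 0.0, such as "skipping_value" in the first example, is assigned probability 0.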

Properties

type = categorical_generator

categories: Categories source configuration.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Conditional generator

Uses one of two transformations (generators), depending on the value of a given column of the parent table. For example, a conditional generator can apply different generators depending on the value of the "gender" column of the parent table.

Example:

    transformations:
      - columns: [ "status" ]
        params:
          type: conditional_generator
          conditional_table: "public.delivery"
          conditional_column: "status"
          conditional_value: "DONE"
          if_true:
            type: constant_string
            value: "CLOSED"
          if_false:
            type: constant_string
            value: "OPEN"

Properties

type = conditional_generator

conditional_column: String.
Parent column.

conditional_table: optional String.
Parent table.

conditional_value: optional String.
Value to compare with. If the value of the parent column is equal to conditional_value, the if_true generator is used; otherwise, the if_false generator is used.

if_false: Transformations.

if_true: Transformations.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes
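
The selection rule described for conditional_value can be sketched in a few lines of Python. This is only an illustration of the comparison; the function choose_generator, the stand-in generators, and the "PENDING" value are hypothetical.

```python
def choose_generator(parent_value, conditional_value, if_true, if_false):
    """Return the generator to apply for one row (illustrative only)."""
    return if_true if parent_value == conditional_value else if_false

# Stand-ins for the two constant_string generators from the example above.
closed = lambda: "CLOSED"
opened = lambda: "OPEN"

for delivery_status in ["DONE", "PENDING"]:
    generator = choose_generator(delivery_status, "DONE", closed, opened)
    print(delivery_status, "->", generator())
```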

Continuous generator

Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.

Example:

    transformations:
      - columns:
          - "amount"
        params:
          type: continuous_generator
          mean: 354.21
          std: 98.96
          min: 0.0

Properties

type = continuous_generator

mean: optional Number (double).
Mean of the sampled distribution

std: optional Number (double).
Standard Deviation

min: optional Number (double).
Minimum value

max: optional Number (double).
Maximum value

numeric_type: Numeric type.

round: Integer.
If given, output data will be rounded to this number of digits

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Quantile generator

Given a list of probabilities and bin edges, the output data is sampled from a mixture of uniform distributions, where each uniform distribution i is chosen with probability probabilities[i] and its edges are given by bin_edges[i] and bin_edges[i + 1]. If parameters are not given, they will be fitted from the original data.

Example:

    transformations:
      - columns: ["amount"]
        params:
          type: quantile_generator
          hist: [0.5, 0.2, 0.3]
          bin_edges: [0.1, 0.15, 0.3, 0.45]
          numeric_type: DOUBLE
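
The mixture of uniform distributions can be sketched as follows, using the hist and bin_edges values from the example above. This is a minimal Python illustration of the sampling idea, not the generator's actual code; the function name sample_quantile is hypothetical.

```python
import random

def sample_quantile(hist, bin_edges, rng):
    """Draw one value from a mixture of uniform distributions.

    Bin i is chosen with probability hist[i]; the value is then drawn
    uniformly between bin_edges[i] and bin_edges[i + 1].
    """
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must have exactly one more entry than hist")
    i = rng.choices(range(len(hist)), weights=hist, k=1)[0]
    return rng.uniform(bin_edges[i], bin_edges[i + 1])

rng = random.Random(0)
hist = [0.5, 0.2, 0.3]
bin_edges = [0.1, 0.15, 0.3, 0.45]
values = [sample_quantile(hist, bin_edges, rng) for _ in range(10_000)]

# Roughly half of the samples should land in the first bin.
share_first_bin = sum(v < 0.15 for v in values) / len(values)
print(round(share_first_bin, 2))
```

Note that bin_edges always has one more entry than hist: three probabilities need four edges.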

Properties

type = quantile_generator

hist: optional array of Number (double).
Probabilities of each uniform distribution.

bin_edges: optional array of Number (double).
Bin edges of each uniform distribution.

numeric_type: Numeric type.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Foreign key generator

Fills columns with the parent table’s primary key values of a random row.

Normally, this generator is created implicitly wherever data has to be generated for tables related by foreign keys, so the user should not need to set it up explicitly.

Properties

type = foreign_key_generator

distribution: Foreign key distribution.

parent_data_mode: Parent data mode.

referred_schema: optional String.
referred_table: optional String.
referred_fields: optional array of String.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Unique generator

This generator is intended for the case where primary key values are part of the foreign key.

Normally, this generator is created implicitly wherever data has to be generated for tables related by foreign keys, so the user should not need to set it up explicitly.

Properties

type = unique_generator

transformations: optional array of Column transformation parameters.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Format preserving hashing

A hash transformation is applied to each character of the given text that belongs to a configured group, so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.
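
The general idea can be sketched in Python as follows. The actual hashing algorithm is not described here, so the use of HMAC-SHA256, the key parameter, and the function name hash_preserving_format are all assumptions made for illustration.

```python
import hashlib
import hmac
import string

# Each group pairs a selector (which characters to replace) with an alphabet
# (what to replace them with). Earlier groups take precedence.
GROUPS = [
    (string.digits, string.digits),
    (string.ascii_lowercase, string.ascii_lowercase),
    (string.ascii_uppercase, string.ascii_uppercase),
]

def hash_preserving_format(text, key):
    """Replace each selected character with a keyed-hash-derived character
    from the group's alphabet, leaving all other characters untouched."""
    out = []
    for position, char in enumerate(text):
        for selector, alphabet in GROUPS:
            if char in selector:
                message = f"{position}:{text}".encode("utf-8")
                digest = hmac.new(key, message, hashlib.sha256).digest()
                index = int.from_bytes(digest[:8], "big") % len(alphabet)
                out.append(alphabet[index])
                break  # first matching group wins
        else:
            out.append(char)  # not in any group: keep as is
    return "".join(out)

masked = hash_preserving_format("AB-1234-cd", key=b"demo-key")
print(masked)
```

In this sketch, digits stay digits, letters keep their case, and the separators are left unchanged, so the masked value has the same shape as the input.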

Examples:

Default configuration:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: digits
              alphabets:
                - type: digits
            - selector:
                type: lower_letters
              alphabets:
                - type: lower_letters
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters

Mask only last 5 characters:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: "last"
            n: 5

Mask only substring ignoring case:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: substring
            substring: sub
            ignore_case: true

Mask only a set of characters ignoring case:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: characters
            characters: "abc"
            ignore_case: true

Mask characters selected by regex with a custom alphabet:

    transformations:
      - columns: ["phone_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: regex
                pattern: "[123]"
              alphabets:
                - type: custom
                  parts:
                    - type: characters
                      characters: "456"
                    - type: characters
                      characters: "789"
                    - type: unicode_block
                      name: LATIN_EXTENDED_D
                    - type: unicode_block
                      name: "Latin Extended-A"
                    - type: unicode_range
                      from: 0x0D00
                      to: 0x0D7F

Properties

type = format_preserving_hashing

groups: array of Hashing group.

Hashing groups to apply on top of the specified filter. Multiple groups can be configured. In that case, the groups are matched against each region of the filtered value in the order in which they are specified in the configuration. When a group matches, its alphabet is used for the transformation, and no other groups are tried for that region. This implies that the most specific hashing groups must be specified first in the configuration. An unspecified parameter or null is equivalent to the following:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: digits
              alphabets:
                - type: digits
            - selector:
                type: lower_letters
              alphabets:
                - type: lower_letters
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters
            - selector:
                type: word_characters
              alphabets:
                - type: digits
                - type: lower_letters
                - type: upper_letters

locale: optional String.
To generate a string sequence with letters from different alphabets, the user can change this parameter when a hashing group is not explicitly specified. The default value is en-GB, representing the Latin alphabet.

Supported locales:
- en, ca, cs, da, de, es, fi, fr, hu, in, it, nb, nl, pl, pt, sk, sv, tr, vi - The Latin alphabet characters
- ar, fa - The ARABIC Unicode block
- bg, ru, uk - The CYRILLIC Unicode block
- he - The HEBREW Unicode block
- ja - The HIRAGANA, KATAKANA, CJK_UNIFIED_IDEOGRAPHS Unicode blocks
- ko - The HANGUL_JAMO Unicode block
- zh - The CJK_UNIFIED_IDEOGRAPHS Unicode block

filter: Format preserving hashing filter.

length_threshold: optional Integer (int32).
The length threshold at which the masker is applicable. If any value in the column exceeds this length, execution will not start and will raise a validation error.

Compatible modes: MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Formatted string generator

Generates a string column based on a given pattern. If the pattern is not given, the generator produces random characters with a length similar to that of the original column.

Example:

    transformations:
      - columns:
          - "phone_number"
        params:
          type: formatted_string_generator
          pattern: "\\+44[0-9]{10}"
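
To see what the pattern above accepts, the following Python sketch builds one matching value by hand and checks it with the re module. It only illustrates the pattern; it is not how the generator samples values, and sample_phone_number is a hypothetical helper written for this one pattern.

```python
import random
import re

# In a double-quoted YAML string, "\\+44[0-9]{10}" denotes the regular
# expression \+44[0-9]{10}: a literal plus sign, "44", then ten digits.
PATTERN = re.compile(r"\+44[0-9]{10}")

def sample_phone_number(rng):
    """Build one string that matches PATTERN (hand-written for this pattern only)."""
    return "+44" + "".join(rng.choice("0123456789") for _ in range(10))

rng = random.Random(7)
phone = sample_phone_number(rng)
print(phone, bool(PATTERN.fullmatch(phone)))
```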

Properties

type = formatted_string_generator

pattern: optional String.
Regular expression pattern from which the data is sampled.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Integers sequence generator

Generates a sequence of unique integers, suitable for a unique ID column.

Properties

type = int_sequence_generator

start_from: optional Integer.
Value at which the sequence starts; defaults to 0. If the generator is used on existing data, set this to the maximum of the existing data plus 1.

Example:

    transformations:
      - columns:
          - "user_id"
        params:
          type: int_sequence_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No
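
The start_from rule for existing data can be illustrated with a short Python sketch. The existing values are hypothetical, and a step of 1 between consecutive values is an assumption.

```python
from itertools import count, islice

def int_sequence(start_from=0):
    """Yield consecutive integers beginning at start_from."""
    return count(start_from)

existing_ids = [3, 17, 42]            # hypothetical values already in the column
start_from = max(existing_ids) + 1    # continue right after the current maximum
new_ids = list(islice(int_sequence(start_from), 5))
print(new_ids)  # [43, 44, 45, 46, 47]
```

Starting at the maximum plus 1 guarantees that the new values cannot collide with the existing ones.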

String sequence generator

Generates a sequence of unique strings, suitable for a unique ID column. The strings consist of uppercase alphabetic and numeric characters.

Example:

    transformations:
      - columns: ["country_id"]
        params:
          type: string_sequence_generator

Properties

type = string_sequence_generator

length: optional Integer.
Maximum length of the column; extracted from the database DDL if not given.

start_from: optional String.
Value at which the sequence starts; defaults to an empty string. If the generator is used on existing data, set this to the maximum of the existing data with a shift.

alphabets: array of Alphabet.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Noising transformation

Adds Laplacian noise to the input column to protect privacy while producing output values similar to the originals.

Example:

    transformations:
      - columns: ["product_price"]
        params:
          type: noising
          sensitivity: 23.47
          min: 0
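
The effect of the transformation can be sketched in Python as follows. Treating sensitivity directly as the scale of the Laplace distribution is an assumption made for this illustration, as are the function names and the sample prices.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_noise(value, scale, rng, minimum=None, maximum=None):
    """Add Laplace noise, then truncate to the optional hard limits."""
    noisy = value + laplace_noise(scale, rng)
    if minimum is not None:
        noisy = max(noisy, minimum)
    if maximum is not None:
        noisy = min(noisy, maximum)
    return noisy

rng = random.Random(1)
prices = [5.0, 19.99, 120.0]   # hypothetical product_price values
noisy_prices = [add_noise(p, scale=23.47, rng=rng, minimum=0) for p in prices]
print(noisy_prices)
```

With min: 0, a small price that receives a large negative noise value is truncated to 0 rather than becoming negative.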

Properties

type = noising

sensitivity: optional Number (double).
Amount of noise to be added.

min: optional Number (double).
Hard minimum. Output values smaller than this are truncated to it.

max: optional Number (double).
Hard maximum. Output values greater than this are truncated to it.

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Null generator

The output column is filled with null values

Example:

    transformations:
      - columns: ["empty_column"]
        params:
          type: null_generator

Properties

type = null_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Passthrough transformation

The output data is equal to the input; no transformation is applied.

Example:

    transformations:
      - columns: ["customer_number", "plate"]
        params:
          type: passthrough

Properties

type = passthrough

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Person generator

Generate personal fields (e.g., name, surname, title) and keep them consistent across columns.

Available templates are:

${email}
${title}
${first_name}
${male_first_name}
${female_first_name}
${last_name}
${full_name}
${username}
${company}
${phone_national}
${phone_international}
${ssn}

Supported locales:

ar, bg, ca, ca-CAT, cs, da-DK, de, de-AT, de-CH, en, en-AU, en-CA, en-GB, en-IND, en-MS, en-NEP, en-NG, en-NZ, en-PAK, en-SG, en-UG, en-US, en-ZA, en-PH, es, es-MX, fa, fi-FI, fr, he, hu, in-ID, it, ja, ko, nb-NO, nl, pl, pt, pt-BR, ru, sk, sv, sv-SE, tr, uk, vi, zh-CN, zh-TW

Example for several columns:

    transformations:
      - columns: ["first_name", "last_name"]
        params:
          type: person_generator
          column_templates: ["${first_name}", "${last_name}"]

Example for a single column:

    transformations:
      - columns: ["full_name"]
        params:
          type: person_generator
          column_templates: ["${first_name} ${last_name}"]

Properties

type = person_generator

column_templates: array of String.
For each column, the template used to generate personal data.

consistent_with_column: String.
If given, the column with which the generated values must be consistent. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name. The "self" value means consistency with the source value.

locale: optional String.
To generate names from different geographical areas, the user can change this parameter. Defaults to 'en-GB', which corresponds to British names.

length_exceeded_mode: Value length exceeded mode.

column_lengths: optional array of Integer.
Maximum allowable lengths for the columns. Ignored when length_exceeded_mode is "IGNORE".

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Address generator

Generate address fields (e.g., street, zip code) and keep them consistent across columns. Available templates are:

${zip_code}
${country}
${city}
${street_name}
${house_number}
${flat_number}
${full_address}
${street_address}
${region}
${latitude}
${longitude}
${coordinates}
${time_zone}

Supported locales:

ar, bg, ca, ca-CAT, cs, da-DK, de, de-AT, de-CH, en, en-AU, en-CA, en-GB, en-IND, en-MS, en-NEP, en-NG, en-NZ, en-PAK, en-SG, en-UG, en-US, en-ZA, en-PH, es, es-MX, fa, fi-FI, fr, he, hu, in-ID, it, ja, ko, nb-NO, nl, pl, pt, pt-BR, ru, sk, sv, sv-SE, tr, uk, vi, zh-CN, zh-TW

Example for several columns:

    transformations:
      - columns: [ "street_name", "zip_code" ]
        params:
          type: address_generator
          column_templates: [ "${street_name}", "${zip_code}" ]

Example for a single column:

    transformations:
      - columns: ["address"]
        params:
          type: address_generator
          column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]

Properties

type = address_generator

column_templates: array of String.
For each column, the template used to generate address data.

consistent_with_column: String.
If given, the column with which the generated values must be consistent. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street. The "self" value means consistency with the source value.

locale: optional String.
To generate addresses from different geographical areas, the user can change this parameter. Defaults to 'en-GB', which corresponds to addresses in Great Britain.

length_exceeded_mode: Value length exceeded mode.

column_lengths: optional array of Integer.
Maximum allowable lengths for the columns. Ignored when length_exceeded_mode is "IGNORE".

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Finance generator

Generate financial data.

Available templates:

${credit_card}
${bic}
${iban}
${nasdaq_ticker}
${nyse_ticker}
${stock_market}
${us_routing_number}

The template credit_card (without the card type qualification) will result in a random type being picked.

The credit card type can be configured using the following templates:

${credit_card.visa}
${credit_card.mastercard}
${credit_card.discover}
${credit_card.american_express}
${credit_card.diners_club}
${credit_card.jcb}
${credit_card.switch}
${credit_card.solo}
${credit_card.dankort}
${credit_card.forbrugsforeningen}
${credit_card.laser}

Example:

    transformations:
      - columns: [ "credit_card" ]
        params:
          type: finance_generator
          column_templates: [ "${credit_card.visa}" ]

Properties

type = finance_generator

column_templates: array of String.
For each column, the template to be used to generate financial data

consistent_with_column: String.
If given, the column with which the generated values must be consistent. The "self" value means consistency with the source value.

length_exceeded_mode: Value length exceeded mode.

column_lengths: optional array of Integer.
Maximum allowable lengths for the columns. Ignored when length_exceeded_mode is "IGNORE".

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Redaction masker

Some characters of the input string are replaced with the same masking character, producing partially masked text in the output.

Example:

    transformations:
      - columns: ["credit_card"]
        params:
          type: redaction
          action: MASK
          which: FIRST
          count: 4
          mask_with: "#"
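
The configuration above (action MASK, which FIRST, count 4) behaves roughly like the following Python sketch. The function name redact_first is hypothetical, and the handling of values shorter than count is an assumption.

```python
def redact_first(value, count=4, mask_with="*"):
    """Replace the first `count` characters of value with mask_with."""
    n = min(count, len(value))  # assumption: short values are masked entirely
    return mask_with * n + value[n:]

print(redact_first("4111111111111111", count=4, mask_with="#"))
# ####111111111111
```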

Properties

type = redaction

action: Action.

which: Position.

count: Integer.
Number of characters to be masked or kept; defaults to 4.

mask_with: String.
Character used to mask values; defaults to *.

Compatible modes: MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Unique hashing

Applies a hash transformation to a given value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Example:

    transformations:
      - columns: ["card_id"]
        params:
          type: unique_hashing

Properties

type = unique_hashing

max_value: Number (double).
Max value to generate, null means absence of limit

precision: Integer.
Maximum precision to generate (e.g., if the value is 3, the maximum value is 999); null means no limit. If both max_value and precision are specified, the smaller of the two limits applies.

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No
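
The interaction between max_value and precision can be illustrated with a small Python sketch. The helper effective_max is hypothetical and only mirrors the rule stated in the property descriptions.

```python
def effective_max(max_value=None, precision=None):
    """Return the upper limit implied by max_value and precision, or None."""
    limits = []
    if max_value is not None:
        limits.append(max_value)
    if precision is not None:
        limits.append(10 ** precision - 1)   # precision 3 -> 999
    return min(limits) if limits else None

print(effective_max(precision=3))                  # 999
print(effective_max(max_value=500, precision=3))   # 500
print(effective_max())                             # None
```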

Date unique hashing

Applies a hash transformation to a date-time value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Example:

    transformations:
      - columns: ["create_date"]
        params:
          type: date_time_unique_hashing
          min: 2000-01-01T12:00:00Z
          max: 2022-01-01T12:00:00Z

Properties

type = date_time_unique_hashing

min: optional String (date-time).
Minimum value

max: optional String (date-time).
Maximum value

Compatible modes: MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Date generator

Output data is sampled from a parameterized continuous distribution and transformed into dates. If parameters are not given, they will be extracted from the original data.

Example:

    transformations:
      - columns:
          - "date_of_birth"
        params:
          type: date_generator
          mean: 2018-02-01T12:00:00Z
          std: 2d 4h 45m 12s 434ms
          min: 2000-01-01T12:00:00Z
          max: 2022-01-01T12:00:00Z

Properties

type = date_generator

mean: optional String (date-time).
Average date of the sampled distribution

std: optional String.
Standard deviation. The following formats are accepted:
- ISO-8601 Duration format, e.g., P1DT2H3M4.058S.
- A concise format, e.g., 10s, 1h 30m or -(1h 30m).
- Milliseconds without the specific unit, e.g., 12534.
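
As an illustration of two of these formats, the following Python sketch parses plain milliseconds and the concise unit format into a timedelta. The unit meanings (d, h, m, s, ms) are assumed from the examples, the function name parse_std is hypothetical, and ISO-8601 durations and the negated form are deliberately left out.

```python
import re
from datetime import timedelta

_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds", "ms": "milliseconds"}
_TOKEN = re.compile(r"(\d+)(ms|d|h|m|s)")

def parse_std(text):
    """Parse a standard deviation given as plain milliseconds or in the
    concise unit format (e.g. '1h 30m'). ISO-8601 durations and the
    negated form -(...) are not handled in this sketch."""
    text = text.strip()
    if text.isdigit():
        return timedelta(milliseconds=int(text))
    tokens = text.split()
    if not tokens:
        raise ValueError("empty duration")
    parts = {}
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported duration token: {token!r}")
        amount, unit = match.groups()
        parts[_UNITS[unit]] = parts.get(_UNITS[unit], 0) + int(amount)
    return timedelta(**parts)

print(parse_std("2d 4h 45m 12s 434ms"))
print(parse_std("12534"))
```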

min: optional String (date-time).
Minimum value

max: optional String (date-time).
Maximum value

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

UUID generator

The output column is filled with UUIDs.

Example:

    transformations:
      - columns: ["unique_id"]
        params:
          type: uuid_generator

Properties

type = uuid_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Constant numeric generator

Generates a single numeric value for the entire column

Example:

    transformations:
      - columns: [ "balance" ]
        params:
          type: constant_numeric
          value: 0.0

Example (range):

    transformations:
      - columns: [ "balance" ]
        params:
          type: constant_numeric
          min: 0.0
          max: 10000.0

Properties

type = constant_numeric

value: optional Number.
numeric value to generate

min: optional Number.
The lower boundary for the value (inclusive)

max: optional Number.
The upper boundary for the value (exclusive)

numeric_type: Numeric type.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Constant string generator

Generates a single string value for the entire column

Example:

    transformations:
      - columns: [ "status" ]
        params:
          type: constant_string
          value: "ACTIVE"

Properties

type = constant_string

value: optional String.
string value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Default value generator

Generates a single default value for the entire column based on its type. It will generate default values as follows:

- JSON, JSONB: an empty document like {}
- XML: an empty root document like <root/>

Properties

type = default_value

default_value: optional String.
Default value

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Constant date generator

Generates a single date value for the entire column

Example:

    transformations:
      - columns: [ "creation_date" ]
        params:
          type: constant_date
          value: 2022-07-28T12:21:00Z

Example (range):

    transformations:
      - columns: [ "creation_date" ]
        params:
          type: constant_date
          min: 2022-07-01T00:00:00Z
          max: 2022-07-31T23:59:59Z

Properties

type = constant_date

value: optional String (date-time).
date value to generate

min: optional String (date-time).
The lower boundary for the value (inclusive)

max: optional String (date-time).
The upper boundary for the value (exclusive)

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Constant boolean generator

Generates a single boolean value for the entire column

Example:

    transformations:
      - columns: [ "is_active" ]
        params:
          type: constant_boolean
          value: true

Properties

type = constant_boolean

value: optional Boolean.
boolean value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: BOOLEAN

Supports multiple columns: No

Loop generator

Generates a sequence of elements in a loop that can be repeated. Data can be obtained from various sources or generated based on existing data. The following example demonstrates how to accomplish this:

    transformations:
      - columns:
          - transaction_type
        params:
          type: "loop_generator"
          repeatable: true
          source:
            value_source: "PROVIDED"
            values:
              - "sent"
              - "skipping_value"
              - "received"
              - null

This transformation can be applied to several columns at once. The following example shows how the generator handles multiple columns with provided tuples of values that always appear together:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: "loop_generator"
          repeatable: true
          source:
            value_source: "MULTIPLE_PROVIDED"
            values:
              - productcode: "P1"
                productname: "Product 1"
              - productcode: "P2"
                productname: "Product 2"
              - productcode: "P3"
                productname: "Product 3"
              - productcode: null
                productname: null

You can also specify a CSV file as the source of elements. The following example shows the complete configuration:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: "loop_generator"
          repeatable: true
          source:
            value_source: "CSV_FILE"
            path: src/e2e/resources/data_with_header_multi.csv
            null_values: null
            format:
              encoding: "UTF-8"
              delimiter: ","
              trim: true
              columns:
                column_accessor_type: NAME
                names: ["code", "name"]

Properties

type = loop_generator

repeatable: Boolean.
The list will repeat itself in a loop if necessary. Otherwise, an exception will be thrown when the record number exceeds the size of the list.

source: Source of elements.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes
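
The effect of the repeatable flag can be sketched in Python as follows. The function loop_values is hypothetical, and the exception type is chosen only for illustration.

```python
from itertools import cycle, islice

def loop_values(values, record_count, repeatable):
    """Return record_count values, cycling through `values` when repeatable."""
    if repeatable:
        return list(islice(cycle(values), record_count))
    if record_count > len(values):
        raise IndexError("record number exceeds the size of the list")
    return list(values[:record_count])

values = ["sent", "skipping_value", "received", None]
print(loop_values(values, 6, repeatable=True))
# ['sent', 'skipping_value', 'received', None, 'sent', 'skipping_value']
```

With repeatable set to false, asking for six records from this four-element list would raise an error instead of wrapping around.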

Constant XML generator

Generates a single XML value for the entire column.

Example:

    transformations:
      - columns: [ "xml_column" ]
        params:
          type: constant_xml
          value: "<root>test</root>"

Properties

type = constant_xml

value: String.
XML string value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

JSON Pointer Transformer

Transforms JSON value nodes indicated by JSON pointers. The rest of the values are kept as is.

Example:

    transformations:
      - columns: ["productspec"]
        params:
          type: "json_pointer_transformer"
          specifications:
            - pointers: [ "/sku" ]
              transformation:
                type: "format_preserving_hashing"
            - pointers: [ "/tags/0" ]
              transformation:
                type: "format_preserving_hashing"
              ignore_errors: true
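
How pointers such as "/sku" and "/tags/0" select nodes can be sketched in Python with a small JSON Pointer (RFC 6901) resolver. This is an illustration only: str.lower stands in for the nested transformation, the function names are hypothetical, and the reading of ignore_errors as "skip pointers that cannot be resolved" is an assumption.

```python
import json

def resolve_parent(document, pointer):
    """Follow a JSON pointer and return (container, key) of the target node."""
    if not pointer.startswith("/"):
        raise ValueError("pointer must start with '/'")
    # Unescape per RFC 6901: "~1" is "/", "~0" is "~" (in that order).
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]
    node = document
    for token in tokens[:-1]:
        node = node[int(token)] if isinstance(node, list) else node[token]
    last = tokens[-1]
    key = int(last) if isinstance(node, list) else last
    node[key]  # raises KeyError/IndexError if the target does not exist
    return node, key

def transform_pointers(document, pointers, transform, ignore_errors=False):
    """Apply transform to every node addressed by pointers, in place."""
    for pointer in pointers:
        try:
            container, key = resolve_parent(document, pointer)
            container[key] = transform(container[key])
        except (KeyError, IndexError, ValueError, TypeError):
            if not ignore_errors:
                raise
    return document

spec = json.loads('{"sku": "AB-123", "tags": []}')
transform_pointers(spec, ["/sku"], str.lower)
transform_pointers(spec, ["/tags/0"], str.lower, ignore_errors=True)  # no tags: skipped
print(json.dumps(spec))
```

All values not addressed by a pointer, here the empty "tags" array, are left as they are.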

Properties

type = json_pointer_transformer

specifications: array of JSON Pointer Transformer Specification.

locale: optional String.
Locale used to generate data from different geographical areas. Defaults to 'en-GB'.

auto_detect: Boolean.
When a JsonPointer is assigned to a particular key, it is automatically constructed based on the type of the column. This works for simple data types such as strings, numbers, and Boolean values. However, this method does not guarantee data consistency and may reduce the length of strings. If column or key consistency is required, the pointers must be set manually.

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Xml XPath Transformer

Transforms XML value nodes indicated by XML XPath. The rest of the values are kept as is.

Example:

    transformations:
      - columns: ["productspec"]
        params:
          type: "xpath_transformer"
          specifications:
            - queries: [ "/sku" ]
              transformation:
                type: "format_preserving_hashing"

Properties

type = xpath_transformer

specifications: array of Xml XPath Transformer Specification.

encoding: String.
Specifies the encoding format.

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No
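
The following sketch combines the encoding property with the ignore_errors setting of the Xml XPath Transformer Specification. The XPath query and the "UTF-8" value are assumptions chosen for illustration:

    transformations:
      - columns: ["productspec"]
        params:
          type: "xpath_transformer"
          encoding: "UTF-8"
          specifications:
            # Hypothetical query; adjust it to the structure of your XML documents
            - queries: [ "/product/sku" ]
              transformation:
                type: "format_preserving_hashing"
              # Leave the value unchanged instead of failing when the node is missing
              ignore_errors: true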

Void Generator

An auxiliary transformer that throws an error when called. It is used only when it is necessary to ignore the processing of the entire table.

Properties

type = void_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes
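
A minimal configuration sketch, assuming the generator takes no parameters other than its type (the column names are hypothetical):

    transformations:
      - columns: ["col_a", "col_b"]
        params:
          type: void_generator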

Scripting Transformer

The scripting transformer allows you to implement your own logic for both the GENERATION mode and the MASKING mode. Currently, only a JavaScript implementation is supported. The script for the GENERATION mode must define a lambda function that returns a dictionary in which the keys are column names and the values are the desired values of the record. If the transformer is applied to a single column, the value may be returned directly instead of a dictionary.

The script for the MASKING mode must define a lambda function with two arguments, ctx and originalRecord. The following example shows how to use a custom script for multiple columns in MASKING mode:

    transformations:
      - columns:
          - textdescription
          - htmldescription
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          script:
            code: |
              /**
               * @typedef { Object.<string, *> | * } Result
               *
               * @param {GenerationContext} ctx
               * @param {Record} originalRecord
               * @returns {Result}
               */
              (ctx, originalRecord) => {
                const dict = originalRecord.asMap();
                const textDescriptionColumn = columns.get(0);
                const htmlDescriptionColumn = columns.get(1);
                const descriptionWithoutSpaces = dict.get(textDescriptionColumn).trim();
                return { [textDescriptionColumn]: descriptionWithoutSpaces, [htmlDescriptionColumn]: descriptionWithoutSpaces };
              }

The script for the GENERATION mode should define a lambda function with the single argument ctx. The following example shows how to use a custom script for GENERATION:

    transformations:
      - columns:
          - credit_card
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          additional_properties:
            first_credit_card_digit: 4
          init_script:
            code: |
              /**
               * @returns {String}
               */
              function generateRandomCreditCardNumber() {
                let creditCardNumber = additionalProperties["first_credit_card_digit"]

                for (let i = 1; i < 16; i++) {
                  const digit = Math.floor(Math.random() * 10);
                  creditCardNumber += digit.toString();
                }

                return creditCardNumber;
              }
          script:
            code: |
              /**
               * @typedef { Object.<string, *> | * } Result
               *
               * @param {GenerationContext} ctx
               * @returns {Result}
               */
              (ctx) => generateRandomCreditCardNumber();

Properties

type = scripting_transformer

language: Scripting Language.

script: Script.

init_script: Script.

additional_properties: ScriptingAdditionalProperties.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes
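
When a GENERATION script is applied to several columns, it returns a dictionary keyed by column name. The following is a sketch only: the column names and constant values are made up, and it assumes that columns.get(i) returns the i-th configured column name, as in the MASKING example above:

    transformations:
      - columns:
          - first_name
          - last_name
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          script:
            code: |
              (ctx) => {
                // Assumption: columns.get(i) yields the i-th column listed above
                return { [columns.get(0)]: "Jane", [columns.get(1)]: "Doe" };
              }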

Arithmetic Generator

The output column is filled with values that are calculated from the given expression. A basic use case is converting one unit to another, e.g., converting Celsius to Fahrenheit:

    transformations:
      - columns: ["temperature_fahrenheit"]
        params:
          type: arithmetic_generator
          expression: "temperature_celsius * 9 / 5 + 32"

The generator can also include more complex expressions that incorporate other generators. Additionally, it supports basic date arithmetic based on epoch milliseconds:

    transformations:
      - columns: ["card_expiry_date"]
        params:
          type: arithmetic_generator
          expression: "card_issue_date + card_validity_period + 1d"
          variables:
            "card_validity_period":
              type: categorical_generator
              categories:
                value_source: PROVIDED
                values:
                  "730d": 0.5
                  "1461d": 0.5

Properties

type = arithmetic_generator

variables: List of named variables.

constants: List of named constants.

is_transforming_dates: optional Boolean.
In most cases, this does not need to be set manually. Indicates whether this transformation is applied to dates. If true, the generator will operate on time units (ms, s, m, h, d). If not set, the value will be inferred from the output column type.

expression: String.
The expression to be evaluated. It can contain arithmetic combinations of variables, constants, and (numeric) table columns, as well as numbers. Scientific notation is supported, e.g., 1.23e-4.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC, DATE

Supports multiple columns: No
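
The constants property can hold fixed numbers that are referenced by name in the expression. A minimal sketch, assuming a numeric column price_net exists in the table (both column names and the constant are hypothetical):

    transformations:
      - columns: ["price_gross"]
        params:
          type: arithmetic_generator
          expression: "price_net * vat_multiplier"
          constants:
            # Named constant referenced in the expression above
            "vat_multiplier": 1.2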

Sieve Generator

The output column is filled with values that are sampled from a given distribution and rejected if they do not satisfy a given condition. The process is repeated until a valid value is found.

Example:

    transformations:
      - columns: ["start"]
        params:
          type: date_generator
      - columns: ["finish"]
        params:
          type: sieve_generator
          condition: "finish > start + 1h"
          generator:
            type: date_generator

If the provided condition is too narrow, the generator may struggle to find a valid value. In such cases, it is recommended to adjust the condition or use another generator, such as arithmetic_generator.

Properties

type = sieve_generator

generator: Transformations.

condition: String.
The condition that the generated value must satisfy. It should contain the current column name and can include other columns and numbers. Scientific notation is supported, e.g., 1.23e-4. Complex conditions can be expressed using boolean algebra, e.g.: column1 > 0 && !(column2 < column1 || column3 >= column1).

max_discard_percentage: Number (double).
If the number of discarded values exceeds the given percentage of the expected number of records, the generator will terminate the transformation. By default, no more than 2/3 of the expected number of records can be discarded.

is_condition_on_dates: optional Boolean.
Indicates whether the condition is applied to dates. If true, the generator will operate on time units (ms, s, m, h, d). If not set, the value will be inferred from the output column type.

constants: List of named constants.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC, DATE

Supports multiple columns: Yes
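
The sieve can also wrap a numeric generator and use a boolean condition. The sketch below assumes an existing numeric column price and a hypothetical output column discount. Values sampled from the nested categorical_generator are rejected until they are positive and smaller than price:

    transformations:
      - columns: ["discount"]
        params:
          type: sieve_generator
          condition: "discount > 0 && discount < price"
          generator:
            type: categorical_generator
            categories:
              value_source: PROVIDED
              type: NUMERIC
              values:
                "5": 0.4
                "15": 0.4
                "25": 0.2

If price is often smaller than every candidate value, many samples are discarded and the limit described under max_discard_percentage may be reached.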

Categories source configuration

Used in: categories

optional Object.

Depending on value_source property value, can be one of the following:

PROVIDED

Config categories source

CSV_FILE

CSV file categories source

MULTIPLE_PROVIDED

Config categories source

MULTIPLE_CSV_FILE

CSV file categories source for multiple columns

Numeric type

Used in: numeric_type

optional String.
Type of numbers used by a generator

Enum values

INT
LONG
DOUBLE
FLOAT
BIG_DECIMAL
BIG_INTEGER
SHORT
BYTE
UNSIGNED_BYTE
UNSIGNED_INTEGER
UNSIGNED_LONG
UNSIGNED_SHORT

Foreign key distribution

Used in: distribution

String.
Distributions:

POISSON (default): generates parent-child relations based on the Poisson distribution, where lambda represents the ratio of parent to child count.

ROUND_ROBIN: assigns children to parents using a round-robin algorithm.

ORIGINAL: preserves the original reference ratio. Note: this option may require more computational resources for recalculation of the model, and the model may generate a distribution that does not perfectly match the original one.

Enum values

POISSON
ROUND_ROBIN
ORIGINAL

Parent data mode

Used in: parent_data_mode

optional String.
What part of parent data to consider for the child table processing. Default is ALL.

Enum values

NEW
OLD
ALL

Column transformation parameters

Used in: transformations

Object.
List of column names associated with Transformation parameters.

Properties

columns: array of String.
List of columns that are affected by this generator.

params: Transformations.

Hashing group

Used in: groups

Object.
A pair of a selector and a list of alphabets. The selector is used to choose characters from the input string; the alphabets are the sets of characters used to replace the selected ones.

Properties

selector: Hashing group selector.

alphabets: array of Alphabet.
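
As a sketch, two hashing groups could be configured as follows. It assumes that groups is a property of the format_preserving_hashing transformation (this section only states that the object is used in groups), and the column name is hypothetical:

    transformations:
      - columns: ["license_plate"]
        params:
          type: format_preserving_hashing
          groups:
            # Digits in the input are replaced with digits
            - selector:
                type: digits
              alphabets:
                - type: digits
            # Upper-case letters are replaced with upper-case letters or digits
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters
                - type: digits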

Format preserving hashing filter

Used in: filter

optional Object.

Depending on type property value, can be one of the following:

first

First N characters

last

Last N characters

characters

Specified characters

substring

Specified substring

regex

Regex filter
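
For illustration, a filter that restricts masking to the last four characters might look like the following. The column name is hypothetical, and the sketch assumes filter is set directly under the format_preserving_hashing parameters:

    transformations:
      - columns: ["phone"]
        params:
          type: format_preserving_hashing
          filter:
            type: last
            n: 4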

Alphabet

Used in: alphabets

Object.

Depending on type property value, can be one of the following:

digits

Digits 0-9

lower_letters

English letters in lowercase [a-z]

upper_letters

English letters in UPPERCASE [A-Z]

custom

Custom alphabet

Value length exceeded mode

Used in: length_exceeded_mode

optional String.
Action to take when a value exceeds the column length. Modes:

IGNORE: error if the value exceeds the column length.

TRUNCATE (default): truncate the value to the field length.

Enum values

IGNORE
TRUNCATE

Action

Used in: action

String.

Enum values

KEEP
MASK

Position

Used in: which

String.

Enum values

FIRST
LAST

Source of elements

Used in: source

optional Object.

Depending on value_source property value, can be one of the following:

PROVIDED

Element’s source provided from a configuration

CSV_FILE

Element’s source provided from a CSV file

MULTIPLE_PROVIDED

Multi columns element’s source provided from a configuration

JSON Pointer Transformer Specification

Used in: specifications

Object.

Properties

pointers: array of String.
JSON Pointer (specified by RFC6901)

transformation: Transformations.

ignore_errors: Boolean.
Controls the behaviour when no JSON node is found at the pointer or the node has a type incompatible with the specified transformer. If this setting is true, the found JSON node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.

relative: Boolean.
The relative mode changes how JSON pointers work. Instead of always starting from the root, the pointer works from the leaf of a property. It doesn’t need to end at the root of the JSON. For example, the pointer /user_id will match multiple paths like: /user_id, /account/user_id, /customers/0/user_id. This means one pointer can match multiple locations, unlike the regular mode, which only allows a one-to-one match.
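
A sketch of the relative mode, reusing the /user_id pointer from the description above. The column name and the nested transformation type are taken from the JSON Pointer Transformer example and are only illustrative:

    transformations:
      - columns: ["productspec"]
        params:
          type: "json_pointer_transformer"
          specifications:
            # Matches /user_id, /account/user_id, /customers/0/user_id, and so on
            - pointers: [ "/user_id" ]
              relative: true
              transformation:
                type: "format_preserving_hashing"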

Xml XPath Transformer Specification

Used in: specifications

Object.

Properties

queries: array of String.
XPath (specified by https://www.w3.org/TR/xpath-31/)

transformation: Transformations.

ignore_errors: Boolean.
Controls the behaviour when no XML node is found at the xPath or the node has a type incompatible with the specified transformer. If this setting is true, the found XML node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.

Scripting Language

Used in: language

String.

Enum values

JAVASCRIPT

Script

Used in: script, init_script

optional Object.
The script should define a lambda function to be called on every record or row, depending on the chosen method. The init_script is executed once on start. It can be helpful for defining variables and functions that are available in the main script.

Properties

code: optional String.
Script code

file: optional String.
Script file location. In the case of a local file system, the path can be absolute or relative to the application process’s working directory (not to be confused with working directory). The script can be located on the local file system, AWS S3, or Google Storage.

To be able to load scripts from S3 the property TDK_AWS_ENABLED==true should be set. More details can be found here.

The property TDK_GCP_ENABLED==true allows loading scripts from Google Storage. More details can be found here.
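
Instead of inline code, a script can be referenced by file. A minimal sketch with a hypothetical relative path on the local file system:

    transformations:
      - columns:
          - textdescription
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          script:
            # Hypothetical path, resolved against the process's working directory
            file: "scripts/mask_descriptions.js"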

ScriptingAdditionalProperties

Used in: additional_properties

optional map of String keys to Object.
Additional properties to be used in the script and init_script. The dictionary variable additionalProperties is available from scripts.

List of named variables

Used in: variables

optional map of String keys to Transformations.

List of named constants

Used in: constants

optional map of String keys to Number.

Config categories source

Config source involves configuring categories and weights directly within the configuration.

Properties

value_source = PROVIDED

type: Dictionary data type.

values: Categories dictionary.

null_values: optional array of String.
The values that should be treated as NULL values. Default is ["null"]

CSV file categories source

Properties

value_source = CSV_FILE

type: Dictionary data type.

path: String.
The path to file on your file system.

format: Parser configuration.

null_values: optional array of String.
The values that should be treated as NULL values. Default is ["null"]

Config categories source

Config source involves configuring categories and weights for multiple columns directly within the configuration.

Properties

value_source = MULTIPLE_PROVIDED

column_types: Columns data types.

null_values: optional array of String.
The values that should be treated as NULL values. Default is ["null"]

category_values: array of Config categories source.
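
A sketch of a complete MULTIPLE_PROVIDED source for two columns. It assumes that each entry's values maps a column name to the value for that column, which this section does not spell out. The product codes, names, and weights are invented for illustration:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_PROVIDED
            null_values: ["nil"]
            category_values:
              # Each tuple is sampled as a whole, so the code and name stay together
              - values:
                  productcode: "P-001"
                  productname: "Road bike"
                weight: 0.7
              - values:
                  productcode: "P-002"
                  productname: "nil"
                weight: 0.3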

CSV file categories source for multiple columns

Properties

value_source = MULTIPLE_CSV_FILE

column_types: Columns data types.

null_values: optional array of String.
The values that should be treated as NULL values. Default is ["null"]

path: String.
The path to file on your file system.

format: Parser configuration for multiple columns.

Hashing group selector

Used in: selector

Object.

Depending on type property value, can be one of the following:

digits

Digits 0-9

lower_letters

Lower-case alphabetic characters

upper_letters

Upper-case alphabetic characters

regex

Custom regex pattern

word_characters

Word characters (equivalent of regex '\w+' as described at this link)

First N characters

Mask only first N characters of the input string

Properties

type = first

n: Integer (int32).

Last N characters

Mask only last N characters of the input string

Properties

type = last

n: Integer (int32).

Specified characters

Mask only specified characters of the input string

Properties

type = characters

characters: String.

ignore_case: Boolean.

Specified substring

Mask only specified substring of the input string

Properties

type = substring

substring: String.

ignore_case: Boolean.

Regex filter

Mask only characters filtered by regex

Properties

type = regex

pattern: String.

ignore_case: Boolean.

Digits 0-9

Properties

type = digits

English letters in lowercase [a-z]

Properties

type = lower_letters

English letters in UPPERCASE [A-Z]

Properties

type = upper_letters

Custom alphabet

A custom alphabet, which can consist of characters, Unicode blocks, and Unicode ranges. In total, it can contain from 1 to 2^16 characters.

Properties

type = custom

parts: optional array of Custom alphabet part.

Element’s source provided from a configuration

Source of elements

Properties

value_source = PROVIDED

column_type: Dictionary data type.

values: array of optional String.
The list of repeatable elements
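
A fragment sketching this source on its own. The transformation that owns the source property is not covered in this section, so only the source object is shown, and the values are invented:

    source:
      value_source: PROVIDED
      column_type: STRING
      values: ["red", "green", "blue"]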

Element’s source provided from a CSV file

Source of elements

Properties

value_source = CSV_FILE

path: String.
The path to file on your file system.

format: Parser configuration.

column_types: Columns data types.

null_values: optional array of String.
The values that should be treated as NULL values. Default is ["null"]

Multi columns element’s source provided from a configuration

Source of elements

Properties

value_source = MULTIPLE_PROVIDED

column_types: Columns data types.

values: array of Elements dictionary for multiple columns.

Dictionary data type

Used in: type, column_type

optional String.
Data type of categories. Modes:

STRING (default): interpret values as strings.

BOOLEAN: interpret values as booleans.

NUMERIC: interpret values as doubles.

Enum values

STRING
BOOLEAN
NUMERIC

Categories dictionary

Used in: values

map of String keys to Number.
The map can store keys in one of three formats: STRING, BOOLEAN or NUMERIC. The weights must be equal to or greater than zero.

Example of formats:

    "23": 123
    "owl": 534
    "true": 532

To disable one of the categories, simply set its weight value to zero.

Parser configuration

Used in: format

optional Object.

Properties

encoding: String.
The character encoding of the file.

delimiter: String.
You have the option to specify a custom delimiter.

trim: Boolean.
The trim option removes leading and trailing spaces from each cell.

columns: Column value extract method.

Columns data types

Used in: column_types

optional map of String keys to Dictionary data type.

Config categories source

Used in: category_values

Object.
The values for multiple columns of single category.

Properties

values: Categories dictionary for multiple columns.

weight: Number.

Parser configuration for multiple columns

Used in: format

optional Object.

Properties

encoding: String.
The character encoding of the file.

delimiter: String.
You have the option to specify a custom delimiter.

trim: Boolean.
The trim option removes leading and trailing spaces from each cell.

columns: Column value extract method for multiple columns.

Digits 0-9

Properties

type = digits

Lower-case alphabetic characters

Properties

type = lower_letters

Upper-case alphabetic characters

Properties

type = upper_letters

Custom regex pattern

Properties

type = regex

pattern: String.

Word characters (equivalent of regex '\w+' as described at this link)