Transformations

optional Object.
Parameters of a transformation. All parameters have a type key with the type name of the transformation, and other parameters that are transformation-specific.

Depending on the value of the type property, the parameters can be one of the following:

Key                         Modes                      Data types  Multiple columns
--------------------------  -------------------------  ----------  ----------------
categorical_generator       GENERATION, MASKING, KEEP  ANY         Yes
conditional_generator       GENERATION, MASKING, KEEP  ANY         Yes
continuous_generator        GENERATION, MASKING, KEEP  NUMERIC     No
quantile_generator          GENERATION, MASKING, KEEP  NUMERIC     No
foreign_key_generator       GENERATION, MASKING, KEEP  ANY         Yes
unique_generator            GENERATION, MASKING, KEEP  ANY         Yes
format_preserving_hashing   MASKING, KEEP              TEXT        No
formatted_string_generator  GENERATION, MASKING, KEEP  ANY         No
int_sequence_generator      GENERATION, MASKING, KEEP  NUMERIC     No
string_sequence_generator   GENERATION, MASKING, KEEP  TEXT        No
noising                     MASKING, KEEP              NUMERIC     No
null_generator              GENERATION, MASKING, KEEP  ANY         Yes
passthrough                 MASKING, KEEP              ANY         Yes
person_generator            GENERATION, MASKING, KEEP  TEXT        Yes
address_generator           GENERATION, MASKING, KEEP  TEXT        Yes
finance_generator           GENERATION, MASKING, KEEP  TEXT        Yes
redaction                   MASKING, KEEP              TEXT        No
unique_hashing              MASKING, KEEP              NUMERIC     No
date_time_unique_hashing    MASKING, KEEP              DATE        No
date_generator              GENERATION, MASKING, KEEP  DATE        No
uuid_generator              GENERATION, MASKING, KEEP  ANY         No
constant_numeric            GENERATION, MASKING, KEEP  NUMERIC     No
constant_string             GENERATION, MASKING, KEEP  TEXT        No
constant_date               GENERATION, MASKING, KEEP  DATE        No
constant_boolean            GENERATION, MASKING, KEEP  BOOLEAN     No
loop_generator              GENERATION, MASKING, KEEP  ANY         Yes
constant_xml                GENERATION, MASKING, KEEP  ANY         No
json_pointer_transformer    MASKING, KEEP              ANY         No
xpath_transformer           MASKING, KEEP              ANY         No
void_generator              GENERATION, MASKING, KEEP  ANY         Yes
scripting_transformer       GENERATION, MASKING, KEEP  ANY         Yes

Categorical generator

Randomly samples values from a given mapping of categories to weights. Weights and categories can be provided from various data sources or learned from data.

The following example demonstrates how to accomplish this:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            value_source: PROVIDED
            values:
              "sent": 0.5
              "received": 0.4
              "skipping_value": 0.0
              "null": 0.1

You can also specify a CSV file as the source of categories. The following example shows the complete configuration:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            value_source: CSV_FILE
            path: src/e2e/resources/data_with_header.csv
            null_values: ["null", ""]
            format:
              columns:
                column_accessor_type: NAME
                categories: "title"
                weights: "rank"
              encoding: "UTF-8"
              delimiter: ","
              trim: true

For simple scenarios, a minimal configuration that relies on the defaults may be sufficient. The following example illustrates this approach:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            value_source: CSV_FILE
            path: src/e2e/resources/data.csv

This transformation can be applied to several columns at once. The following example shows how the generator handles multiple columns with provided tuples of categories that always appear together:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_PROVIDED
            null_values: ["nil"]
            category_values:
              - values:
                  productcode: "P1"
                  productname: "Product 1"
                weight: 0.5
              - values:
                  productcode: "P2"
                  productname: "Product 2"
                weight: 0.3
              - values:
                  productcode: "P3"
                  productname: "Product 3"
                weight: 0.2
              - values:
                  productcode: "nil"
                  productname: "nil"
                weight: 0.5

You can also specify a CSV file as the source of categories for multiple columns:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_CSV_FILE
            path: src/e2e/resources/data_multi.csv

Example of advanced configuration with multiple columns and multiple categories:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: categorical_generator
          categories:
            value_source: MULTIPLE_CSV_FILE
            path: src/e2e/resources/data_with_header_multi.csv
            null_values: ["null", ""]
            format:
              columns:
                column_accessor_type: NAME
                categories:
                  productcode: "code"
                  productname: "name"
                weights: "rank"
              encoding: "UTF-8"
              delimiter: ","
              trim: true

This generator normalizes the weights so that they sum to 1. If the sum of the weights exceeds the range of the double format, use weights with a smaller scale factor.
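
The normalization and weighted sampling described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the product's implementation: the function names normalize and sample are hypothetical, and the weights are taken from the first example in this section.

```python
import random

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {category: weight / total for category, weight in weights.items()}

def sample(weights, rng):
    """Draw one category with probability proportional to its weight."""
    probabilities = normalize(weights)
    categories = list(probabilities)
    return rng.choices(categories, weights=[probabilities[c] for c in categories], k=1)[0]

# Weights from the PROVIDED example; a zero weight means the value is never drawn.
weights = {"sent": 0.5, "received": 0.4, "skipping_value": 0.0, "null": 0.1}
probabilities = normalize(weights)
value = sample(weights, random.Random(0))
```

Weights that do not sum to 1, such as 0.5, 0.3, 0.2 and 0.5 in the multiple-column example, are simply rescaled by their total of 1.5.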

Properties

  • type = categorical_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Conditional generator

Uses one of two transformations (generators), depending on the value of a given column of the parent table. For example, with the conditional generator you can apply different generators depending on the value of the "gender" column of the parent table.

Example:

    transformations:
      - columns: [ "status" ]
        params:
          type: conditional_generator
          conditional_table: "public.delivery"
          conditional_column: "status"
          conditional_value: "DONE"
          if_true:
            type: constant_string
            value: "CLOSED"
          if_false:
            type: constant_string
            value: "OPEN"

Properties

  • type = conditional_generator

  • conditional_column: String.
    Parent column.

  • conditional_table: optional String.
    Parent table.

  • conditional_value: String.
    Value to be compared with. If the value of the parent column is equal to conditional_value, then if_true generator is used, otherwise if_false generator.
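
The selection rule can be sketched in Python as follows. This is a hedged illustration only: the function choose_generator is hypothetical, the generators are modelled as plain callables, and comparing the parent value as a string is an assumption.

```python
def choose_generator(parent_row, conditional_column, conditional_value, if_true, if_false):
    """Return if_true when the parent column equals conditional_value, else if_false."""
    # Assumption: the parent value is compared in its string form.
    matches = str(parent_row[conditional_column]) == conditional_value
    return if_true if matches else if_false

# Stand-ins for the two constant_string generators of the example above.
closed = lambda: "CLOSED"
opened = lambda: "OPEN"

done_status = choose_generator({"status": "DONE"}, "status", "DONE", closed, opened)()
other_status = choose_generator({"status": "SHIPPED"}, "status", "DONE", closed, opened)()
```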

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Continuous generator

Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.

Example:

    transformations:
      - columns:
          - "amount"
        params:
          type: continuous_generator
          mean: 354.21
          std: 98.96
          min: 0.0

Properties

  • type = continuous_generator

  • mean: optional Number (double).
    Mean of the sampled distribution

  • std: optional Number (double).
    Standard Deviation

  • min: optional Number (double).
    Minimum value

  • max: optional Number (double).
    Maximum value

  • round: Integer.
    If given, output data will be rounded to this number of digits
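
How these parameters might interact can be sketched in Python. The source does not name the distribution, so the Gaussian used here is an assumption, as is clamping out-of-range draws to min and max; the function sample_continuous is hypothetical.

```python
import random

def sample_continuous(rng, mean, std, minimum=None, maximum=None, digits=None):
    """Draw one value, clamp it to the optional bounds, then round it."""
    value = rng.gauss(mean, std)  # assumption: normal distribution
    if minimum is not None:
        value = max(value, minimum)  # assumption: clamp instead of redrawing
    if maximum is not None:
        value = min(value, maximum)
    return round(value, digits) if digits is not None else value

rng = random.Random(7)
amounts = [sample_continuous(rng, 354.21, 98.96, minimum=0.0, digits=2) for _ in range(1000)]
```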

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Quantile generator

Given a list of probabilities (hist) and bin edges (bin_edges), the output data is sampled from a mixture of uniform distributions, where uniform distribution i is chosen with probability hist[i] and spans the interval from bin_edges[i] to bin_edges[i + 1]. bin_edges therefore has one more element than hist. If the parameters are not given, they are fitted from the original data.

Example:

    transformations:
      - columns: ["amount"]
        params:
          type: quantile_generator
          hist: [0.5, 0.2, 0.3]
          bin_edges: [0.1, 0.15, 0.3, 0.45]
          numeric_type: DOUBLE
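
The sampling scheme described above (pick a bin by its probability, then draw uniformly inside it) can be sketched in Python. The function name sample_quantile is hypothetical; the values are those of the example.

```python
import random

def sample_quantile(hist, bin_edges, rng):
    """Sample from a mixture of uniform distributions defined by hist and bin_edges."""
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must have exactly one more element than hist")
    # Choose bin i with probability proportional to hist[i].
    i = rng.choices(range(len(hist)), weights=hist, k=1)[0]
    # Draw uniformly between the edges of the chosen bin.
    return rng.uniform(bin_edges[i], bin_edges[i + 1])

rng = random.Random(42)
samples = [sample_quantile([0.5, 0.2, 0.3], [0.1, 0.15, 0.3, 0.45], rng) for _ in range(1000)]
```

With these values, roughly half of the samples fall between 0.1 and 0.15, even though that bin is the narrowest.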

Properties

  • type = quantile_generator

  • hist: optional array of Number (double).
    Probabilities of each uniform distribution.

  • bin_edges: optional array of Number (double).
    Bin edges of each uniform distribution.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Foreign key generator

Fills columns with the primary key values of a random row of the parent table.

Normally this generator is created implicitly wherever data generation for tables related by foreign keys is needed, so you should not need to set it up explicitly.

Properties

  • type = foreign_key_generator

  • referred_schema: optional String.

  • referred_table: optional String.

  • referred_fields: optional array of String.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Unique generator

This generator is intended for the case where primary key values are part of the foreign key.

Normally this generator is created implicitly wherever data generation for tables related by foreign keys is needed, so you should not need to set it up explicitly.

Properties

  • type = unique_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Format preserving hashing

A hash transformation is applied to each character of the given text that belongs to a configured group, so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.

Examples:

Default configuration:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: digits
              alphabets:
                - type: digits
            - selector:
                type: lower_letters
              alphabets:
                - type: lower_letters
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters

Mask only last 5 characters:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: "last"
            n: 5

Mask only substring ignoring case:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: substring
            substring: sub
            ignore_case: true

Mask only a set of characters ignoring case:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: characters
            characters: "abc"
            ignore_case: true

Mask characters selected by regex with a custom alphabet:

    transformations:
      - columns: ["phone_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: regex
                pattern: "[123]"
              alphabets:
                - type: custom
                  parts:
                    - type: characters
                      characters: "456"
                    - type: characters
                      characters: "789"
                    - type: unicode_block
                      name: LATIN_EXTENDED_D
                    - type: unicode_block
                      name: "Latin Extended-A"
                    - type: unicode_range
                      from: 0x0D00
                      to: 0x0D7F

Properties

  • type = format_preserving_hashing

  • groups:
    Hashing groups to apply on top of the specified filter. Multiple groups can be configured. In that case, the groups are matched against each region of the filtered value in the order in which they are specified in the configuration. Once a group matches, its alphabet is used for the transformation, and no other groups are tried for that region. This implies that the most specific hashing groups must be specified first. An unspecified parameter or null is equivalent to the following:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: digits
              alphabets:
                - type: digits
            - selector:
                type: lower_letters
              alphabets:
                - type: lower_letters
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters
            - selector:
                type: word_characters
              alphabets:
                - type: digits
                - type: lower_letters
                - type: upper_letters

  • locale: optional String.
    Selects the alphabet used when a hashing group is not explicitly specified, so that the output can contain letters from a different alphabet. The default value is en-GB, representing the Latin alphabet.

    Supported locales:

    • en, ca, cs, da, de, es, fi, fr, hu, in, it, nb, nl, pl, pt, sk, sv, tr, vi - The Latin alphabet characters

    • ar, fa - The ARABIC Unicode block

    • bg, ru, uk - The CYRILLIC Unicode block

    • he - The HEBREW Unicode block

    • ja - The HIRAGANA, KATAKANA, CJK_UNIFIED_IDEOGRAPHS Unicode blocks

    • ko - The HANGUL_JAMO Unicode block

    • zh - The CJK_UNIFIED_IDEOGRAPHS Unicode block

  • length_threshold: optional Integer (int32).
    The length threshold at which the masker is applicable. If any value in the column exceeds this length, execution will not start and will raise a validation error.
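
The first-match rule for groups can be sketched in Python. This is a minimal illustration and not the product's algorithm: the HMAC-based character selection, the secret key and the function mask are all assumptions introduced here, and only three ASCII groups are modelled.

```python
import hashlib
import hmac
import string

# Ordered groups as (selector, alphabet) pairs; the first matching group wins.
GROUPS = [
    (lambda c: c in string.digits, string.digits),
    (lambda c: c in string.ascii_lowercase, string.ascii_lowercase),
    (lambda c: c in string.ascii_uppercase, string.ascii_uppercase),
]

def mask(text, key):
    """Replace each matched character with one from its group's alphabet."""
    out = []
    for position, char in enumerate(text):
        for matches, alphabet in GROUPS:
            if matches(char):
                # Assumption: a keyed digest of the input and position picks the output.
                message = f"{position}:{text}".encode("utf-8")
                digest = hmac.new(key, message, hashlib.sha256).digest()
                out.append(alphabet[int.from_bytes(digest[:8], "big") % len(alphabet)])
                break
        else:
            out.append(char)  # no group matched: keep the character as is
    return "".join(out)

masked = mask("AB-12 cd", b"hypothetical-secret")
```

Digits stay digits, letters keep their case, and separators such as "-" pass through unchanged, which is what preserves the format.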

Compatible modes: MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Formatted string generator

Generates a string column based on a given pattern. If the pattern is not given, random characters are generated with a length similar to that of the original column.

Example:

    transformations:
      - columns:
          - "phone_number"
        params:
          type: formatted_string_generator
          pattern: "\\+44[0-9]{10}"

Properties

  • type = formatted_string_generator

  • pattern: optional String.
    Regular expression pattern from which the data is sampled.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Integers sequence generator

Generates a sequence of integers for an ID column that must contain unique values.

Properties

  • type = int_sequence_generator

  • start_from: optional Integer.
    Where to start the sequence from; defaults to 0. If the generator is used on existing data, set this to the maximum of the existing data plus 1.

Example:

    transformations:
      - columns:
          - "user_id"
        params:
          type: int_sequence_generator
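
The rule for choosing start_from on existing data can be sketched in Python. The variable names and the sample IDs are hypothetical.

```python
from itertools import count, islice

existing_ids = [3, 7, 12]           # hypothetical IDs already present in the column
start_from = max(existing_ids) + 1  # maximum of the existing data plus 1
new_ids = list(islice(count(start_from), 3))
```

Starting one past the existing maximum keeps the generated values from colliding with the IDs that are already in use.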

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

String sequence generator

Generates a sequence of strings for an ID column that must contain unique values. The strings consist of uppercase letters and digits.

Example:

    transformations:
      - columns: ["country_id"]
        params:
          type: string_sequence_generator

Properties

  • type = string_sequence_generator

  • length: optional Integer.
    Maximum length of the column; extracted from the database DDL if not given.

  • start_from: optional String.
    Where to start the sequence from; defaults to an empty string. If the generator is used on existing data, set this to the maximum of the existing data with a shift.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Noising transformation

Adds Laplacian noise to the input column to protect privacy while keeping the output values similar to the input.

Example:

    transformations:
      - columns: ["product_price"]
        params:
          type: noising
          sensitivity: 23.47
          min: 0

Properties

  • type = noising

  • sensitivity: optional Number (double).
    Amount of noise to be added

  • min: optional Number (double).
    Hard minimum. Output values smaller than this are truncated to it.

  • max: optional Number (double).
    Hard maximum. Output values greater than this are truncated to it.
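
A minimal Python sketch of Laplacian noising with truncation follows. Using sensitivity directly as the scale of the Laplace distribution is an assumption, since the source only calls it the amount of noise; the function add_laplace_noise is hypothetical.

```python
import random

def add_laplace_noise(value, sensitivity, rng, minimum=None, maximum=None):
    """Add zero-centred Laplace noise, then truncate to the optional bounds."""
    scale = sensitivity  # assumption: sensitivity is used as the Laplace scale
    # The difference of two exponential draws follows a Laplace distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    noisy = value + noise
    if minimum is not None:
        noisy = max(noisy, minimum)
    if maximum is not None:
        noisy = min(noisy, maximum)
    return noisy

rng = random.Random(1)
prices = [add_laplace_noise(10.0, 23.47, rng, minimum=0) for _ in range(1000)]
```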

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Null generator

The output column is filled with null values

Example:

    transformations:
      - columns: ["empty_column"]
        params:
          type: null_generator

Properties

  • type = null_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Passthrough transformation

The output data is equal to the input; no transformation is applied.

Example:

    transformations:
      - columns: ["customer_number", "plate"]
        params:
          type: passthrough

Properties

  • type = passthrough

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Person generator

Generate personal fields (e.g., name, surname, title) and keep them consistent across columns.

Available templates are:

  • ${email}

  • ${title}

  • ${first_name}

  • ${male_first_name}

  • ${female_first_name}

  • ${last_name}

  • ${full_name}

  • ${username}

  • ${company}

  • ${phone_national}

  • ${phone_international}

  • ${ssn}

Supported locales:

ar, bg, ca, ca-CAT, cs, da-DK, de, de-AT, de-CH, en, en-AU, en-CA, en-GB, en-IND, en-MS, en-NEP, en-NG, en-NZ, en-PAK, en-SG, en-UG, en-US, en-ZA, en-PH, es, es-MX, fa, fi-FI, fr, he, hu, in-ID, it, ja, ko, nb-NO, nl, pl, pt, pt-BR, ru, sk, sv, sv-SE, tr, uk, vi, zh-CN, zh-TW

Example for several columns:

    transformations:
      - columns: ["first_name", "last_name"]
        params:
          type: person_generator
          column_templates: ["${first_name}", "${last_name}"]

Example for a single column:

    transformations:
      - columns: ["full_name"]
        params:
          type: person_generator
          column_templates: ["${first_name} ${last_name}"]

Properties

  • type = person_generator

  • column_templates: array of String.
    For each column, the template to be used to generate personal data

  • consistent_with_column: String.
    If given, the column with which the generated values must be consistent. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name. The "self" value means consistency with the source value.

  • locale: optional String.
    Change this parameter to generate names from different geographical areas. Defaults to 'en-GB', which corresponds to British names.

  • column_lengths: optional array of Integer.
    Maximum allowable length of each column. Ignored when length_exceeded_mode is "IGNORE".
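
One way to picture consistent_with_column is a deterministic mapping from the key value to a generated name. The Python sketch below is an assumption about how such consistency could be achieved, not the product's mechanism; the name list, the secret and the function consistent_first_name are hypothetical.

```python
import hashlib

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]  # hypothetical sample data

def consistent_first_name(key_value, secret="hypothetical-secret"):
    """Map the same key value (e.g. a user_id) to the same first name every time."""
    digest = hashlib.sha256(f"{secret}:{key_value}".encode("utf-8")).digest()
    return FIRST_NAMES[int.from_bytes(digest[:8], "big") % len(FIRST_NAMES)]

name_a = consistent_first_name(42)
name_b = consistent_first_name(42)
```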

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Address generator

Generate address fields (e.g., street, zip code) and keep them consistent across columns. Available templates are:

  • ${zip_code}

  • ${country}

  • ${city}

  • ${street_name}

  • ${house_number}

  • ${flat_number}

  • ${full_address}

  • ${street_address}

  • ${region}

  • ${latitude}

  • ${longitude}

  • ${coordinates}

  • ${time_zone}

Supported locales:

ar, bg, ca, ca-CAT, cs, da-DK, de, de-AT, de-CH, en, en-AU, en-CA, en-GB, en-IND, en-MS, en-NEP, en-NG, en-NZ, en-PAK, en-SG, en-UG, en-US, en-ZA, en-PH, es, es-MX, fa, fi-FI, fr, he, hu, in-ID, it, ja, ko, nb-NO, nl, pl, pt, pt-BR, ru, sk, sv, sv-SE, tr, uk, vi, zh-CN, zh-TW

Example for several columns:

    transformations:
      - columns: [ "street_name", "zip_code" ]
        params:
          type: address_generator
          column_templates: [ "${street_name}", "${zip_code}" ]

Example for a single column:

    transformations:
      - columns: ["address"]
        params:
          type: address_generator
          column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]

Properties

  • type = address_generator

  • column_templates: array of String.
    For each column, the template to be used to generate address data

  • consistent_with_column: String.
    If given, the column with which the generated values must be consistent. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street. The "self" value means consistency with the source value.

  • locale: optional String.
    Change this parameter to generate addresses from different geographical areas. Defaults to 'en-GB', which corresponds to addresses in Great Britain.

  • column_lengths: optional array of Integer.
    Maximum allowable length of each column. Ignored when length_exceeded_mode is "IGNORE".

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Finance generator

Generate financial data.

Available templates:

  • ${credit_card}

  • ${bic}

  • ${iban}

  • ${nasdaq_ticker}

  • ${nyse_ticker}

  • ${stock_market}

  • ${us_routing_number}

The template credit_card (without the card type qualification) will result in a random type being picked.

The credit card type can be configured using the following templates:

  • ${credit_card.visa}

  • ${credit_card.mastercard}

  • ${credit_card.discover}

  • ${credit_card.american_express}

  • ${credit_card.diners_club}

  • ${credit_card.jcb}

  • ${credit_card.switch}

  • ${credit_card.solo}

  • ${credit_card.dankort}

  • ${credit_card.forbrugsforeningen}

  • ${credit_card.laser}

Example:

    transformations:
      - columns: [ "credit_card" ]
        params:
          type: finance_generator
          column_templates: [ "${credit_card.visa}" ]

Properties

  • type = finance_generator

  • column_templates: array of String.
    For each column, the template to be used to generate financial data

  • consistent_with_column: String.
    If given, the column with which the generated values must be consistent. The "self" value means consistency with the source value.

  • column_lengths: optional array of Integer.
    Maximum allowable length of each column. Ignored when length_exceeded_mode is "IGNORE".

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Redaction masker

Some characters in the input string are replaced with the same masking character, producing partially masked text in the output.

Example:

    transformations:
      - columns: ["credit_card"]
        params:
          type: redaction
          action: MASK
          which: FIRST
          count: 4
          mask_with: "#"

Properties

  • type = redaction

  • count: Integer.
    Number of characters to be masked or kept; defaults to 4.

  • mask_with: String.
    Character used to mask values; defaults to *.
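
The effect of the example configuration can be sketched in Python. Only action: MASK and which: FIRST appear in the source; the KEEP and LAST values handled below are assumptions suggested by the phrase "masked or kept", and the function redact is hypothetical.

```python
def redact(text, action="MASK", which="FIRST", count=4, mask_with="*"):
    """Mask (or keep) the first or last `count` characters of text."""
    count = min(count, len(text))
    if which == "FIRST":
        selected = set(range(count))
    elif which == "LAST":  # assumption: not confirmed by the source
        selected = set(range(len(text) - count, len(text)))
    else:
        raise ValueError(f"unsupported which: {which!r}")
    if action not in ("MASK", "KEEP"):  # KEEP is an assumption
        raise ValueError(f"unsupported action: {action!r}")
    mask_selected = action == "MASK"
    return "".join(
        mask_with if (index in selected) == mask_selected else char
        for index, char in enumerate(text)
    )

card = "4111111111111111"
masked = redact(card, action="MASK", which="FIRST", count=4, mask_with="#")
```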

Compatible modes: MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Unique hashing

Applies a hash transformation to a given value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Example:

    transformations:
      - columns: ["card_id"]
        params:
          type: unique_hashing

Properties

  • type = unique_hashing

  • max_value: Number (double).
    Maximum value to generate; null means no limit.

  • precision: Integer.
    Maximum precision to generate (for example, if the value is 3, the maximum value is 999); null means no limit. If both max_value and precision are specified, the smaller of the two limits applies.
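
How the two limits combine can be sketched in Python. The function effective_max is hypothetical and only restates the rule given for max_value and precision.

```python
def effective_max(max_value=None, precision=None):
    """Return the upper limit implied by max_value and precision, or None if unlimited."""
    limits = []
    if max_value is not None:
        limits.append(max_value)
    if precision is not None:
        limits.append(10 ** precision - 1)  # e.g. precision 3 allows values up to 999
    return min(limits) if limits else None

limit_from_precision = effective_max(precision=3)
limit_from_both = effective_max(max_value=5000, precision=3)
```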

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Date unique hashing

Applies a hash transformation to a date-time value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Example:

    transformations:
      - columns: ["create_date"]
        params:
          type: date_time_unique_hashing
          min: 2000-01-01T12:00:00Z
          max: 2022-01-01T12:00:00Z

Properties

  • type = date_time_unique_hashing

  • min: optional String (date-time).
    Minimum value

  • max: optional String (date-time).
    Maximum value

Compatible modes: MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Date generator

Output data is sampled from a parameterized continuous distribution and transformed into dates. If the parameters are not given, they are extracted from the original data.

Example:

    transformations:
      - columns:
          - "date_of_birth"
        params:
          type: date_generator
          mean: 2018-02-01T12:00:00Z
          std: 2d 4h 45m 12s 434ms
          min: 2000-01-01T12:00:00Z
          max: 2022-01-01T12:00:00Z

Properties

  • type = date_generator

  • mean: optional String (date-time).
    Average date of the sampled distribution

  • std: optional String.
    Standard deviation. The following formats are accepted:

    • ISO-8601 Duration format, e.g., P1DT2H3M4.058S.

    • A concise format, e.g., 10s, 1h 30m or -(1h 30m).

    • Milliseconds without the specific unit, e.g., 12534.

  • min: optional String (date-time).
    Minimum value

  • max: optional String (date-time).
    Maximum value
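
Two of the accepted std formats, the concise form and plain milliseconds, can be illustrated with a small Python parser. This is a sketch under assumptions: the unit letters d, h, m, s and ms are inferred from the examples, the negated form -(1h 30m) and ISO-8601 durations are not handled, and the function parse_std is hypothetical.

```python
import re
from datetime import timedelta

UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds", "ms": "milliseconds"}

def parse_std(text):
    """Parse '2d 4h 45m 12s 434ms' or a bare millisecond count into a timedelta."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        return timedelta(milliseconds=int(text))  # no unit: milliseconds
    parts = {}
    for component in text.split():
        match = re.fullmatch(r"(\d+)(ms|d|h|m|s)", component)
        if match is None:
            raise ValueError(f"unsupported duration component: {component!r}")
        unit = UNITS[match.group(2)]
        parts[unit] = parts.get(unit, 0) + int(match.group(1))
    return timedelta(**parts)

std = parse_std("2d 4h 45m 12s 434ms")
```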

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

UUID generator

The output column is filled with UUIDs.

Example:

    transformations:
      - columns: ["unique_id"]
        params:
          type: uuid_generator

Properties

  • type = uuid_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Constant numeric generator

Generates a single numeric value for the entire column

Example:

    transformations:
      - columns: [ "balance" ]
        params:
          type: constant_numeric
          value: 0.0

Example (range):

    transformations:
      - columns: [ "balance" ]
        params:
          type: constant_numeric
          min: 0.0
          max: 10000.0

Properties

  • type = constant_numeric

  • value: optional Number.
    numeric value to generate

  • min: optional Number.
    The lower boundary for the value (inclusive)

  • max: optional Number.
    The upper boundary for the value (exclusive)

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Constant string generator

Generates a single string value for the entire column

Example:

    transformations:
      - columns: [ "status" ]
        params:
          type: constant_string
          value: "ACTIVE"

Properties

  • type = constant_string

  • value: optional String.
    string value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Constant date generator

Generates a single date value for the entire column

Example:

    transformations:
      - columns: [ "creation_date" ]
        params:
          type: constant_date
          value: 2022-07-28T12:21:00Z

Example (range):

    transformations:
      - columns: [ "creation_date" ]
        params:
          type: constant_date
          min: 2022-07-01T00:00:00Z
          max: 2022-07-31T23:59:59Z

Properties

  • type = constant_date

  • value: optional String (date-time).
    date value to generate

  • min: optional String (date-time).
    The lower boundary for the value (inclusive)

  • max: optional String (date-time).
    The upper boundary for the value (exclusive)

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Constant boolean generator

Generates a single boolean value for the entire column

Example:

    transformations:
      - columns: [ "is_active" ]
        params:
          type: constant_boolean
          value: true

Properties

  • type = constant_boolean

  • value: optional Boolean.
    boolean value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: BOOLEAN

Supports multiple columns: No

Loop generator

Generates a sequence of elements in a loop that can be repeated. Data can be obtained from various sources or generated based on existing data. The following example demonstrates how to accomplish this:

    transformations:
      - columns:
          - transaction_type
        params:
          type: "loop_generator"
          repeatable: true
          source:
            value_source: "PROVIDED"
            values:
              - "sent"
              - "skipping_value"
              - "received"
              - null

This transformation can be applied to several columns at once. The following example shows how the generator handles multiple columns with provided tuples of values that always appear together:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: "loop_generator"
          repeatable: true
          source:
            value_source: "MULTIPLE_PROVIDED"
            values:
              - productcode: "P1"
                productname: "Product 1"
              - productcode: "P2"
                productname: "Product 2"
              - productcode: "P3"
                productname: "Product 3"
              - productcode: null
                productname: null

You can also specify a CSV file as the source of elements. The following example shows the complete configuration:

    transformations:
      - columns:
          - productcode
          - productname
        params:
          type: "loop_generator"
          repeatable: true
          source:
            value_source: "CSV_FILE"
            path: src/e2e/resources/data_with_header_multi.csv
            null_values: null
            format:
              encoding: "UTF-8"
              delimiter: ","
              trim: true
              columns:
                column_accessor_type: NAME
                names: ["code", "name"]

Properties

  • type = loop_generator

  • repeatable: Boolean.
    The list will repeat itself in a loop if necessary. Otherwise, an exception will be thrown when the record number exceeds the size of the list.
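
The effect of repeatable can be sketched in Python. The function loop_values is hypothetical, and the exception type is an assumption, since the source only says that an exception is thrown.

```python
from itertools import cycle, islice

def loop_values(values, record_count, repeatable):
    """Return record_count values, cycling through the list only if repeatable."""
    if repeatable:
        return list(islice(cycle(values), record_count))
    if record_count > len(values):
        raise IndexError("record number exceeds the size of the list")
    return list(values[:record_count])

values = ["sent", "skipping_value", "received", None]
looped = loop_values(values, 6, repeatable=True)
```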

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Constant XML generator

Generates a single XML value for the entire column.

Example:

    transformations:
      - columns: [ "xml_column" ]
        params:
          type: constant_xml
          value: "<root>test</root>"

Properties

  • type = constant_xml

  • value: String.
    XML string value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

JSON Pointer Transformer

Transforms JSON value nodes indicated by JSON pointers. The rest of the values are kept as is.

Example:

    transformations:
      - columns: ["productspec"]
        params:
          type: "json_pointer_transformer"
          specifications:
            - pointers: [ "/sku" ]
              transformation:
                type: "format_preserving_hashing"
            - pointers: [ "/tags/0" ]
              transformation:
                type: "format_preserving_hashing"
              ignore_errors: true

Properties

  • type = json_pointer_transformer

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

XML XPath Transformer

Transforms XML value nodes selected by XPath queries. The rest of the values are kept as is.

Example:

    transformations:
      - columns: ["productspec"]
        params:
          type: "xpath_transformer"
          specifications:
            - queries: [ "/sku" ]
              transformation:
                type: "format_preserving_hashing"

Properties

  • type = xpath_transformer

  • encoding: String.
    Specifies the encoding format.

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Void Generator

An auxiliary transformer that throws an error when called. It is used only when it is necessary to ignore the processing of the entire table.

Properties

  • type = void_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Scripting Transformer

The scripting transformer lets you implement your own logic for both the GENERATION and MASKING modes. Currently, only JavaScript is supported.

The script for the GENERATION mode must define a lambda function that returns a dictionary whose keys are column names and whose values are the desired values of the record. If the transformer is applied to a single column, the value itself may be returned instead of a dictionary.

The script for the MASKING mode must define a lambda function with two arguments, ctx and originalRecord. The following example shows how to use a custom script for multiple columns in the MASKING mode:

    transformations:
      - columns:
          - textdescription
          - htmldescription
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          script:
            code: |
              /**
               * @typedef { Object.<string, *> | * } Result
               *
               * @param {GenerationContext} ctx
               * @param {Record} originalRecord
               * @returns {Result}
               */
              (ctx, originalRecord) => {
                const dict = originalRecord.asMap();
                const textDescriptionColumn = columns.get(0);
                const htmlDescriptionColumn = columns.get(1);
                const descriptionWithoutSpaces = dict.get(textDescriptionColumn).trim();
                return { [textDescriptionColumn]: descriptionWithoutSpaces, [htmlDescriptionColumn]: descriptionWithoutSpaces };
              }

The script for the GENERATION mode should define a lambda function with the single argument ctx. The following example shows how to use a custom script for GENERATION:

    transformations:
      - columns:
          - credit_card
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          additional_properties:
            first_credit_card_digit: 4
          init_script:
            code: |
              /**
               * @returns {String}
               */
              function generateRandomCreditCardNumber() {
                let creditCardNumber = additionalProperties["first_credit_card_digit"]

                for (let i = 1; i < 16; i++) {
                  const digit = Math.floor(Math.random() * 10);
                  creditCardNumber += digit.toString();
                }

                return creditCardNumber;
              }
          script:
            code: |
              /**
               * @typedef { Object.<string, *> | * } Result
               *
               * @param {GenerationContext} ctx
               * @returns {Result}
               */
              (ctx) => generateRandomCreditCardNumber();

Properties

  • type = scripting_transformer

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Categories source configuration

Used in: categories

optional Object.

Depending on value_source property value, can be one of the following:

PROVIDED

CSV_FILE

MULTIPLE_PROVIDED

MULTIPLE_CSV_FILE

Numeric type

optional String.
Type of numbers used by a generator

Enum values
  • INT

  • LONG

  • DOUBLE

  • FLOAT

  • BIG_DECIMAL

  • BIG_INTEGER

  • SHORT

  • BYTE

  • UNSIGNED_BYTE

  • UNSIGNED_INTEGER

  • UNSIGNED_LONG

  • UNSIGNED_SHORT

Foreign key distribution

Used in: distribution

String.
Distributions:

  • POISSON (default): Generates parent-child relations based on the Poisson distribution, where lambda represents the ratio of parent to child count.

  • ROUND_ROBIN: Assigns children to parents using a round-robin algorithm.

  • ORIGINAL: Preserves the original reference ratio. Note that this option may require more computational resources to recalculate the model, and that the model may generate a distribution that does not perfectly match the original one.

Enum values
  • POISSON

  • ROUND_ROBIN

  • ORIGINAL
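As a rough illustration of ROUND_ROBIN, the following JavaScript sketch hands out parents to children in turn. It is an assumption about what "round-robin" means here, not the product's code; assignRoundRobin is a hypothetical name, and POISSON and ORIGINAL are not covered.

```javascript
// Hypothetical helper: assigns each child (by index) to a parent key in turn.
function assignRoundRobin(parentKeys, childCount) {
  if (parentKeys.length === 0) {
    throw new Error("At least one parent is required");
  }
  const assignment = [];
  for (let child = 0; child < childCount; child++) {
    // Child 0 gets the first parent, child 1 the second, and so on, wrapping around.
    assignment.push(parentKeys[child % parentKeys.length]);
  }
  return assignment;
}

console.log(assignRoundRobin([10, 20, 30], 7).join(", ")); // 10, 20, 30, 10, 20, 30, 10
```

Each parent ends up with either floor(children / parents) or one more child, which is the even spread that distinguishes this option from a random distribution.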

Parent data mode

Used in: parent_data_mode

optional String.
What part of parent data to consider for the child table processing. Default is ALL.

Enum values
  • NEW

  • OLD

  • ALL

Column transformation parameters

Used in: transformations

Object.
List of column names associated with Transformation parameters.

Properties

  • columns: array of String.
    List of columns that are affected by this generator.

Hashing group

Used in: groups

Object.
A pair consisting of a selector and a list of alphabets. The selector chooses characters from the input string; an alphabet is a set of characters used to replace the selected ones.

Properties

Format preserving hashing filter

Used in: filter

optional Object.

Depending on type property value, can be one of the following:

first

last

characters

substring

regex

Alphabet

Used in: alphabets, alphabets

Object.

Depending on type property value, can be one of the following:

digits

lower_letters

upper_letters

custom

Value length exceeded mode

optional String.
Action to take when a value exceeds the column length. Modes:

  • IGNORE: error if the value exceeds column length

  • TRUNCATE (default): truncate value to the field length

Enum values
  • IGNORE

  • TRUNCATE

Action

Used in: action

String.

Enum values
  • KEEP

  • MASK

Position

Used in: which

String.

Enum values
  • FIRST

  • LAST

Source of elements

Used in: source

optional Object.

Depending on value_source property value, can be one of the following:

PROVIDED

CSV_FILE

MULTIPLE_PROVIDED

JSON Pointer Transformer Specification

Used in: specifications

Properties

  • pointers: array of String.
    JSON Pointers, as specified by RFC 6901.

  • ignore_errors: Boolean.
    Controls the behaviour when no JSON node is found at the pointer or the node has a type incompatible with the specified transformer. If this setting is true, the found JSON node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.
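To make the pointer and ignore_errors semantics concrete, here is a minimal JavaScript sketch of RFC 6901 resolution. It is an illustration under stated assumptions, not the transformer's implementation: parsePointer and transformAtPointer are hypothetical names, only string nodes are treated as compatible, and the mask function is a placeholder rather than format_preserving_hashing.

```javascript
// Splits an RFC 6901 pointer into reference tokens, unescaping ~1 and then ~0.
function parsePointer(pointer) {
  if (pointer === "") return [];
  if (!pointer.startsWith("/")) throw new Error("Invalid JSON pointer: " + pointer);
  return pointer
    .slice(1)
    .split("/")
    .map((token) => token.replace(/~1/g, "/").replace(/~0/g, "~"));
}

// Hypothetical helper: applies `transform` to the string node at `pointer`.
function transformAtPointer(doc, pointer, transform, ignoreErrors = false) {
  const tokens = parsePointer(pointer);
  if (tokens.length === 0) throw new Error("Root pointer is not handled in this sketch");
  let parent = doc;
  for (const token of tokens.slice(0, -1)) {
    parent = parent !== null && typeof parent === "object" ? parent[token] : undefined;
  }
  const key = tokens[tokens.length - 1];
  const found =
    parent !== null &&
    typeof parent === "object" &&
    Object.prototype.hasOwnProperty.call(parent, key);
  if (!found || typeof parent[key] !== "string") {
    if (ignoreErrors) return doc; // the found node, if any, stays unchanged
    throw new Error("No compatible JSON node at " + pointer);
  }
  parent[key] = transform(parent[key]);
  return doc;
}

const spec = { sku: "AB-12", tags: [] };
const mask = (s) => s.replace(/[A-Za-z0-9]/g, "x"); // placeholder, not a hash
transformAtPointer(spec, "/sku", mask);
transformAtPointer(spec, "/tags/0", mask, true); // missing node, error ignored
console.log(JSON.stringify(spec)); // {"sku":"xx-xx","tags":[]}
```

Without the final true argument, the second call would throw, because the tags array has no element at index 0.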

XML XPath Transformer Specification

Used in: specifications

Properties

  • ignore_errors: Boolean.
    Controls the behaviour when no XML node is found at the XPath or the node has a type incompatible with the specified transformer. If this setting is true, the found XML node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.

Scripting Language

Used in: language

String.

Enum values
  • JAVASCRIPT

Script

Used in: script, init_script

optional Object.
The script should define a lambda function that is called on every record or row, depending on the chosen method. The init_script is executed once on start. It is useful for defining variables and functions that are then available in the main script.

Properties

  • code: optional String.
    Script code

  • file: optional String.
    Script file location. For a local file system, the path can be absolute or relative to the application process’s working directory (not to be confused with working directory). The script can be located on the local file system, AWS S3, or Google Storage.

To be able to load scripts from S3 the property TDK_AWS_ENABLED==true should be set. More details can be found here.

The property TDK_GCP_ENABLED==true allows loading scripts from Google Storage. More details can be found here.

ScriptingAdditionalProperties

optional map of String keys to

Properties

  • value: Object.
    Additional properties to be used in the script and init_script. The dictionary variable additionalProperties is available from scripts.

Config categories source

With the config source, categories and weights are specified directly within the configuration.

Properties

  • value_source = PROVIDED

  • null_values: optional array of String.
    The values that should be treated as NULL values. Default is ["null"]

CSV file categories source

Properties

  • value_source = CSV_FILE

  • path: String.
    The path to the file on your file system.

  • null_values: optional array of String.
    The values that should be treated as NULL values. Default is ["null"]
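The role of null_values can be sketched in JavaScript as a simple lookup. This is an illustration only; toNullable is a hypothetical helper, and exact, case-sensitive matching is an assumption. The default list comes from the description above.

```javascript
// Hypothetical helper: turns a cell into null when it is listed in nullValues.
function toNullable(cell, nullValues = ["null"]) {
  return nullValues.includes(cell) ? null : cell;
}

console.log(toNullable("null"));             // null
console.log(toNullable("N/A", ["N/A", ""])); // null
console.log(toNullable("P1"));               // P1
```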

Config categories source for multiple columns

With the config source, categories and weights for multiple columns are specified directly within the configuration.

Properties

  • value_source = MULTIPLE_PROVIDED

  • null_values: optional array of String.
    The values that should be treated as NULL values. Default is ["null"]

CSV file categories source for multiple columns

Properties

  • value_source = MULTIPLE_CSV_FILE

  • null_values: optional array of String.
    The values that should be treated as NULL values. Default is ["null"]

  • path: String.
    The path to the file on your file system.

Hashing group selector

Used in: selector

Object.

Depending on type property value, can be one of the following:

digits

lower_letters

upper_letters

regex

word_characters

First N characters

Masks only the first N characters of the input string.

Properties

  • type = first

  • n: Integer (int32).

Last N characters

Masks only the last N characters of the input string.

Properties

  • type = last

  • n: Integer (int32).

Specified characters

Masks only the specified characters of the input string.

Properties

  • type = characters

  • characters: String.

  • ignore_case: Boolean.

Specified substring

Masks only the specified substring of the input string.

Properties

  • type = substring

  • substring: String.

  • ignore_case: Boolean.

Regex filter

Masks only the characters matched by the regex.

Properties

  • type = regex

  • pattern: String.

  • ignore_case: Boolean.
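The way a filter narrows masking to part of the input can be sketched in JavaScript for the first, last, and characters filter types. This is a hedged illustration: selectPositions and maskSelected are hypothetical names, the handling of ignore_case is an assumption, and the replacement with "*" is a placeholder for the actual hashing into an alphabet.

```javascript
// Placeholder replacement; the real transformation hashes into an alphabet.
const star = () => "*";

// Returns a boolean array marking which characters the filter selects.
function selectPositions(input, filter) {
  const chars = Array.from(input);
  switch (filter.type) {
    case "first":
      return chars.map((_, i) => i < filter.n);
    case "last":
      return chars.map((_, i) => i >= chars.length - filter.n);
    case "characters": {
      // Assumption: ignore_case compares characters after lower-casing both sides.
      const norm = (c) => (filter.ignore_case ? c.toLowerCase() : c);
      const wanted = new Set(Array.from(filter.characters, (c) => norm(c)));
      return chars.map((c) => wanted.has(norm(c)));
    }
    default:
      throw new Error("Filter type not covered by this sketch: " + filter.type);
  }
}

// Replaces only the selected characters and keeps the rest as is.
function maskSelected(input, filter, replace = star) {
  const selected = selectPositions(input, filter);
  return Array.from(input)
    .map((c, i) => (selected[i] ? replace(c) : c))
    .join("");
}

console.log(maskSelected("4111-2222", { type: "first", n: 4 })); // ****-2222
console.log(maskSelected("4111-2222", { type: "last", n: 4 }));  // 4111-****
```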

Digits 0-9

Properties

  • type = digits

English letters in lowercase [a-z]

Properties

  • type = lower_letters

English letters in UPPERCASE [A-Z]

Properties

  • type = upper_letters

Custom alphabet

Custom alphabet that can consist of characters, Unicode blocks, and Unicode ranges. In total, it can contain from 1 to 2^16 characters.

Properties

  • type = custom

Element’s source provided from a configuration

Properties

  • value_source = PROVIDED

  • values: array of optional String.
    The list of repeatable elements

Element’s source provided from a CSV file

Properties

  • value_source = CSV_FILE

  • path: String.
    The path to the file on your file system.

  • null_values: optional array of String.
    The values that should be treated as NULL values. Default is ["null"]

Multi columns element’s source provided from a configuration

Properties

  • value_source = MULTIPLE_PROVIDED

Dictionary data type

Used in: type, type, column_type

optional String.
Data type of categories. Modes:

  • STRING (default): interpret values as strings

  • BOOLEAN: interpret values as booleans

  • NUMERIC: interpret values as doubles

Enum values
  • STRING

  • BOOLEAN

  • NUMERIC

Categories dictionary

Used in: values

map of String keys to Number.
The map can store keys in one of three formats: STRING, BOOLEAN or NUMERIC. The weights must be equal to or greater than zero.

Example of formats:

    "23": 123
    "owl": 534
    "true": 532

To disable one of the categories, simply set its weight value to zero.
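How weights translate into choices can be sketched in JavaScript as weighted sampling. This is a generic illustration and not necessarily how the generator samples; pickCategory is a hypothetical helper that takes a uniform number u in [0, 1) so that the result is reproducible.

```javascript
// Hypothetical helper: picks a category for a uniform number u in [0, 1).
function pickCategory(weights, u) {
  // Categories with a zero weight are disabled and never picked.
  const entries = Object.entries(weights).filter(([, w]) => w > 0);
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  if (total === 0) throw new Error("All categories are disabled");
  let threshold = u * total;
  for (const [category, w] of entries) {
    if (threshold < w) return category;
    threshold -= w;
  }
  return entries[entries.length - 1][0]; // guards against rounding near u = 1
}

const weights = { "23": 123, "owl": 534, "true": 532 };
console.log(pickCategory(weights, 0.5)); // owl
console.log(pickCategory({ ...weights, owl: 0 }, 0.5)); // true
```

In practice u would come from a random source such as Math.random(); a category's chance of being picked is then its weight divided by the sum of all weights.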

Parser configuration

Used in: format

optional Object.

Properties

  • encoding: String.
    Specifies the encoding format.

  • delimiter: String.
    Specifies a custom delimiter.

  • trim: Boolean.
    The trim option removes leading and trailing spaces from each cell.
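The combined effect of delimiter and trim on a single line can be sketched in JavaScript. This is a deliberately naive illustration, not the parser used by the product: parseLine is a hypothetical helper, the default values shown are assumptions, quoted fields are not handled, and encoding is out of scope.

```javascript
// Naive split: does not handle quoted fields or escaped delimiters.
// The defaults here are assumptions made for this sketch.
function parseLine(line, { delimiter = ",", trim = false } = {}) {
  const cells = line.split(delimiter);
  // trim removes leading and trailing spaces from each cell.
  return trim ? cells.map((cell) => cell.trim()) : cells;
}

console.log(parseLine(" P1 ; Product 1 ", { delimiter: ";", trim: true }).join("|")); // P1|Product 1
```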

Columns data types

optional map of String keys to Dictionary data type.

Config categories source

Used in: category_values

The values for multiple columns of a single category.

Properties

  • weight: Number.

Parser configuration for multiple columns

Used in: format

optional Object.

Properties

  • encoding: String.
    Specifies the encoding format.

  • delimiter: String.
    Specifies a custom delimiter.

  • trim: Boolean.
    The trim option removes leading and trailing spaces from each cell.

Digits 0-9

Properties

  • type = digits

Lower-case alphabetic characters

Properties

  • type = lower_letters

Upper-case alphabetic characters

Properties

  • type = upper_letters

Custom regex pattern

Properties

  • type = regex

  • pattern: String.

Word characters (equivalent of regex '\w+')

Properties

  • type = word_characters

Custom alphabet part

Used in: parts

optional Object.

Depending on type property value, can be one of the following:

characters

unicode_block

unicode_range

Parser configuration

Used in: format

optional Object.

Properties

  • encoding: String.
    Specifies the encoding format.

  • delimiter: String.
    Specifies a custom delimiter.

  • trim: Boolean.
    The trim option removes leading and trailing spaces from each cell.

Elements dictionary for multiple columns

Used in: values

map of String keys to optional String.

Column value extract method

Used in: columns

optional Object.
There are multiple ways to access categories and weights from a CSV file. By default, when the columns parameter is absent, the generator expects a file with more than one column, where the column with index 0 represents the category, and the column with index 1 represents the weight. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.

Depending on column_accessor_type property value, can be one of the following:

NAME

INDEX

Categories dictionary for multiple columns

Used in: values

map of String keys to String.

Column value extract method for multiple columns

Used in: columns

optional Object.
There are multiple ways to access categories and weights from a CSV file. By default, when the columns parameter is absent, the generator expects a file with (table columns size + 1) columns, where the column with index 0 represents the first category, and the last column represents the weight. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.

Depending on column_accessor_type property value, can be one of the following:

NAME

INDEX

Character set

Custom alphabet which can consist of 1 to (2^16) characters. All printable characters from Unicode Basic Multilingual Plane are supported.

Properties

  • type = characters

  • characters: String.

Unicode block

Unicode block by name. Name of the Unicode block formatted according to the results described in Java’s UnicodeBlock documentation. Examples: "BASIC_LATIN", "Basic Latin". Only the blocks from BMP (codepoints from 0x0000 to 0xFFFF) are supported. You can refer to the Unicode specification to find out the range for a block of interest.

Properties

  • type = unicode_block

  • name: String.

Unicode range

Unicode range of characters, specified by the integer codes of its first and last characters. You can use ranges from 0x0000 to 0xFFFF.

Properties

  • type = unicode_range

  • from: Integer (int32).

  • to: Integer (int32).
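A range given by from and to can be expanded into its characters with a few lines of JavaScript. This sketch assumes that both bounds are inclusive, which the description suggests but does not state outright; charactersInRange is a hypothetical helper, and it does not filter out unprintable or surrogate code units.

```javascript
// Hypothetical helper: expands an inclusive BMP range into its characters.
function charactersInRange(from, to) {
  if (from < 0x0000 || to > 0xffff || from > to) {
    throw new RangeError("Expected 0x0000 <= from <= to <= 0xFFFF");
  }
  const chars = [];
  for (let code = from; code <= to; code++) {
    chars.push(String.fromCharCode(code));
  }
  return chars;
}

console.log(charactersInRange(0x30, 0x39).join("")); // 0123456789
```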

Column value extract method

Used in: columns

optional Object.
There are several ways to access the categories and weights from a CSV file. By default, if the columns parameter is not specified, the generator will map the columns from the file to the transformation columns in order. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.

Depending on column_accessor_type property value, can be one of the following:

NAME

INDEX

String column name value extract

Properties

  • column_accessor_type = NAME

  • categories: String.

  • weights: String.

Column index value extract

Properties

  • column_accessor_type = INDEX

  • categories: Integer.

  • weights: Integer.

String column name value extract for multiple columns

Properties

  • column_accessor_type = NAME

  • weights: String.

Column index value extract

Properties

  • column_accessor_type = INDEX

  • weights: Integer.

String column name value extract

Properties

  • column_accessor_type = NAME

  • names: array of String.

Column index value extract

Properties

  • column_accessor_type = INDEX

  • indexes: array of Integer.

Categories dictionary for multiple columns

Used in: categories

map of String keys to String.

Categories dictionary for multiple columns

Used in: categories

map of String keys to Integer.