Transformations

optional Object.
Parameters of a transformation. Every transformation has a type key that holds the type name of the transformation, plus parameters that are specific to that transformation.

Depending on the value of the type property, it can be one of the following:

Key                         Modes                      Data types  Multiple columns
--------------------------  -------------------------  ----------  ----------------
categorical_generator       GENERATION, MASKING, KEEP  ANY         No
conditional_generator       GENERATION, MASKING, KEEP  ANY         Yes
continuous_generator        GENERATION, MASKING, KEEP  NUMERIC     No
quantile_generator          GENERATION, MASKING, KEEP  NUMERIC     No
copy_parent_generator       GENERATION, MASKING, KEEP  ANY         Yes
foreign_key_generator       GENERATION, MASKING, KEEP  ANY         Yes
unique_generator            GENERATION, MASKING, KEEP  ANY         Yes
format_preserving_hashing   MASKING, KEEP              TEXT        No
formatted_string_generator  GENERATION, MASKING, KEEP  ANY         No
int_sequence_generator      GENERATION, MASKING, KEEP  NUMERIC     No
string_sequence_generator   GENERATION, MASKING, KEEP  TEXT        No
noising                     MASKING, KEEP              NUMERIC     No
null_generator              GENERATION, MASKING, KEEP  ANY         Yes
default_value_generator     GENERATION, MASKING, KEEP  ANY         Yes
passthrough                 MASKING, KEEP              ANY         Yes
person_generator            GENERATION, MASKING, KEEP  TEXT        Yes
address_generator           GENERATION, MASKING, KEEP  TEXT        Yes
finance_generator           GENERATION, MASKING, KEEP  TEXT        Yes
redaction                   MASKING, KEEP              TEXT        No
unique_hashing              MASKING, KEEP              NUMERIC     No
date_time_unique_hashing    MASKING, KEEP              DATE        No
date_generator              GENERATION, MASKING, KEEP  DATE        No
uuid_generator              GENERATION, MASKING, KEEP  ANY         No
constant_numeric            GENERATION, MASKING, KEEP  NUMERIC     No
constant_string             GENERATION, MASKING, KEEP  TEXT        No
constant_date               GENERATION, MASKING, KEEP  DATE        No
constant_boolean            GENERATION, MASKING, KEEP  BOOLEAN     No
constant_xml                GENERATION, MASKING, KEEP  ANY         No
json_pointer_transformer    MASKING, KEEP              ANY         No
xpath_transformer           MASKING, KEEP              ANY         No
void_generator              GENERATION, MASKING, KEEP  ANY         Yes
scripting_transformer       GENERATION, MASKING, KEEP  ANY         Yes

Categorical generator

Randomly samples from a given key/value mapping of categories to weights. Categories and weights can be provided from various data sources or learned from the data.

The following example demonstrates how to accomplish this:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            type: STRING
            value_source: PROVIDED
            nullable_weight: 0.1
            values:
              "sent": 0.5
              "received": 0.4
              "skipping_value": 0.0

You can also specify a CSV file as the source of categories. The following example shows the complete configuration:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            type: STRING
            value_source: CSV_FILE
            path: src/e2e/resources/data_with_header.csv
            nullable_weight: 239
            format:
              columns:
                column_accessor_type: NAME
                categories: "title"
                weights: "rank"
              encoding: "UTF-8"
              delimiter: ","
              trim: true

For simple scenarios, a minimal configuration that relies on the default values of the omitted settings may be sufficient. The following example illustrates this approach:

    transformations:
      - columns:
          - transaction_type
        params:
          type: categorical_generator
          categories:
            type: STRING
            value_source: CSV_FILE
            path: src/e2e/resources/data.csv

This generator normalizes weights to the probability interval. If the sum of the weights exceeds the range of the double format, use weights with a smaller scale factor.

Properties

  • type = categorical_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No
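
The weight normalization mentioned above can be illustrated outside the tool with a short Python sketch. It assumes that normalization means dividing each weight by the sum of all weights; the helper names normalize_weights and sample_category are hypothetical, and the handling of nullable_weight is deliberately left out because its exact semantics are not described here.

```python
import random


def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1.0 (assumed behaviour)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {category: weight / total for category, weight in weights.items()}


def sample_category(probabilities, rng):
    """Draw one category according to its probability."""
    categories = list(probabilities)
    weights = [probabilities[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Same categories and weights as in the PROVIDED example above.
weights = {"sent": 0.5, "received": 0.4, "skipping_value": 0.0}
probabilities = normalize_weights(weights)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_category(probabilities, rng) for _ in range(1000)]
```

Under this assumption, a category with weight 0.0 is never produced, and only the ratio between the weights matters, not their absolute scale.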

Conditional generator

Uses one of two transformations/generators depending on the value of a given field of the parent table. For example, the conditional generator makes it possible to use different generators depending on the value of the "gender" column of the parent table.

Example:

    transformations:
      - columns: [ "status" ]
        params:
          type: conditional_generator
          conditional_table: "public.delivery"
          conditional_column: "status"
          conditional_value: "DONE"
          if_true:
            type: constant_string
            value: "CLOSED"
          if_false:
            type: constant_string
            value: "OPEN"

Properties

  • type = conditional_generator

  • conditional_column: String.
    Parent column.

  • conditional_table: optional String.
    Parent table.

  • conditional_value: String.
    Value to compare against. If the value of the parent column is equal to conditional_value, the if_true generator is used; otherwise, the if_false generator is used.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes
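
The selection rule can be sketched in a few lines of Python. This is only an illustration of the if_true / if_false branching described above, assuming plain equality comparison; make_constant and conditional_generate are hypothetical helpers, with generators modelled as zero-argument callables.

```python
def make_constant(value):
    """Stand-in for a constant_string generator: always returns `value`."""
    return lambda: value


def conditional_generate(parent_value, conditional_value, if_true, if_false):
    """Pick the generator based on equality with conditional_value, then call it."""
    chosen = if_true if parent_value == conditional_value else if_false
    return chosen()


closed = make_constant("CLOSED")
opened = make_constant("OPEN")

# Values of the parent column "status" for three rows.
parent_statuses = ["DONE", "PENDING", "DONE"]
results = [
    conditional_generate(status, "DONE", closed, opened)
    for status in parent_statuses
]
```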

Continuous generator

Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.

Example:

    transformations:
      - columns:
          - "amount"
        params:
          type: continuous_generator
          mean: 354.21
          std: 98.96
          min: 0.0

Properties

  • type = continuous_generator

  • mean: optional Number (double).
    Mean of the sampled distribution

  • std: optional Number (double).
    Standard Deviation

  • min: optional Number (double).
    Minimum value

  • max: optional Number (double).
    Maximum value

  • round: Integer.
    If given, output data will be rounded to this number of digits

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Quantile generator

Given a list of probabilities (hist) and bin edges (bin_edges), the output data is sampled from a mixture of uniform distributions, where each uniform distribution i is chosen with probability hist[i] and its edges are given by bin_edges[i] and bin_edges[i + 1]. If the parameters are not given, they are fitted from the original data.

Example:

    transformations:
      - columns: ["amount"]
        params:
          type: quantile_generator
          hist: [0.5, 0.2, 0.3]
          bin_edges: [0.1, 0.15, 0.3, 0.45]
          numeric_type: DOUBLE

Properties

  • type = quantile_generator

  • hist: optional array of Number (double).
    Probabilities of each uniform distribution.

  • bin_edges: optional array of Number (double).
    Bin edges of each uniform distribution.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No
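
The sampling scheme described above (pick a bin with probability hist[i], then draw uniformly between its two edges) can be sketched as follows. This is an illustrative reimplementation in Python, not the tool's code; sample_quantile is a hypothetical name.

```python
import random


def sample_quantile(hist, bin_edges, rng):
    """Sample from a mixture of uniform distributions defined by hist and bin_edges."""
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must have exactly one more element than hist")
    # Choose bin i with probability proportional to hist[i].
    i = rng.choices(list(range(len(hist))), weights=hist, k=1)[0]
    # Draw uniformly between the edges of the chosen bin.
    return rng.uniform(bin_edges[i], bin_edges[i + 1])


# Same parameters as in the example above.
hist = [0.5, 0.2, 0.3]
bin_edges = [0.1, 0.15, 0.3, 0.45]

rng = random.Random(42)  # fixed seed for reproducibility
samples = [sample_quantile(hist, bin_edges, rng) for _ in range(10000)]
share_first_bin = sum(1 for s in samples if s < 0.15) / len(samples)
```

With these parameters, roughly half of the samples fall into the first bin, between 0.1 and 0.15.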

Copy parent generator

Copies values from the parent table. Can be used for de-normalization of the database, e.g., for copying the address or phone_number field from the customers table to the orders table.

Example:

    transformations:
      - columns: [ "phone_number" ]
        params:
          type: copy_parent_generator
          parent_tables: [ "public.employees" ]
          parent_columns: [ "phone_number" ]

Properties

  • type = copy_parent_generator

  • parent_columns: array of String.
    Columns to copy the values from.

  • parent_tables: array of String.
    Tables to copy the values from.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Foreign key generator

Fills columns with the primary key values of a random row of the parent table.

Normally this generator is created implicitly wherever data generation for tables related by foreign keys is needed, so the user should not need to set it up explicitly.

Properties

  • type = foreign_key_generator

  • referred_schema: optional String.

  • referred_table: optional String.

  • referred_fields: optional array of String.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Unique generator

This generator is intended for the case where primary key values are part of the foreign key.

Normally this generator is created implicitly wherever data generation for tables related by foreign keys is needed, so the user should not need to set it up explicitly.

Properties

  • type = unique_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Format preserving hashing

A hash transformation is applied to each character of the given text that belongs to a configured group, so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.

Examples:

Default configuration:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: digits
              alphabets:
                - type: digits
            - selector:
                type: lower_letters
              alphabets:
                - type: lower_letters
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters

Mask only last 5 characters:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: "last"
            n: 5

Mask only substring ignoring case:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: substring
            substring: sub
            ignore_case: true

Mask only a set of characters ignoring case:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          filter:
            type: characters
            characters: "abc"
            ignore_case: true

Mask characters selected by regex with a custom alphabet:

    transformations:
      - columns: ["phone_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: regex
                pattern: "[123]"
              alphabets:
                - type: custom
                  parts:
                    - type: characters
                      characters: "456"
                    - type: characters
                      characters: "789"
                    - type: unicode_block
                      name: LATIN_EXTENDED_D
                    - type: unicode_block
                      name: "Latin Extended-A"
                    - type: unicode_range
                      from: 0x0D00
                      to: 0x0D7F

Properties

  • type = format_preserving_hashing

  • groups:
    Hashing groups to apply on top of the specified filter. Multiple groups can be configured; in that case, the groups are tried in the order in which they are specified in the configuration to match a region within the filtered value. If a match is found, the corresponding group's alphabet is used for the transformation, and no other groups are tried for that region. This implies that the most specific hashing groups must be specified first. An unspecified parameter or null is equivalent to the following:

    transformations:
      - columns: ["registration_number"]
        params:
          type: format_preserving_hashing
          groups:
            - selector:
                type: digits
              alphabets:
                - type: digits
            - selector:
                type: lower_letters
              alphabets:
                - type: lower_letters
            - selector:
                type: upper_letters
              alphabets:
                - type: upper_letters
            - selector:
                type: word_characters
              alphabets:
                - type: digits
                - type: lower_letters
                - type: upper_letters

Compatible modes: MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No
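
The "first matching group wins" rule can be illustrated with a small Python sketch. The keyed HMAC-SHA256 replacement used here is an assumption chosen only to make the example deterministic; the actual hashing scheme of format_preserving_hashing is not described in this section, and GROUPS and mask are hypothetical names.

```python
import hashlib
import hmac
import string

# Ordered (selector, alphabet) pairs; earlier groups take precedence.
GROUPS = [
    (lambda c: c in string.digits, string.digits),
    (lambda c: c in string.ascii_lowercase, string.ascii_lowercase),
    (lambda c: c in string.ascii_uppercase, string.ascii_uppercase),
]


def mask(text, key, groups=GROUPS):
    """Replace each character using the alphabet of the first group that selects it."""
    out = []
    for position, char in enumerate(text):
        for selector, alphabet in groups:
            if selector(char):
                # Assumed stand-in for the real hash: keyed digest of position and character.
                message = f"{position}:{char}".encode("utf-8")
                digest = hmac.new(key, message, hashlib.sha256).digest()
                index = int.from_bytes(digest[:8], "big") % len(alphabet)
                out.append(alphabet[index])
                break  # no other group is tried for this character
        else:
            out.append(char)  # characters outside every group are kept as-is
    return "".join(out)


masked = mask("AB-12-cd", b"demo-key")
```

Digits stay digits, letters keep their case, and separators such as "-" pass through unchanged, which is what preserves the format.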

Formatted string generator

Generates a string column based on a given pattern. If no pattern is given, random characters are generated with a length similar to that of the original column.

Example:

    transformations:
      - columns:
          - "phone_number"
        params:
          type: formatted_string_generator
          pattern: "\\+44[0-9]{10}"

Properties

  • type = formatted_string_generator

  • pattern: optional String.
    Regular expression pattern from which the data is sampled

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Integer sequence generator

Generates a sequence of integers for an ID column that must contain unique values.

Properties

  • type = int_sequence_generator

  • start_from: Integer.
    The value to start the sequence from; defaults to 0. If the generator is used on existing data, set this to the maximum of the existing values plus 1.

Example:

    transformations:
      - columns:
          - "user_id"
        params:
          type: int_sequence_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No
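
The advice for start_from can be sketched as follows. The example assumes that the sequence increases by 1 per row, which is not stated explicitly above; next_start is a hypothetical helper.

```python
from itertools import count, islice


def next_start(existing_ids):
    """Return max(existing) + 1, or 0 (the documented default) when there is no data."""
    return max(existing_ids) + 1 if existing_ids else 0


existing_ids = [3, 17, 8]
start_from = next_start(existing_ids)

# Assumed behaviour: consecutive integers beginning at start_from.
new_ids = list(islice(count(start_from), 5))
```

Starting above the current maximum guarantees that the new values cannot collide with the existing ones.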

String sequence generator

Generates a sequence of strings for an ID column that must contain unique values. The generated strings consist of uppercase alphabetic and numeric characters.

Example:

    transformations:
      - columns: ["country_id"]
        params:
          type: string_sequence_generator

Properties

  • type = string_sequence_generator

  • length: optional Integer.
    Maximum length of the column, extracted from the database DDL if not given

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Noising transformation

Adds Laplacian noise to the input column to protect privacy while keeping the output values similar to the originals.

Example:

    transformations:
      - columns: ["product_price"]
        params:
          type: noising
          sensitivity: 23.47
          min: 0

Properties

  • type = noising

  • sensitivity: optional Number (double).
    Amount of noise to be added

  • min: optional Number (double).
    Hard minimum; output values smaller than it are truncated to this value

  • max: optional Number (double).
    Hard maximum; output values greater than it are truncated to this value

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No
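
A minimal sketch of Laplacian noising with truncation is shown below. It assumes that sensitivity is used directly as the scale of the Laplace distribution, which this section does not confirm; laplace_noise and noise_value are hypothetical names.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_value(value, sensitivity, rng, minimum=None, maximum=None):
    """Add noise, then truncate the result to the optional hard bounds."""
    noisy = value + laplace_noise(sensitivity, rng)
    if minimum is not None:
        noisy = max(noisy, minimum)
    if maximum is not None:
        noisy = min(noisy, maximum)
    return noisy


rng = random.Random(7)  # fixed seed for reproducibility
prices = [5.0, 19.99, 120.0]
# Same sensitivity and min as in the example above.
noisy_prices = [noise_value(p, 23.47, rng, minimum=0) for p in prices]
```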

Null generator

The output column is filled with null values

Example:

    transformations:
      - columns: ["empty_column"]
        params:
          type: null_generator

Properties

  • type = null_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Default value generator

The output column is populated with default or automatically generated values. In the absence of these options, null values will be inserted.

Example:

    transformations:
      - columns: ["empty_column"]
        params:
          type: default_value_generator

Properties

  • type = default_value_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Passthrough transformation

The output data is equal to the input; no transformation is applied.

Example:

    transformations:
      - columns: ["customer_number", "plate"]
        params:
          type: passthrough

Properties

  • type = passthrough

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Person generator

Generate personal fields (e.g., name, surname, title) and keep them consistent across columns.

Available templates are:

  • ${email}

  • ${first_name}

  • ${male_first_name}

  • ${female_first_name}

  • ${last_name}

  • ${full_name}

  • ${username}

  • ${company}

  • ${phone_national}

  • ${phone_international}

  • ${ssn}

Supported locales:

ar, bg, ca, ca-CAT, cs, da-DK, de, de-AT, de-CH, en, en-AU, en-CA, en-GB, en-IND, en-MS, en-NEP, en-NG, en-NZ, en-PAK, en-SG, en-UG, en-US, en-ZA, en-PH, es, es-MX, fa, fi-FI, fr, he, hu, in-ID, it, ja, ko, nb-NO, nl, pl, pt, pt-BR, ru, sk, sv, sv-SE, tr, uk, vi, zh-CN, zh-TW

Example for several columns:

    transformations:
      - columns: ["first_name", "last_name"]
        params:
          type: person_generator
          column_templates: ["${first_name}", "${last_name}"]

Example for a single column:

    transformations:
      - columns: ["full_name"]
        params:
          type: person_generator
          column_templates: ["${first_name} ${last_name}"]

Properties

  • type = person_generator

  • column_templates: array of String.
    For each column, the template to be used to generate personal data

  • consistent_with_column: String.
    If given, the column with which the generated values must be consistent. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name. The "self" value means consistency with the source value.

  • locale: optional String.
    Locale used to generate names from a particular geographical area. Defaults to 'en-GB', which corresponds to British names.

  • column_lengths: optional array of Integer.
    Maximum allowable length for each column. Ignored when length_exceeded_mode: "IGNORE"

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Address generator

Generate address fields (e.g., street, zip code) and keep them consistent across columns.

Available templates are:

  • ${zip_code}

  • ${country}

  • ${city}

  • ${street_name}

  • ${house_number}

  • ${flat_number}

  • ${full_address}

  • ${street_address}

  • ${region}

  • ${latitude}

  • ${longitude}

  • ${coordinates}

  • ${time_zone}

Supported locales:

ar, bg, ca, ca-CAT, cs, da-DK, de, de-AT, de-CH, en, en-AU, en-CA, en-GB, en-IND, en-MS, en-NEP, en-NG, en-NZ, en-PAK, en-SG, en-UG, en-US, en-ZA, en-PH, es, es-MX, fa, fi-FI, fr, he, hu, in-ID, it, ja, ko, nb-NO, nl, pl, pt, pt-BR, ru, sk, sv, sv-SE, tr, uk, vi, zh-CN, zh-TW

Example for several columns:

    transformations:
      - columns: [ "street_name", "zip_code" ]
        params:
          type: address_generator
          column_templates: [ "${street_name}", "${zip_code}" ]

Example for a single column:

    transformations:
      - columns: ["address"]
        params:
          type: address_generator
          column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]

Properties

  • type = address_generator

  • column_templates: array of String.
    For each column, the template to be used to generate address data

  • consistent_with_column: String.
    If given, the column with which the generated values must be consistent. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street. The "self" value means consistency with the source value.

  • locale: optional String.
    Locale used to generate addresses from a particular geographical area. Defaults to 'en-GB', which corresponds to addresses in Great Britain.

  • column_lengths: optional array of Integer.
    Maximum allowable length for each column. Ignored when length_exceeded_mode: "IGNORE"

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Finance generator

Generate financial data.

Available templates:

  • ${credit_card}

  • ${bic}

  • ${iban}

  • ${nasdaq_ticker}

  • ${nyse_ticker}

  • ${stock_market}

  • ${us_routing_number}

The template credit_card (without the card type qualification) will result in a random type being picked.

The credit card type can be configured using the following templates:

  • ${credit_card.visa}

  • ${credit_card.mastercard}

  • ${credit_card.discover}

  • ${credit_card.american_express}

  • ${credit_card.diners_club}

  • ${credit_card.jcb}

  • ${credit_card.switch}

  • ${credit_card.solo}

  • ${credit_card.dankort}

  • ${credit_card.forbrugsforeningen}

  • ${credit_card.laser}

Example:

    transformations:
      - columns: [ "credit_card" ]
        params:
          type: finance_generator
          column_templates: [ "${credit_card.visa}" ]

Properties

  • type = finance_generator

  • column_templates: array of String.
    For each column, the template to be used to generate financial data

  • consistent_with_column: String.
    If given, the column with which the generated values must be consistent. The "self" value means consistency with the source value.

  • column_lengths: optional array of Integer.
    Maximum allowable length for each column. Ignored when length_exceeded_mode: "IGNORE"

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: Yes

Redaction masker

Some characters of the input string are replaced with the same masking character, producing partially masked text in the output.

Example:

    transformations:
      - columns: ["credit_card"]
        params:
          type: redaction
          action: MASK
          which: FIRST
          count: 4
          mask_with: "#"

Properties

  • type = redaction

  • count: Integer.
    Number of characters to be masked or kept; defaults to 4

  • mask_with: String.
    Character used to mask values; defaults to *

Compatible modes: MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No
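
The effect of action, which, count and mask_with can be sketched in Python. Several points are assumptions rather than documented behaviour: that KEEP leaves the selected characters intact and masks all others, and that which accepts a LAST value in addition to the FIRST value shown in the example. The redact function itself is hypothetical.

```python
def redact(text, action, which, count=4, mask_with="*"):
    """Mask (or keep) the first/last `count` characters of `text`."""
    count = min(count, len(text))
    if which == "FIRST":
        selected = set(range(count))
    elif which == "LAST":  # assumed counterpart of FIRST
        selected = set(range(len(text) - count, len(text)))
    else:
        raise ValueError(f"unsupported which: {which}")

    if action == "MASK":
        masked_positions = selected
    elif action == "KEEP":  # assumed: keep the selection, mask everything else
        masked_positions = set(range(len(text))) - selected
    else:
        raise ValueError(f"unsupported action: {action}")

    return "".join(
        mask_with if i in masked_positions else char
        for i, char in enumerate(text)
    )


card = "4" + "1" * 15
masked_first = redact(card, "MASK", "FIRST", count=4, mask_with="#")
kept_last = redact(card, "KEEP", "LAST")
```

Here masked_first corresponds to the configuration in the example above: the first four characters become "#" and the remaining twelve are unchanged.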

Unique hashing

Applies a hash transformation to a given value so that the output is encrypted but structural coherence is preserved: the same input hashed with the same key always produces the same output. Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Example:

    transformations:
      - columns: ["card_id"]
        params:
          type: unique_hashing

Properties

  • type = unique_hashing

  • max_value: Number (double).
    Maximum value to generate; null means no limit

  • precision: Integer.
    Maximum precision to generate (e.g., if the value is 3, the maximum value is 999); null means no limit. If both max_value and precision are specified, the smaller of the two limits is applied

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No
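
The interaction between max_value and precision can be written down as a small Python sketch. It assumes that "minimal value is applied" means taking the smaller of the two upper limits; effective_max is a hypothetical helper.

```python
def effective_max(max_value=None, precision=None):
    """Combine max_value and precision into one upper limit (None means no limit)."""
    limits = []
    if max_value is not None:
        limits.append(max_value)
    if precision is not None:
        # A precision of 3 allows at most three digits, i.e. values up to 999.
        limits.append(10 ** precision - 1)
    return min(limits) if limits else None
```

For example, effective_max(max_value=5000, precision=3) is limited by the precision, while effective_max(max_value=500, precision=3) is limited by max_value.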

Date unique hashing

Applies a hash transformation to a date-time value so that the output is encrypted but structural coherence is preserved: the same input hashed with the same key always produces the same output. Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Example:

    transformations:
      - columns: ["create_date"]
        params:
          type: date_time_unique_hashing
          min: 2000-01-01T12:00:00Z
          max: 2022-01-01T12:00:00Z

Properties

  • type = date_time_unique_hashing

  • min: optional String (date-time).
    Minimum value

  • max: optional String (date-time).
    Maximum value

Compatible modes: MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Date generator

Output data is sampled from a parameterized continuous distribution and transformed into dates. If the parameters are not given, they are extracted from the original data.

Example:

    transformations:
      - columns:
          - "date_of_birth"
        params:
          type: date_generator
          mean: 2018-02-01T12:00:00Z
          std: 2d 4h 45m 12s 434ms
          min: 2000-01-01T12:00:00Z
          max: 2022-01-01T12:00:00Z

Properties

  • type = date_generator

  • mean: optional String (date-time).
    Average date of the sampled distribution

  • std: optional String.
    Standard deviation. The following formats are accepted:

    • ISO-8601 Duration format, e.g., P1DT2H3M4.058S.

    • The concise format described here, e.g., 10s, 1h 30m or -(1h 30m)

    • Milliseconds without the specific unit, e.g., 12534.

  • min: optional String (date-time).
    Minimum value

  • max: optional String (date-time).
    Maximum value

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No
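
To make the accepted std formats more concrete, the following Python sketch parses two of them: the concise form used in the example (such as 2d 4h 45m 12s 434ms) and plain milliseconds. The set of unit suffixes is inferred from that example, and the negated form and ISO-8601 durations are intentionally not handled; parse_concise_duration is a hypothetical helper, not part of the tool.

```python
import re
from datetime import timedelta

# Unit suffixes inferred from the example value "2d 4h 45m 12s 434ms".
_UNITS = {
    "d": "days",
    "h": "hours",
    "m": "minutes",
    "s": "seconds",
    "ms": "milliseconds",
}
_TOKEN = re.compile(r"(\d+)(ms|d|h|m|s)")


def parse_concise_duration(text):
    """Parse '1h 30m'-style durations or bare milliseconds into a timedelta."""
    text = text.strip()
    if text.isdigit():
        # A number without a unit is interpreted as milliseconds.
        return timedelta(milliseconds=int(text))
    tokens = text.split()
    if not tokens:
        raise ValueError("empty duration")
    total = timedelta()
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported duration token: {token!r}")
        amount, unit = match.groups()
        total += timedelta(**{_UNITS[unit]: int(amount)})
    return total


std = parse_concise_duration("2d 4h 45m 12s 434ms")
```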

UUID generator

The output column is filled with UUIDs.

Example:

    transformations:
      - columns: ["unique_id"]
        params:
          type: uuid_generator

Properties

  • type = uuid_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Constant numeric generator

Generates a single numeric value for the entire column

Example:

    transformations:
      - columns: [ "balance" ]
        params:
          type: constant_numeric
          value: 0.0

Example (range):

    transformations:
      - columns: [ "balance" ]
        params:
          type: constant_numeric
          min: 0.0
          max: 10000.0

Properties

  • type = constant_numeric

  • value: optional Number.
    numeric value to generate

  • min: optional Number.
    The lower boundary for the value (inclusive)

  • max: optional Number.
    The upper boundary for the value (exclusive)

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Constant string generator

Generates a single string value for the entire column

Example:

    transformations:
      - columns: [ "status" ]
        params:
          type: constant_string
          value: "ACTIVE"

Properties

  • type = constant_string

  • value: optional String.
    string value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: TEXT

Supports multiple columns: No

Constant date generator

Generates a single date value for the entire column

Example:

    transformations:
      - columns: [ "creation_date" ]
        params:
          type: constant_date
          value: 2022-07-28T12:21:00Z

Example (range):

    transformations:
      - columns: [ "creation_date" ]
        params:
          type: constant_date
          min: 2022-07-01T00:00:00Z
          max: 2022-07-31T23:59:59Z

Properties

  • type = constant_date

  • value: optional String (date-time).
    date value to generate

  • min: optional String (date-time).
    The lower boundary for the value (inclusive)

  • max: optional String (date-time).
    The upper boundary for the value (exclusive)

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Constant boolean generator

Generates a single boolean value for the entire column

Example:

    transformations:
      - columns: [ "is_active" ]
        params:
          type: constant_boolean
          value: true

Properties

  • type = constant_boolean

  • value: optional Boolean.
    boolean value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: BOOLEAN

Supports multiple columns: No

Constant XML generator

Generates a single XML value for the entire column.

Example:

    transformations:
      - columns: [ "xml_column" ]
        params:
          type: constant_xml
          value: "<root>test</root>"

Properties

  • type = constant_xml

  • value: String.
    XML string value to generate

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

JSON Pointer Transformer

Transforms JSON value nodes indicated by JSON pointers. The rest of the values are kept as is.

Example:

    transformations:
      - columns: ["productspec"]
        params:
          type: "json_pointer_transformer"
          specifications:
            - pointers: [ "/sku" ]
              transformation:
                type: "format_preserving_hashing"
            - pointers: [ "/tags/0" ]
              transformation:
                type: "format_preserving_hashing"
              ignore_errors: true

Properties

  • type = json_pointer_transformer

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No
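
How a JSON pointer selects the node to transform can be sketched with the Python standard library. The sketch resolves pointers in the RFC 6901 style and applies a callable to the addressed value; str.upper merely stands in for a real transformation such as format_preserving_hashing, and the reading of ignore_errors as "skip pointers that cannot be resolved" is an assumption based on the example above. transform_at_pointer is a hypothetical name.

```python
import json


def _unescape(token):
    """Undo JSON Pointer escaping: ~1 means '/', ~0 means '~'."""
    return token.replace("~1", "/").replace("~0", "~")


def transform_at_pointer(document, pointer, transform, ignore_errors=False):
    """Apply `transform` to the value addressed by `pointer`, in place."""
    tokens = [_unescape(t) for t in pointer.split("/")[1:]]
    if not tokens:
        raise ValueError("pointer must address a node below the root")
    try:
        node = document
        for token in tokens[:-1]:
            node = node[int(token)] if isinstance(node, list) else node[token]
        last = tokens[-1]
        key = int(last) if isinstance(node, list) else last
        node[key] = transform(node[key])
    except (KeyError, IndexError, ValueError, TypeError):
        if not ignore_errors:
            raise
    return document


doc = json.loads('{"sku": "ab-12", "tags": ["new", "sale"], "price": 9.5}')
transform_at_pointer(doc, "/sku", str.upper)
transform_at_pointer(doc, "/tags/0", str.upper)
# Assumed meaning of ignore_errors: an unresolvable pointer is skipped.
transform_at_pointer(doc, "/missing/0", str.upper, ignore_errors=True)
```

Only the addressed nodes change; every other value in the document, such as price here, is kept as is.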

XML XPath Transformer

Transforms XML value nodes indicated by XPath queries. The rest of the values are kept as is.

Example:

    transformations:
      - columns: ["productspec"]
        params:
          type: "xpath_transformer"
          specifications:
            - queries: [ "/sku" ]
              transformation:
                type: "format_preserving_hashing"

Properties

  • type = xpath_transformer

  • encoding: String.
    Specifies the encoding format to use.

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: No

Void Generator

An auxiliary transformer that throws an error when called. It is used only when it is necessary to ignore the processing of the entire table.

Properties

  • type = void_generator

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Scripting Transformer

The scripting transformer lets you implement your own logic for both the GENERATION mode and the MASKING mode. Currently, only a JavaScript implementation is supported.

The script for the GENERATION mode must define a lambda function that returns a dictionary whose keys are column names and whose values are the desired values of the record. If the transformer is applied to a single column, the value itself may be returned instead of a dictionary.

The script for the MASKING mode must define a lambda function with the two arguments ctx and originalRecord. The following example shows how to use a custom script for multiple columns in MASKING mode:

    transformations:
      - columns:
          - textdescription
          - htmldescription
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          script:
            code: |
              /**
               * @typedef { Object.<string, *> | * } Result
               *
               * @param {GenerationContext} ctx
               * @param {Record} originalRecord
               * @returns {Result}
               */
              (ctx, originalRecord) => {
                const dict = originalRecord.asMap();
                const textDescriptionColumn = columns.get(0);
                const htmlDescriptionColumn = columns.get(1);
                const descriptionWithoutSpaces = dict.get(textDescriptionColumn).trim();
                return { [textDescriptionColumn]: descriptionWithoutSpaces, [htmlDescriptionColumn]: descriptionWithoutSpaces };
              }

The script for the GENERATION mode should define a lambda function with the single argument ctx. The following example shows how to use a custom script for GENERATION:

    transformations:
      - columns:
          - credit_card
        params:
          type: scripting_transformer
          language: "JAVASCRIPT"
          additional_properties:
            first_credit_card_digit: 4
          init_script:
            code: |
              /**
               * @returns {String}
               */
              function generateRandomCreditCardNumber() {
                let creditCardNumber = additionalProperties["first_credit_card_digit"]

                for (let i = 1; i < 16; i++) {
                  const digit = Math.floor(Math.random() * 10);
                  creditCardNumber += digit.toString();
                }

                return creditCardNumber;
              }
          script:
            code: |
              /**
               * @typedef { Object.<string, *> | * } Result
               *
               * @param {GenerationContext} ctx
               * @returns {Result}
               */
              (ctx) => generateRandomCreditCardNumber();

Properties

  • type = scripting_transformer

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Categories source configuration

Used in: categories

optional Object.

Depending on value_source property value, can be one of the following:

PROVIDED

CSV_FILE

Numeric type

optional String.
Type of numbers used by a generator.

Enum values
  • INT

  • LONG

  • DOUBLE

  • FLOAT

  • BIG_DECIMAL

  • BIG_INTEGER

  • SHORT

  • BYTE

  • UNSIGNED_BYTE

  • UNSIGNED_INTEGER

  • UNSIGNED_LONG

  • UNSIGNED_SHORT

Foreign key distribution

Used in: distribution

String.
Distributions:

  • POISSON (default) - Generates parent-child relations based on the Poisson distribution, where lambda represents the ratio of parent to child count.

  • ROUND_ROBIN - Assigns children to parents using a round-robin algorithm.

  • ORIGINAL - Preserves the original reference ratio. Note: this option may require more computational resources to recalculate the model, and the generated distribution may not perfectly match the original one.

Enum values
  • POISSON

  • ROUND_ROBIN

  • ORIGINAL
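
The difference between ROUND_ROBIN and POISSON can be sketched in a few lines. This is an illustration, not the TDK's implementation: the function names are hypothetical, and lambda is assumed here to be the mean number of children drawn per parent, which may differ from how the TDK interprets it.

```javascript
// Round-robin: child i is assigned to parent i mod parentCount.
function roundRobin(parentCount, childCount) {
  const parentOfChild = [];
  for (let i = 0; i < childCount; i++) parentOfChild.push(i % parentCount);
  return parentOfChild;
}

// Knuth's method for drawing one Poisson-distributed integer with mean lambda.
// `random` must return numbers in [0, 1); it is injectable for repeatable runs.
function samplePoisson(lambda, random = Math.random) {
  const limit = Math.exp(-lambda);
  let k = 0;
  let p = 1;
  do {
    k++;
    p *= random();
  } while (p > limit);
  return k - 1;
}

// Assumption: one child count is drawn independently for every parent.
function poissonChildCounts(parentCount, lambda, random = Math.random) {
  return Array.from({ length: parentCount }, () => samplePoisson(lambda, random));
}

console.log(roundRobin(3, 7)); // [0, 1, 2, 0, 1, 2, 0]
console.log(poissonChildCounts(5, 2)); // varies from run to run
```

Round-robin gives every parent an almost equal number of children, while Poisson sampling produces uneven counts, including parents with no children at all.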

Hashing group

Used in: groups

Object.
A pair consisting of a selector and a list of alphabets. The selector chooses characters from the input string; an alphabet is the set of characters used to replace the selected ones.

Properties

Format preserving hashing filter

Used in: filter

optional Object.

Depending on type property value, can be one of the following:

first

last

characters

substring

regex

Value length exceeded mode

optional String.
Action to take when a value exceeds the column length. Modes:

  • IGNORE - error if the value exceeds the column length

  • TRUNCATE (default) - truncate the value to the field length

Enum values
  • IGNORE

  • TRUNCATE
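
The two modes can be pictured with a small helper. This is a sketch under assumptions: applyLengthMode is a hypothetical name, and the IGNORE branch simply models the documented outcome (an error) by throwing; where the error actually originates in the TDK is not specified here.

```javascript
// Hypothetical helper: applies a length policy to a generated string value.
function applyLengthMode(value, maxLength, mode = "TRUNCATE") {
  if (value.length <= maxLength) return value; // fits, nothing to do
  if (mode === "TRUNCATE") return value.slice(0, maxLength);
  if (mode === "IGNORE") {
    // Models "error if the value exceeds column length".
    throw new Error(`Value of length ${value.length} exceeds column length ${maxLength}`);
  }
  throw new Error(`Unknown mode: ${mode}`);
}

console.log(applyLengthMode("abcdef", 3)); // "abc"
```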

Action

Used in: action

String.

Enum values
  • KEEP

  • MASK

Position

Used in: which

String.

Enum values
  • FIRST

  • LAST

JSON Pointer Transformer Specification

Used in: specifications

Properties

  • pointers: array of String.
    JSON Pointers (as specified by RFC 6901).

  • ignore_errors: Boolean.
    Controls the behaviour when no JSON node is found at the pointer or the node has a type incompatible with the specified transformer. If this setting is true, the found JSON node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.
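
For readers unfamiliar with RFC 6901, the following sketch resolves a JSON Pointer against a parsed document and mirrors the ignore_errors behaviour described above. The function names resolvePointer and lookup are hypothetical; this is not the TDK's code, and the "-" array token from the RFC is treated as not found.

```javascript
// Resolve an RFC 6901 JSON Pointer against a parsed JSON document.
// Returns { found: true, value } or { found: false }.
function resolvePointer(doc, pointer) {
  if (pointer === "") return { found: true, value: doc }; // empty pointer = whole document
  if (!pointer.startsWith("/")) throw new Error(`Invalid JSON Pointer: ${pointer}`);
  let node = doc;
  for (const raw of pointer.slice(1).split("/")) {
    // Unescape "~1" to "/" first, then "~0" to "~", as RFC 6901 requires.
    const token = raw.replace(/~1/g, "/").replace(/~0/g, "~");
    if (Array.isArray(node)) {
      const isIndex = /^(0|[1-9][0-9]*)$/.test(token);
      if (!isIndex || Number(token) >= node.length) return { found: false };
      node = node[Number(token)];
    } else if (node !== null && typeof node === "object" &&
               Object.prototype.hasOwnProperty.call(node, token)) {
      node = node[token];
    } else {
      return { found: false };
    }
  }
  return { found: true, value: node };
}

// Hypothetical wrapper mirroring the documented ignore_errors setting.
function lookup(doc, pointer, ignoreErrors = false) {
  const result = resolvePointer(doc, pointer);
  if (!result.found && !ignoreErrors) throw new Error(`No JSON node at ${pointer}`);
  return result;
}

const sample = { customer: { emails: ["a@example.com", "b@example.com"] } };
console.log(lookup(sample, "/customer/emails/1").value); // "b@example.com"
```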

XML XPath Transformer Specification

Used in: specifications

Properties

  • ignore_errors: Boolean.
    Controls the behaviour when no XML node is found at the XPath or the node has a type incompatible with the specified transformer. If this setting is true, the found XML node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.

Scripting Language

Used in: language

String.

Enum values
  • JAVASCRIPT

Script

Used in: script, init_script

optional Object.
The script should define a lambda function that is called on every record or row, depending on the chosen method. The init_script is executed once at startup. It is useful for defining variables and functions that the main script can then use.

Properties

  • code: optional String.
    Script code

  • file: optional String.
    Script file location. The script can be located on the local file system, AWS S3, or Google Storage.

To be able to load scripts from S3 the property TDK_AWS_ENABLED==true should be set. More details can be found here.

The property TDK_GCP_ENABLED==true allows loading scripts from Google Storage. More details can be found here.

ScriptingAdditionalProperties

optional map of String keys to

Properties

  • value: Object.
    Additional properties to be used in the script and init_script. The dictionary variable additionalProperties is available from scripts.

Config categories source

With the config source, categories and weights are defined directly within the configuration.

Properties

  • value_source = PROVIDED

  • nullable_weight: Number.
    When one of the categories is NULL, manually assign the appropriate weight for this category.

CSV file categories source

Properties

  • value_source = CSV_FILE

  • path: String.
    The path to the file on your file system.

  • nullable_weight: Number.
    When one of the categories is NULL, manually assign the appropriate weight for this category.

Hashing group selector

Used in: selector

Object.

Depending on type property value, can be one of the following:

digits

lower_letters

upper_letters

regex

word_characters

Format preserving hashing group alphabet

Used in: alphabets

Object.

Depending on type property value, can be one of the following:

digits

lower_letters

upper_letters

custom

First N characters

Masks only the first N characters of the input string.

Properties

  • type = first

  • n: Integer (int32).

Last N characters

Masks only the last N characters of the input string.

Properties

  • type = last

  • n: Integer (int32).

Specified characters

Masks only the specified characters of the input string.

Properties

  • type = characters

  • characters: String.

  • ignore_case: Boolean.

Specified substring

Masks only the specified substring of the input string.

Properties

  • type = substring

  • substring: String.

  • ignore_case: Boolean.

Regex filter

Masks only the characters matched by the regex.

Properties

  • type = regex

  • pattern: String.

  • ignore_case: Boolean.
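
To make the five filter types concrete, the sketch below computes which character positions each filter would select for masking. The field names follow the properties listed above, but the matching logic is an assumption for illustration (for example, that a regex filter selects every matched character), and selectedIndices is a hypothetical name.

```javascript
// Returns the sorted indices of `input` that a filter would select for masking.
function selectedIndices(input, filter) {
  const picked = new Set();
  const addRange = (start, end) => {
    for (let i = Math.max(0, start); i < Math.min(input.length, end); i++) picked.add(i);
  };
  // Assumes case folding keeps the string length, which holds for ASCII text.
  const fold = (s) => (filter.ignore_case ? s.toLowerCase() : s);

  switch (filter.type) {
    case "first":
      addRange(0, filter.n);
      break;
    case "last":
      addRange(input.length - filter.n, input.length);
      break;
    case "characters": {
      const wanted = new Set(fold(filter.characters));
      for (let i = 0; i < input.length; i++) {
        if (wanted.has(fold(input[i]))) picked.add(i);
      }
      break;
    }
    case "substring": {
      const haystack = fold(input);
      const needle = fold(filter.substring);
      if (needle.length === 0) break; // nothing to search for
      for (let at = haystack.indexOf(needle); at !== -1;
           at = haystack.indexOf(needle, at + needle.length)) {
        addRange(at, at + needle.length);
      }
      break;
    }
    case "regex": {
      const re = new RegExp(filter.pattern, filter.ignore_case ? "gi" : "g");
      for (const m of input.matchAll(re)) addRange(m.index, m.index + m[0].length);
      break;
    }
    default:
      throw new Error(`Unknown filter type: ${filter.type}`);
  }
  return [...picked].sort((a, b) => a - b);
}

console.log(selectedIndices("abcdef", { type: "first", n: 2 })); // [0, 1]
console.log(selectedIndices("abcdef", { type: "last", n: 2 }));  // [4, 5]
```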

Category data type

Used in: type, type

String.
Data type of categories. Modes:

  • STRING (default) - Interpret values as strings

  • BOOLEAN - Interpret values as booleans

  • NUMERIC - Interpret values as doubles

Enum values
  • STRING

  • BOOLEAN

  • NUMERIC
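
The effect of the three modes on a category key can be sketched as follows. The helper name interpretCategory is hypothetical, and the accepted boolean spellings ("true" and "false" only) are an assumption rather than documented TDK behaviour.

```javascript
// Convert a category key (always a string in the configuration) to the chosen data type.
function interpretCategory(key, type = "STRING") {
  switch (type) {
    case "STRING":
      return key;
    case "BOOLEAN":
      // Assumption: only the lowercase literals are accepted.
      if (key === "true") return true;
      if (key === "false") return false;
      throw new Error(`Not a boolean: ${key}`);
    case "NUMERIC": {
      const n = Number(key);
      if (key.trim() === "" || Number.isNaN(n)) throw new Error(`Not a number: ${key}`);
      return n; // JavaScript numbers are doubles
    }
    default:
      throw new Error(`Unknown category data type: ${type}`);
  }
}

console.log(interpretCategory("23"));            // "23"
console.log(interpretCategory("23", "NUMERIC")); // 23
console.log(interpretCategory("true", "BOOLEAN")); // true
```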

Categories dictionary

Used in: values

map of String keys to Number.
The map can store keys in one of three formats: STRING, BOOLEAN or NUMERIC. The weights must be equal to or greater than zero.

Example of formats:

    "23": 123
    "owl": 534
    "true": 532

To disable one of the categories, simply set its weight value to zero.
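
How weights translate into sampling probabilities can be shown with a short weighted-choice sketch. This is a generic technique, not necessarily what the TDK does internally; pickCategory and the extra "disabled" category are hypothetical.

```javascript
// Pick a category with probability proportional to its weight.
// `random` returns a number in [0, 1); it is injectable to make runs repeatable.
function pickCategory(values, random = Math.random) {
  // Categories with weight zero are disabled and never selected.
  const entries = Object.entries(values).filter(([, weight]) => weight > 0);
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  if (total === 0) throw new Error("At least one category needs a positive weight");
  let threshold = random() * total;
  for (const [category, weight] of entries) {
    if (threshold < weight) return category;
    threshold -= weight;
  }
  return entries[entries.length - 1][0]; // guards against floating-point drift
}

const values = { "23": 123, "owl": 534, "true": 532, "disabled": 0 };
console.log(pickCategory(values)); // varies from run to run
```

With these weights, "owl" is chosen with probability 534 / 1189, and "disabled" is never chosen.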

Parser configuration

Used in: format

optional Object.

Properties

  • encoding: String.
    Specifies the encoding format.

  • delimiter: String.
    Specifies a custom delimiter.

  • trim: Boolean.
    The trim option removes leading and trailing spaces from each cell.

Digits 0-9

Properties

  • type = digits

Lower-case alphabetic characters

Properties

  • type = lower_letters

Upper-case alphabetic characters

Properties

  • type = upper_letters

Custom regex pattern

Properties

  • type = regex

  • pattern: String.

Word characters (equivalent of regex '\w+')

Properties

  • type = word_characters

Digits 0-9

Properties

  • type = digits

English letters in lowercase [a-z]

Properties

  • type = lower_letters

English letters in UPPERCASE [A-Z]

Properties

  • type = upper_letters

Custom alphabet

A custom alphabet can consist of characters, Unicode blocks, and Unicode ranges. In total, it can contain from 1 to 2^16 characters.

Properties

  • type = custom
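
The interplay of a hashing group's selector and alphabet can be illustrated with a toy implementation: characters matched by the selector are replaced with characters from the alphabet, and everything else keeps its position. All names here are hypothetical, the ASCII ranges chosen for the letter selectors are an assumption, and the FNV-1a hash is used only for illustration; it is not cryptographically secure, does not guarantee unique outputs, and is not the TDK's algorithm.

```javascript
// Regexes for the built-in selector types (a "regex" selector would supply its own pattern).
const SELECTORS = {
  digits: /[0-9]/,
  lower_letters: /[a-z]/,
  upper_letters: /[A-Z]/,
  word_characters: /\w/,
};

// Character sets for the built-in alphabet types.
const ALPHABETS = {
  digits: "0123456789",
  lower_letters: "abcdefghijklmnopqrstuvwxyz",
  upper_letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
};

// Toy 32-bit FNV-1a hash over UTF-16 code units; illustration only.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Replace each character matched by a group's selector with a character
// from that group's alphabet; unmatched characters are left untouched.
function hashPreservingFormat(input, groups, key = "secret") {
  return Array.from(input, (ch, i) => {
    const group = groups.find((g) => SELECTORS[g.selector].test(ch));
    if (!group) return ch;
    const alphabet = ALPHABETS[group.alphabet];
    return alphabet[fnv1a(`${key}:${input}:${i}`) % alphabet.length];
  }).join("");
}

const groups = [
  { selector: "digits", alphabet: "digits" },
  { selector: "upper_letters", alphabet: "upper_letters" },
];
console.log(hashPreservingFormat("AB-123-xy", groups)); // same shape, e.g. two letters, dash, three digits, "-xy"
```

The output is deterministic for a given input and key, and the dashes and lowercase letters are preserved because no group selects them.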

Column value extract method

Used in: columns

optional Object.
There are multiple ways to access categories and weights from a CSV file. By default, when the columns parameter is absent, the generator expects a file with more than one column, where the column with index 0 represents the category, and the column with index 1 represents the weight. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.

Depending on column_accessor_type property value, can be one of the following:

NAME

INDEX

Custom alphabet part

Used in: parts

optional Object.

Depending on type property value, can be one of the following:

characters

unicode_block

unicode_range

String column name value extract

Properties

  • column_accessor_type = NAME

  • categories: String.

  • weights: String.

Column index value extract

Properties

  • column_accessor_type = INDEX

  • categories: Integer.

  • weights: Integer.
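
The default layout and the NAME and INDEX accessors can be compared with a small reader. This is a sketch under assumptions: readCategories is a hypothetical function, quoting and escaping are not handled, and it assumes that a header row is present only when the NAME accessor is used.

```javascript
// Minimal CSV reader for illustration: no quoting or escaping support.
function readCategories(csvText, columns, { delimiter = ",", trim = true } = {}) {
  const rows = csvText
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map((line) => line.split(delimiter).map((cell) => (trim ? cell.trim() : cell)));

  let categoryIndex = 0; // default: column 0 holds the category
  let weightIndex = 1;   // default: column 1 holds the weight
  let dataRows = rows;   // default: no header row

  if (columns && columns.column_accessor_type === "INDEX") {
    categoryIndex = columns.categories;
    weightIndex = columns.weights;
  } else if (columns && columns.column_accessor_type === "NAME") {
    // Assumption: with NAME, the first row is a header naming the columns.
    const header = rows[0];
    categoryIndex = header.indexOf(columns.categories);
    weightIndex = header.indexOf(columns.weights);
    if (categoryIndex < 0 || weightIndex < 0) throw new Error("Column name not found in header");
    dataRows = rows.slice(1);
  }

  const result = {};
  for (const row of dataRows) result[row[categoryIndex]] = Number(row[weightIndex]);
  return result;
}

console.log(readCategories("owl, 5\ncat, 3")); // { owl: 5, cat: 3 }
```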

Character set

A set of 1 to 2^16 characters. All printable characters from the Unicode Basic Multilingual Plane are supported.

Properties

  • type = characters

  • characters: String.

Unicode block

A Unicode block specified by name. The name must be formatted as described in Java’s UnicodeBlock documentation. Examples: "BASIC_LATIN", "Basic Latin". Only blocks from the BMP (code points from 0x0000 to 0xFFFF) are supported. You can refer to the Unicode specification to find the range for a block of interest.

Properties

  • type = unicode_block

  • name: String.

Unicode range

A Unicode range of characters specified by the integer codes of its first and last characters. You can use ranges from 0x0000 to 0xFFFF.

Properties

  • type = unicode_range

  • from: Integer (int32).

  • to: Integer (int32).
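
How alphabet parts combine into one character set can be sketched as below. The buildAlphabet function is hypothetical, the `to` bound is assumed to be inclusive, and unicode_block parts are left out because JavaScript has no built-in lookup of Unicode block names.

```javascript
// Expand alphabet parts into a de-duplicated array of single characters.
function buildAlphabet(parts) {
  const chars = new Set();
  for (const part of parts) {
    if (part.type === "characters") {
      for (const ch of part.characters) chars.add(ch);
    } else if (part.type === "unicode_range") {
      if (part.from < 0x0000 || part.to > 0xffff || part.from > part.to) {
        throw new RangeError("Range must lie within 0x0000..0xFFFF with from <= to");
      }
      // Assumption: both bounds are inclusive.
      for (let code = part.from; code <= part.to; code++) chars.add(String.fromCharCode(code));
    } else {
      throw new Error(`Unsupported part type in this sketch: ${part.type}`);
    }
  }
  if (chars.size < 1 || chars.size > 2 ** 16) {
    throw new RangeError("Alphabet must have 1 to 2^16 characters");
  }
  return [...chars];
}

const alphabet = buildAlphabet([
  { type: "characters", characters: "ab" },
  { type: "unicode_range", from: 0x61, to: 0x64 }, // 'a'..'d'
]);
console.log(alphabet.join("")); // "abcd"
```

Overlapping parts are merged, so a character that appears in several parts occurs only once in the resulting alphabet.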