Transformations List

Generators List

Generators are compatible with all table modes.

categorical_generator

Randomly sample from a given set of categories with the given probabilities. Categories and probabilities can either be provided or learned from the data. If one of them is provided, the other must be provided as well.

Parameters:

  • categories: List<String>?: List of categories to sample from. String, numeric and boolean values are supported.

  • probabilities: List<Double>?: Probabilities for each category (must have the same size as categories).

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: STRING, NUMERIC, BOOLEAN

Supports multiple columns: No

Example:

column_params:
- columns:
  - "transaction_type"
  params:
    type: "categorical_generator"
    categories:
      type: string
      values:
      - "SENT"
      - "RECEIVED"
    probabilities:
    - 0.6
    - 0.4
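
The sampling described above can be sketched in a few lines of Python. This is only an illustration of weighted categorical sampling, not the tool's implementation; the function name `sample_categories` and the `seed` argument are hypothetical.

```python
import random
from collections import Counter

def sample_categories(categories, probabilities, n, seed=None):
    """Draw n values from `categories`, weighted by `probabilities`."""
    if len(categories) != len(probabilities):
        raise ValueError("categories and probabilities must have the same size")
    rng = random.Random(seed)  # seeded only to make the illustration repeatable
    return rng.choices(categories, weights=probabilities, k=n)

values = sample_categories(["SENT", "RECEIVED"], [0.6, 0.4], n=10_000, seed=0)
print(Counter(values))  # roughly 60% SENT and 40% RECEIVED
```
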
continuous_generator

Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.

Parameters:

  • mean: Double?: Mean of the sampled distribution

  • std: Double?: Standard Deviation

  • min: Double?: Minimum value

  • max: Double?: Maximum value

  • round: Int = 0: If given, output data will be rounded to this number of digits

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
- columns:
  - "amount"
  params:
    type: "continuous_generator"
    mean: 354.21
    std: 98.96
    min: 0.0
quantile_generator

Given a list of probabilities (hist) and bin edges (bin_edges), the output data is sampled from a mixture of uniform distributions: uniform distribution i is chosen with probability hist[i] and spans the interval from bin_edges[i] to bin_edges[i + 1], so bin_edges has one more element than hist. If the parameters are not given, they are fitted from the original data.

Parameters:

  • hist: List<Double>?: Probabilities of each uniform distribution.

  • bin_edges: List<Double>?: Bin edges of each uniform distribution.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
- columns:
  - "amount"
  params:
    type: "quantile_generator"
    hist: [0.3, 0.45, 0.25]
    bin_edges: [-2.1, 0.0, 3.4, 5.6]
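
A mixture of uniform distributions of this kind can be sketched as follows: pick bin i with probability hist[i], then draw uniformly between bin_edges[i] and bin_edges[i + 1]. This is a minimal illustration under those stated rules, not the tool's code; `sample_quantile` and `seed` are hypothetical names.

```python
import random

def sample_quantile(hist, bin_edges, n, seed=None):
    """Sample n values from a mixture of uniforms defined by hist and bin_edges."""
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must have exactly one more entry than hist")
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick bin i with probability hist[i] ...
        i = rng.choices(range(len(hist)), weights=hist, k=1)[0]
        # ... then draw uniformly between its two edges.
        samples.append(rng.uniform(bin_edges[i], bin_edges[i + 1]))
    return samples

# Three bins with probabilities 0.3, 0.45 and 0.25 need four edges.
samples = sample_quantile([0.3, 0.45, 0.25], [-2.1, 0.0, 3.4, 5.6], n=10_000, seed=0)
print(min(samples), max(samples))
```
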
date_generator

Output data is sampled from a parameterized continuous distribution and transformed into dates. If the parameters are not given, they are extracted from the original data.

Parameters:

  • mean: ZonedDateTime?: Average date of the sampled distribution

  • std: Int?: Standard deviation in milliseconds

  • min: ZonedDateTime?: Minimum value

  • max: ZonedDateTime?: Maximum value

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: DATE

Supports multiple columns: No

Example:

column_params:
- columns:
  - "date_of_birth"
  params:
    type: "date_generator"
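
Because std is expressed in milliseconds, it can help to derive the value from a more readable duration. The helper below is a small convenience sketch; `std_in_millis` is a hypothetical name and not part of the tool.

```python
from datetime import timedelta

def std_in_millis(spread: timedelta) -> int:
    """Convert a duration into whole milliseconds."""
    return spread // timedelta(milliseconds=1)

print(std_in_millis(timedelta(days=30)))  # 2592000000
```
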
formatted_string_generator

Generate a string column based on a given pattern. If the pattern is not given, random characters are generated with a length similar to that of the original column.

Parameters:

  • pattern: String?: Regular expression pattern used to sample data from

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: STRING

Supports multiple columns: No

Example:

column_params:
- columns:
  - "phone_number"
  params:
    type: "formatted_string_generator"
    pattern: "\\+44[0-9]{10}"
int_sequence_generator

Generate a sequence of unique integers, suitable for a unique ID column.

Parameters:

  • start_from: Int = 0: Value at which the sequence starts. Defaults to 0. If the generator is used on top of existing data, set this to the maximum existing value plus 1.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: INTEGER

Supports multiple columns: No

Example:

column_params:
- columns:
  - "user_id"
  params:
    type: "int_sequence_generator"
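
The rule for start_from (maximum of the existing data plus 1) can be illustrated with a short sketch. It assumes the sequence increases in steps of 1, which the description does not state explicitly; `next_start` and `int_sequence` are hypothetical names.

```python
from itertools import count, islice

def next_start(existing_ids):
    """Value to use as start_from: max existing id plus 1, or 0 when there is no data."""
    return max(existing_ids) + 1 if existing_ids else 0

def int_sequence(start_from=0):
    """Endless sequence of integers (assumed step of 1)."""
    return count(start_from)

existing = [3, 7, 5]
new_ids = list(islice(int_sequence(next_start(existing)), 4))
print(new_ids)  # [8, 9, 10, 11]
```

Starting above the existing maximum keeps the generated ids from colliding with ids already present in the column.
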
string_sequence_generator

Generate a sequence of unique strings, suitable for a unique ID column. The strings consist of uppercase letters and digits.

Parameters:

  • length: Int?: Maximum length of the column, extracted from the database DDL if not given

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: STRING

Supports multiple columns: No

Example:

column_params:
- columns:
  - "country_id"
  params:
    type: "string_sequence_generator"
null_generator

The output column is filled with null values.

No parameters.

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Example:

column_params:
- columns:
  - "empty_column"
  params:
    type: "null_generator"
constant_numeric NEW

Generate a single numeric value for the entire column

Parameters:

  • value: Number?: numeric value to generate

  • min: Number?: The lower boundary for the value (inclusive)

  • max: Number?: The upper boundary for the value (exclusive)

Either value or both min and max parameters should be set.

Compatible modes: GENERATION, MASKING

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
- columns: [ "balance" ]
  params:
    type: "constant_numeric"
    value: 0.0

Example (range):

column_params:
- columns: [ "balance" ]
  params:
    type: "constant_numeric"
    min: 0.0
    max: 10000.0
constant_date NEW

Generate a single date value for the entire column

Parameters:

  • value: ZonedDateTime?: date value to generate

  • min: ZonedDateTime?: The lower boundary for the value (inclusive)

  • max: ZonedDateTime?: The upper boundary for the value (exclusive)

Either value or both min and max parameters should be set.

Compatible modes: GENERATION, MASKING

Compatible column data types: DATE

Supports multiple columns: No

Example:

column_params:
- columns: [ "creation_date" ]
  params:
    type: "constant_date"
    value: "2022-07-28T12:21:00Z"

Example (range):

column_params:
- columns: [ "creation_date" ]
  params:
    type: "constant_date"
    min: "2022-07-01T00:00:00Z"
    max: "2022-07-31T23:59:59Z"
constant_string NEW

Generate a single string value for the entire column

Parameters:

  • value: String?: string value to generate

Compatible modes: GENERATION, MASKING

Compatible column data types: STRING

Supports multiple columns: No

Example:

column_params:
- columns: [ "status" ]
  params:
    type: "constant_string"
    value: "ACTIVE"
constant_boolean NEW

Generate a single boolean value for the entire column

Parameters:

  • value: Boolean?: boolean value to generate

Compatible modes: GENERATION, MASKING

Compatible column data types: BOOLEAN

Supports multiple columns: No

Example:

column_params:
- columns: [ "is_active" ]
  params:
    type: "constant_boolean"
    value: true
person_generator

Generate personal fields (e.g. name, surname, title) and keep them consistent across columns.

Parameters:

  • column_templates: List<String>: For each column, the template used to generate personal data.

  • consistent_with_column: String?: If given, the column that the generated values must be consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name.

  • locale: String = 'en-GB': Locale used to generate names from a specific geographical area. Defaults to 'en-GB', which corresponds to British names.

Available templates are:

  • ${email}

  • ${first_name}

  • ${male_first_name}

  • ${female_first_name}

  • ${last_name}

Supported locales:

bg

ca

ca-CAT

da-DK

de

de-AT

de-CH

en

en-AU

en-au-ocker

en-BORK

en-CA

en-GB

en-IND

en-MS

en-NEP

en-NG

en-NZ

en-PAK

en-SG

en-UG

en-US

en-ZA

es

es-MX

fa

fi-FI

fr

he

hu

in-ID

it

ja

ko

nb-NO

nl

pl

pt

pt-BR

ru

sk

sv

sv-SE

tr

uk

vi

zh-CN

zh-TW

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: STRING

Supports multiple columns: Yes

Example for several columns:

column_params:
  - columns: ["first_name", "last_name"]
    params:
      type: "person_generator"
      column_templates: ["${first_name}", "${last_name}"]

Example for a single column:

column_params:
  - columns: ["full_name"]
    params:
      type: "person_generator"
      column_templates: ["${first_name} ${last_name}"]
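
How one generated person fills several column templates consistently can be sketched with Python's string.Template, which happens to use the same ${name} placeholder syntax. The tool's own template engine may behave differently; `render_row` and the sample person are hypothetical.

```python
from string import Template

def render_row(column_templates, person):
    """Fill every column template from the same person record."""
    return [Template(t).substitute(person) for t in column_templates]

# Hypothetical generated person; real values depend on the chosen locale.
person = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada.lovelace@example.com"}

print(render_row(["${first_name}", "${last_name}"], person))  # ['Ada', 'Lovelace']
print(render_row(["${first_name} ${last_name}"], person))     # ['Ada Lovelace']
```
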
address_generator

Generate address fields (e.g. street, zip code) and keep them consistent across columns.

Parameters:

  • column_templates: List<String>: For each column, the template used to generate address data.

  • consistent_with_column: String?: If given, the column that the generated values must be consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street.

  • locale: String = 'en-GB': Locale used to generate addresses from a specific geographical area. Defaults to 'en-GB', which corresponds to British addresses.

Available templates are:

  • ${zip_code}

  • ${country}

  • ${city}

  • ${street_name}

  • ${house_number}

  • ${flat_number}

Supported locales:

bg

ca

ca-CAT

da-DK

de

de-AT

de-CH

en

en-AU

en-au-ocker

en-BORK

en-CA

en-GB

en-IND

en-MS

en-NEP

en-NG

en-NZ

en-PAK

en-SG

en-UG

en-US

en-ZA

es

es-MX

fa

fi-FI

fr

he

hu

in-ID

it

ja

ko

nb-NO

nl

pl

pt

pt-BR

ru

sk

sv

sv-SE

tr

uk

vi

zh-CN

zh-TW

Compatible modes: GENERATION, MASKING, KEEP

Compatible column data types: STRING

Supports multiple columns: Yes

Example for several columns:

column_params:
  - columns: ["street_name", "zip_code"]
    params:
      type: "address_generator"
      column_templates: ["${street_name}", "${zip_code}"]

Example for a single column:

column_params:
  - columns: ["address"]
    params:
      type: "address_generator"
      column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]

Masking transformers

Masking transformers can only be applied in KEEP and MASKING modes. They produce a one-to-one transformation: each input row yields exactly one output row.

passthrough

The output data is equal to the input, no transformation is applied.

No parameters.

Compatible modes: MASKING, KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Example:

column_params:
  - columns: ["customer_number", "plate"]
    params:
      type: "passthrough"
format_preserving_hashing NEW

A hash transformation is applied to each character of the input text that belongs to one of the configured groups, so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.

Parameters:

  • groups: List<FormatPreservingHashingGroup>: Pairs of a selector and a list of alphabets. The selector chooses which characters of the input string to replace; the alphabets are the sets of characters used as replacements.

  • filter: Restricts masking to a specified part of the string and keeps all other characters as is (e.g., mask only the last 5 characters).

Available character selectors:

  • numeric

  • lower_letters

  • upper_letters

  • regex

Available alphabets:

  • numeric

  • lower_letters

  • upper_letters

  • custom

Available filters:

  • first - Mask only first n characters.

  • last - Mask only last n characters.

  • characters - Mask only the specified characters. Parameters: characters - set of characters to mask, ignore_case (default: false) - whether to ignore case when matching.

  • substring - Mask all occurrences of the specified substring. Parameters: substring - substring to mask, ignore_case (default: false) - whether to ignore case when matching.

  • regex - Mask only characters matching the specified regex pattern. Parameters: pattern - regex pattern used to find characters to mask, ignore_case (default: false) - whether to ignore case when matching.

Compatible modes: MASKING, KEEP

Compatible column data types: STRING

Supports multiple columns: No

Examples:

Default configuration:

column_params:
  - columns: ["registration_number"]
    params:
      type: "format_preserving_hashing"
      groups:
        - selector:
            type: "digits"
          alphabets:
            - type: "digits"
        - selector:
            type: "lower_letters"
          alphabets:
            - type: "lower_letters"
        - selector:
            type: "upper_letters"
          alphabets:
            - type: "upper_letters"

Mask only last 5 characters:

column_params:
  - columns: ["registration_number"]
    params:
      type: "format_preserving_hashing"
      filter:
        type: "last"
        n: 5

Mask only substring ignoring case:

column_params:
  - columns: ["registration_number"]
    params:
      type: "format_preserving_hashing"
      filter:
        type: "substring"
        substring: "sub"
        ignore_case: true

Mask only a set of characters ignoring case:

column_params:
  - columns: ["registration_number"]
    params:
      type: "format_preserving_hashing"
      filter:
        type: "characters"
        characters: "abc"
        ignore_case: true
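
The idea of replacing each character with another one from the same class can be sketched with a keyed hash. This is an illustrative HMAC-based sketch only: the actual algorithm used by format_preserving_hashing is not described here, and `fp_hash`, `ALPHABETS` and the key are hypothetical.

```python
import hashlib
import hmac
import string

# One alphabet per character class: digits, lower-case and upper-case letters.
ALPHABETS = [string.digits, string.ascii_lowercase, string.ascii_uppercase]

def fp_hash(text: str, key: bytes) -> str:
    """Replace each digit or letter with a keyed-hash-derived character of the same class."""
    out = []
    for pos, ch in enumerate(text):
        alphabet = next((a for a in ALPHABETS if ch in a), None)
        if alphabet is None:
            out.append(ch)  # character is in no group: keep it as is
            continue
        # Derive the replacement from the key, the position and the whole input.
        digest = hmac.new(key, f"{pos}:{text}".encode(), hashlib.sha256).digest()
        out.append(alphabet[int.from_bytes(digest[:8], "big") % len(alphabet)])
    return "".join(out)

masked = fp_hash("AB12-cd34", key=b"secret-key")
print(masked)  # same length and layout as the input, e.g. two upper-case letters first
```

The same input and key always give the same output in this sketch, and characters outside the three classes (such as the hyphen) pass through unchanged.
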
noising

Add Laplacian noise to the input column to protect privacy while keeping the output values similar to the input.

Parameters:

  • sensitivity: Float: Amount of noise to be added

  • min: Float?: Hard minimum. If given, output values below it are truncated to this value.

  • max: Float?: Hard maximum. If given, output values above it are truncated to this value.

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
  - columns: ["product_price"]
    params:
      type: "noising"
      sensitivity: 23.47
      min: 0
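
Laplacian noising with truncation can be sketched as below. How sensitivity maps onto the Laplace scale is not stated in this section, so the sketch simply uses it as the scale, which is an assumption; `add_laplace_noise` and its argument names are hypothetical.

```python
import random

def add_laplace_noise(values, scale, minimum=None, maximum=None, seed=None):
    """Add zero-mean Laplace noise to each value, then truncate to [minimum, maximum]."""
    rng = random.Random(seed)
    noisy = []
    for v in values:
        # The difference of two i.i.d. exponential draws is Laplace-distributed with mean 0.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        x = v + noise
        if minimum is not None:
            x = max(x, minimum)  # truncate values below the hard minimum
        if maximum is not None:
            x = min(x, maximum)  # truncate values above the hard maximum
        noisy.append(x)
    return noisy

prices = [1.0, 5.0, 100.0]
noisy_prices = add_laplace_noise(prices, scale=23.47, minimum=0, seed=1)
print(noisy_prices)
```
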
redaction

Some characters of the input string are replaced by the same masking character, producing partially masked text in the output.

Parameters:

  • action: str = "KEEP": Whether to KEEP or MASK the characters selected by which. Defaults to KEEP.

  • which: str = "LAST": Which characters (LAST or FIRST) to mask or keep, depending on action.

  • count: int = 4: Number of characters to be masked or kept. Defaults to 4.

  • mask_with: char = '': Character used to mask values. Defaults to ''.

Compatible modes: MASKING, KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
  - columns: ["credit_card"]
    params:
      type: "redaction"
      action: "MASK"
      which: "FIRST"
      count: 4
      mask_with: "#"
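
The interaction of action, which and count can be sketched as follows. This is an illustration of the documented semantics rather than the tool's code; `redact` is a hypothetical function, and mask_with is left without a default because the default character is not legible in this section.

```python
def redact(text, *, mask_with, action="KEEP", which="LAST", count=4):
    """Mask or keep the first/last `count` characters of `text`."""
    if action not in ("KEEP", "MASK") or which not in ("FIRST", "LAST"):
        raise ValueError("unsupported action or which")
    count = min(count, len(text))
    # Mark the selected positions (the first or last `count` characters).
    if which == "FIRST":
        selected = [i < count for i in range(len(text))]
    else:
        selected = [i >= len(text) - count for i in range(len(text))]
    # MASK replaces the selected characters; KEEP replaces all the others.
    mask_selected = action == "MASK"
    return "".join(mask_with if sel == mask_selected else ch
                   for ch, sel in zip(text, selected))

print(redact("1234567812345678", action="MASK", which="FIRST", count=4, mask_with="#"))
# ####567812345678
print(redact("1234567812345678", mask_with="#"))
# ############5678
```
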
unique_hashing

Apply a hash transformation to a given value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

Parameters:

  • max_value: Double?: Maximum value to generate; null means no limit.

  • precision: Int?: Maximum number of digits to generate (e.g. if the value is 3, the maximum value is 999); null means no limit. If both max_value and precision are specified, the smaller of the two limits applies.

Compatible modes: MASKING, KEEP

Compatible column data types: INTEGER

Supports multiple columns: No

Example:

column_params:
  - columns: ["card_id"]
    params:
      type: "unique_hashing"
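
How max_value and precision combine into a single upper limit can be sketched as below. Only the limit calculation is shown, not the hashing itself; `effective_limit` is a hypothetical helper.

```python
def effective_limit(max_value=None, precision=None):
    """Upper limit implied by max_value and precision; None means no limit."""
    limits = []
    if max_value is not None:
        limits.append(max_value)
    if precision is not None:
        limits.append(10 ** precision - 1)  # e.g. precision 3 -> 999
    return min(limits) if limits else None

print(effective_limit(max_value=5000, precision=3))  # 999
print(effective_limit(max_value=500, precision=3))   # 500
print(effective_limit())                             # None
```
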