Transformations List

Generators List

Generators are compatible with all table modes.

categorical_generator

Randomly sample from a given set of categories and probabilities. Categories and probabilities can be provided or learned from the data. If either parameter is given, both are required.

Parameters:

categories: List<String>?: List of categories to sample from. String and boolean types are supported.
probabilities: List<Double>?: Probability of each category (must have the same size as categories)

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: STRING, BOOLEAN

Supports multiple columns: No

Example:

column_params:
- columns:
  - "transaction_type"
  params:
    type: "categorical_generator"
    categories:
      type: string
      values:
      - "SENT"
      - "RECEIVED"
    probabilities:
    - 0.6
    - 0.4
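
The two ways of configuring categorical_generator described above (explicit parameters, or parameters learned from data) can be sketched in Python as follows. This is only an illustration of the sampling idea, not the product's implementation; the function names `sample_categories` and `learn_categories` are hypothetical.

```python
import random
from collections import Counter


def sample_categories(categories, probabilities, n, seed=None):
    """Draw n values from categories, weighted by probabilities."""
    if len(categories) != len(probabilities):
        raise ValueError("categories and probabilities must have the same size")
    rng = random.Random(seed)
    return rng.choices(categories, weights=probabilities, k=n)


def learn_categories(column):
    """Estimate categories and their relative frequencies from observed values."""
    counts = Counter(column)
    total = sum(counts.values())
    categories = list(counts)  # insertion order, i.e. order of first appearance
    return categories, [counts[c] / total for c in categories]


if __name__ == "__main__":
    print(sample_categories(["SENT", "RECEIVED"], [0.6, 0.4], 5, seed=1))
    print(learn_categories(["SENT", "SENT", "RECEIVED", "SENT"]))
```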

continuous_generator

Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.

Parameters:

mean: Double?: Mean of the sampled distribution
std: Double?: Standard Deviation
min: Double?: Minimum value
max: Double?: Maximum value
round: Int = 0: If given, output data will be rounded to this number of digits

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
- columns:
  - "amount"
  params:
    type: "continuous_generator"
    mean: 354.21
    std: 98.96
    min: 0.0
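
As a rough illustration of how mean, std, min, max and round could interact, the sketch below samples values in Python. The source does not name the distribution or say how bounds are enforced; a Gaussian with clipping to the bounds is assumed here purely for illustration, and `sample_continuous` is a hypothetical name.

```python
import random


def sample_continuous(mean, std, n, min_value=None, max_value=None,
                      digits=None, seed=None):
    """Sample n values; Gaussian shape and clipping are assumptions."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.gauss(mean, std)
        if min_value is not None:
            x = max(x, min_value)  # clip to the lower bound (assumed behaviour)
        if max_value is not None:
            x = min(x, max_value)  # clip to the upper bound (assumed behaviour)
        if digits is not None:
            x = round(x, digits)   # corresponds to the "round" parameter
        out.append(x)
    return out


if __name__ == "__main__":
    print(sample_continuous(354.21, 98.96, 5, min_value=0.0, digits=2, seed=3))
```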

quantile_generator

Given a list of probabilities (hist) and bin edges (bin_edges), the output data is sampled from a mixture of uniform distributions, where each uniform distribution i is chosen with probability hist[i] and its edges are given by bin_edges[i] and bin_edges[i + 1]. bin_edges therefore has one more entry than hist. If parameters are not given, they will be fitted from the original data.

Parameters:

hist: List<Double>?: Probabilities of each uniform distribution.
bin_edges: List<Double>?: Bin edges of each uniform distribution.

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
- columns:
  - "amount"
  params:
    type: "quantile_generator"
    hist: [0.3, 0.45, 0.25]
    bin_edges: [-2.1, 0.0, 3.4, 5.6]
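
The mixture-of-uniforms sampling described for quantile_generator can be sketched in Python as below. The function name `sample_quantile` is hypothetical; the logic follows the description directly: pick bin i with its probability, then draw uniformly between the two edges of that bin.

```python
import random


def sample_quantile(hist, bin_edges, n, seed=None):
    """Sample n values from a mixture of uniform distributions."""
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must have exactly one more entry than hist")
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # Choose a bin according to its probability ...
        i = rng.choices(range(len(hist)), weights=hist, k=1)[0]
        # ... then sample uniformly between that bin's edges.
        out.append(rng.uniform(bin_edges[i], bin_edges[i + 1]))
    return out


if __name__ == "__main__":
    print(sample_quantile([0.3, 0.45, 0.25], [-2.1, 0.0, 3.4, 5.6], 5, seed=0))
```

With three probabilities and four edges, 30% of the values fall in [-2.1, 0.0], 45% in [0.0, 3.4] and 25% in [3.4, 5.6] on average.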

date_generator

Output data is sampled from a parameterized continuous distribution and converted into dates. If parameters are not given, they will be extracted from the original data.

Parameters:

mean: LocalDateTime?: Average date of the sampled distribution
std: Int?: Standard deviation in milliseconds
min: LocalDateTime?: Minimum value
max: LocalDateTime?: Maximum value

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: DATE

Supports multiple columns: No

Example:

column_params:
- columns:
  - "date_of_birth"
  params:
    type: "date_generator"

formatted_string_generator

Generate a string column based on a given pattern. If the pattern is not given, random characters are generated with a length similar to that of the original column.

Parameters:

pattern: String?: Regular expression pattern used to sample data from

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: STRING

Supports multiple columns: No

Example:

column_params:
- columns:
  - "phone_number"
  params:
    type: "formatted_string_generator"
    pattern: "\\+44[0-9]{10}"

int_sequence_generator

Generate a sequence of unique integers, suitable for a unique id column.

Parameters:

start_from: Int = 0: Value at which the sequence starts (defaults to 0). If the generator is applied on top of existing data, set this to the maximum of the existing values plus 1.

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: INTEGER

Supports multiple columns: No

Example:

column_params:
- columns:
  - "user_id"
  params:
    type: "int_sequence_generator"
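
The advice on start_from (use the maximum of the existing data plus 1) can be illustrated with a short Python sketch. The helper names `int_sequence` and `next_start` are hypothetical and only show the arithmetic involved.

```python
from itertools import count, islice


def int_sequence(n, start_from=0):
    """Return n consecutive integers beginning at start_from."""
    return list(islice(count(start_from), n))


def next_start(existing_ids):
    """Pick a start_from value that cannot collide with existing ids."""
    return max(existing_ids) + 1 if existing_ids else 0


if __name__ == "__main__":
    existing = [4, 9, 2]
    print(int_sequence(3, start_from=next_start(existing)))
```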

string_sequence_generator

Generate a sequence of unique strings, made of uppercase alphabetic and numeric characters, suitable for a unique id column.

Parameters:

length: Int?: Maximum length of the column, extracted from the database DDL if not given

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: STRING

Supports multiple columns: No

Example:

column_params:
- columns:
  - "country_id"
  params:
    type: "string_sequence_generator"
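
One way to obtain unique strings made of uppercase letters and digits is to count in base 36, as in the Python sketch below. This is an assumption about how such a sequence could be produced, not a description of the actual generator; `encode` and `string_sequence` are hypothetical names, and every value is padded to exactly `length` characters.

```python
import string

# 36 symbols: digits followed by uppercase letters
ALPHABET = string.digits + string.ascii_uppercase


def encode(index, length):
    """Encode a non-negative integer as a fixed-width base-36 string."""
    if index < 0 or index >= len(ALPHABET) ** length:
        raise ValueError("index does not fit in the requested length")
    chars = []
    for _ in range(length):
        index, rem = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def string_sequence(n, length):
    """Return n distinct identifiers of the given length."""
    return [encode(i, length) for i in range(n)]


if __name__ == "__main__":
    print(string_sequence(5, 3))
```

Because each index maps to a different string, uniqueness holds for up to 36 ** length values.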

null_generator

The output column is filled with null values.

No parameters.

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Example:

column_params:
- columns:
  - "empty_column"
  params:
    type: "null_generator"

constant

Generate a single numeric value for the entire column.

Parameters:

value: Number?: Numeric value to generate

Compatible modes: GENERATION MASKING

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
- columns: [ "balance" ]
  params:
    type: "constant"
    value: 0.0

person_generator

Generate personal fields (e.g. name, surname, title) and keep them consistent across columns.

Parameters:

column_templates: List<String>: For each column, the template to be used to generate personal data
consistent_with_column: String?: If given, the column that generated values must be consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name
locale: String = 'en-GB': Locale used to generate names from different geographical areas. Defaults to 'en-GB', which corresponds to British names.

Available templates are:

${email}
${first_name}
${male_first_name}
${female_first_name}
${last_name}

Supported locales:

bg
ca
ca-CAT
da-DK
de
de-AT
de-CH

en
en-AU
en-au-ocker
en-BORK
en-CA
en-GB
en-IND

en-MS
en-NEP
en-NG
en-NZ
en-PAK
en-SG
en-UG

en-US
en-ZA
es
es-MX
fa
fi-FI
fr

he
hu
in-ID
it
ja
ko
nb-NO

nl
pl
pt
pt-BR
ru
sk
sv

sv-SE
tr
uk
vi
zh-CN
zh-TW

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: STRING

Supports multiple columns: Yes

Example for several columns:

column_params:
  - columns: ["first_name", "last_name"]
    params:
      type: "person_generator"
      column_templates: ["${first_name}", "${last_name}"]

Example for a single column:

column_params:
  - columns: ["full_name"]
    params:
      type: "person_generator"
      column_templates: ["${first_name} ${last_name}"]
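
The combination of column_templates and consistent_with_column can be sketched in Python as below. Everything here is illustrative: the name lists are placeholder data, `person_for` and `render_row` are hypothetical helpers, only three of the templates are covered, and deriving the person from a hash of the key is an assumption about how consistency could be achieved. The `${...}` placeholders happen to match the syntax of Python's `string.Template`.

```python
import hashlib
import random
from string import Template

# Placeholder data for illustration only
FIRST_NAMES = ["Alice", "Bob", "Carol"]
LAST_NAMES = ["Smith", "Jones", "Taylor"]


def person_for(key):
    """Derive one person from a key so the same key gives the same person."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    email = f"{first}.{last}@example.com".lower()
    return {"first_name": first, "last_name": last, "email": email}


def render_row(column_templates, key):
    """Fill every column template from the same generated person."""
    person = person_for(key)
    return [Template(t).substitute(person) for t in column_templates]


if __name__ == "__main__":
    print(render_row(["${first_name}", "${last_name}"], key=42))
    print(render_row(["${first_name} ${last_name}"], key=42))
```

Both calls use the same key, so the full name in the second row is built from the same first and last name as the first row.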

address_generator

Generate address fields (e.g. street, zip code) and keep them consistent across columns.

Parameters:

column_templates: List<String>: For each column, the template to be used to generate address data
consistent_with_column: String?: If given, the column that generated values must be consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street
locale: String = 'en-GB': Locale used to generate addresses from different geographical areas. Defaults to 'en-GB', which corresponds to addresses in Great Britain.

Available templates are:

${zip_code}
${country}
${city}
${street_name}
${house_number}
${flat_number}

Supported locales:

bg
ca
ca-CAT
da-DK
de
de-AT
de-CH

en
en-AU
en-au-ocker
en-BORK
en-CA
en-GB
en-IND

en-MS
en-NEP
en-NG
en-NZ
en-PAK
en-SG
en-UG

en-US
en-ZA
es
es-MX
fa
fi-FI
fr

he
hu
in-ID
it
ja
ko
nb-NO

nl
pl
pt
pt-BR
ru
sk
sv

sv-SE
tr
uk
vi
zh-CN
zh-TW

Compatible modes: GENERATION MASKING KEEP

Compatible column data types: STRING

Supports multiple columns: Yes

Example for several columns:

column_params:
  - columns: ["street_name", "zip_code"]
    params:
      type: "address_generator"
      column_templates: ["${street_name}", "${zip_code}"]

Example for a single column:

column_params:
  - columns: ["address"]
    params:
      type: "address_generator"
      column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]

Masking Transformers

Masking transformers can only be applied in KEEP and MASKING modes. When these transformers are applied, the output contains a one-to-one transformation of each input row.

passthrough

The output data is equal to the input, no transformation is applied.

No parameters.

Compatible modes: MASKING KEEP

Compatible column data types: ANY

Supports multiple columns: Yes

Example:

column_params:
  - columns: ["customer_number", "plate"]
    params:
      type: "passthrough"

format_preserving_hashing

A hash transformation is applied to each alphanumeric character in a given text so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.

No parameters.

Compatible modes: MASKING KEEP

Compatible column data types: STRING

Supports multiple columns: No

Example:

column_params:
  - columns: ["registration_number"]
    params:
      type: "format_preserving_hashing"
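
To make the idea of a format-preserving hash concrete, the Python sketch below replaces each ASCII digit with a digit and each letter with a letter of the same case, and leaves other characters untouched. It is a simplified illustration, not the algorithm used by the product: the secret `key` argument, the use of HMAC-SHA256 and the function name `format_preserving_hash` are all assumptions, and an individual character may occasionally map to itself.

```python
import hashlib
import hmac
import string

ALPHABETS = (string.digits, string.ascii_uppercase, string.ascii_lowercase)


def format_preserving_hash(text, key):
    """Hash text while keeping digits, upper and lower case in place."""
    out = []
    for i, ch in enumerate(text):
        # Keyed hash of the position and the whole value (assumed scheme)
        mac = hmac.new(key, f"{i}:{text}".encode("utf-8"), hashlib.sha256).digest()
        n = int.from_bytes(mac[:8], "big")
        for alphabet in ALPHABETS:
            if ch in alphabet:
                out.append(alphabet[n % len(alphabet)])
                break
        else:
            out.append(ch)  # separators and other symbols are kept as-is
    return "".join(out)


if __name__ == "__main__":
    print(format_preserving_hash("AB-1234-xy", b"secret"))
```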

noising

Add Laplacian noise to the input column to protect privacy while keeping output values similar to the input.

Parameters:

sensitivity: Float: Amount of noise to be added
min: Float?: If there is a hard minimum, output values smaller than it are truncated to it
max: Float?: If there is a hard maximum, output values greater than it are truncated to it

Compatible modes: MASKING KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
  - columns: ["product_price"]
    params:
      type: "noising"
      sensitivity: 23.47
      min: 0
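
The effect of sensitivity, min and max can be sketched in Python as follows. The sketch assumes that sensitivity is used directly as the scale of the Laplace distribution, which the text above does not state; `add_laplace_noise` is a hypothetical name. Laplace noise is drawn as the difference of two exponential samples, a standard construction.

```python
import random


def add_laplace_noise(values, sensitivity, min_value=None, max_value=None,
                      seed=None):
    """Add Laplace(0, sensitivity) noise, then truncate to the bounds."""
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    rng = random.Random(seed)
    rate = 1.0 / sensitivity
    out = []
    for v in values:
        # Difference of two exponentials with the same rate is Laplace noise
        noisy = v + rng.expovariate(rate) - rng.expovariate(rate)
        if min_value is not None:
            noisy = max(noisy, min_value)
        if max_value is not None:
            noisy = min(noisy, max_value)
        out.append(noisy)
    return out


if __name__ == "__main__":
    print(add_laplace_noise([19.99, 250.0, 3.5], sensitivity=23.47,
                            min_value=0, seed=1))
```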

redaction

Some characters in the input are replaced by a single masking character, producing partially masked text in the output.

Parameters:

action: str = "KEEP": Whether to KEEP or MASK the characters selected by which (defaults to KEEP)
which: str = "LAST": Which characters (LAST or FIRST) to mask or keep, depending on action
count: int = 4: Number of characters to be masked or kept (defaults to 4)
mask_with: char = '*': Character used to mask values (defaults to '*')

Compatible modes: MASKING KEEP

Compatible column data types: NUMERIC

Supports multiple columns: No

Example:

column_params:
  - columns: ["credit_card"]
    params:
      type: "redaction"
      action: "MASK"
      which: "FIRST"
      count: 4
      mask_with: "#"
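
How action, which and count combine can be sketched in Python as below. The function `redact` is a hypothetical illustration that treats the value as text; it is not the product's implementation.

```python
def redact(text, action="KEEP", which="LAST", count=4, mask_with="*"):
    """Mask or keep the first or last `count` characters of text."""
    if action not in ("KEEP", "MASK"):
        raise ValueError("action must be KEEP or MASK")
    if which not in ("FIRST", "LAST"):
        raise ValueError("which must be FIRST or LAST")
    n = min(count, len(text))
    if which == "FIRST":
        selected = [i < n for i in range(len(text))]
    else:
        selected = [i >= len(text) - n for i in range(len(text))]
    # MASK hides the selected characters; KEEP hides everything else
    mask_selected = action == "MASK"
    return "".join(
        mask_with if sel == mask_selected else ch
        for ch, sel in zip(text, selected)
    )


if __name__ == "__main__":
    print(redact("1234567812345678"))
    print(redact("1234567812345678", action="MASK", which="FIRST",
                 count=4, mask_with="#"))
```

With the defaults, only the last four characters stay visible; with MASK and FIRST, only the first four are hidden.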

unique_id_hashing

Apply a hash transformation to a given value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.

This transformation is applied to primary and foreign keys by default in MASKING mode.

No parameters.

Compatible modes: MASKING KEEP

Compatible column data types: INTEGER

Supports multiple columns: No

Example:

column_params:
  - columns: ["card_id"]
    params:
      type: "unique_id_hashing"
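
The two properties stated above (same input and key give the same output, and outputs are unique) are what a keyed permutation provides. The Python sketch below uses a small Feistel network over 64-bit integers to illustrate this; it is an assumption chosen for illustration, not the algorithm the product uses, and `hash_id` is a hypothetical name. Note that outputs span the full unsigned 64-bit range.

```python
import hashlib
import hmac


def _round_value(half, key, round_no):
    """Keyed round function: 32-bit output derived from HMAC-SHA256."""
    msg = round_no.to_bytes(1, "big") + half.to_bytes(4, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:4], "big")


def hash_id(value, key, rounds=4):
    """Map a 64-bit id to another 64-bit id, one-to-one for a fixed key."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value must fit in 64 unsigned bits")
    left, right = value >> 32, value & 0xFFFFFFFF
    for r in range(rounds):
        # A Feistel round is invertible whatever the round function is,
        # so distinct inputs can never collide.
        left, right = right, left ^ _round_value(right, key, r)
    return (left << 32) | right


if __name__ == "__main__":
    key = b"secret"
    print([hash_id(i, key) for i in range(3)])
```

Because the same key yields the same mapping everywhere, a primary key and the foreign keys that reference it stay aligned after hashing.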