Transformations
optional Object.
Parameters of a transformation. Every transformation has a type key that holds the type name of the transformation; all other parameters are transformation-specific.
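Depending on the type property value, can be one of the following:
Key | Link | Modes | Data types | Multiple columns
categorical_generator | Categorical generator | GENERATION, MASKING, KEEP | ANY | Yes
conditional_generator | Conditional generator | GENERATION, MASKING, KEEP | ANY | Yes
continuous_generator | Continuous generator | GENERATION, MASKING, KEEP | NUMERIC | No
quantile_generator | Quantile generator | GENERATION, MASKING, KEEP | NUMERIC | No
copy_parent_generator | Copy parent generator | GENERATION, MASKING, KEEP | ANY | Yes
foreign_key_generator | Foreign key generator | GENERATION, MASKING, KEEP | ANY | Yes
unique_generator | Unique generator | GENERATION, MASKING, KEEP | ANY | Yes
format_preserving_hashing | Format preserving hashing | MASKING, KEEP | TEXT | No
formatted_string_generator | Formatted string generator | GENERATION, MASKING, KEEP | ANY | No
int_sequence_generator | Integers sequence generator | GENERATION, MASKING, KEEP | NUMERIC | No
string_sequence_generator | String sequence generator | GENERATION, MASKING, KEEP | TEXT | No
noising | Noising transformation | MASKING, KEEP | NUMERIC | No
null_generator | Null generator | GENERATION, MASKING, KEEP | ANY | Yes
passthrough | Passthrough transformation | MASKING, KEEP | ANY | Yes
person_generator | Person generator | GENERATION, MASKING, KEEP | TEXT | Yes
address_generator | Address generator | GENERATION, MASKING, KEEP | TEXT | Yes
finance_generator | Finance generator | GENERATION, MASKING, KEEP | TEXT | Yes
redaction | Redaction masker | MASKING, KEEP | TEXT | No
unique_hashing | Unique hashing | MASKING, KEEP | NUMERIC | No
date_time_unique_hashing | Date unique hashing | MASKING, KEEP | DATE | No
date_generator | Date generator | GENERATION, MASKING, KEEP | DATE | No
uuid_generator | UUID generator | GENERATION, MASKING, KEEP | ANY | No
constant_numeric | Constant numeric generator | GENERATION, MASKING, KEEP | NUMERIC | No
constant_string | Constant string generator | GENERATION, MASKING, KEEP | TEXT | No
constant_date | Constant date generator | GENERATION, MASKING, KEEP | DATE | No
constant_boolean | Constant boolean generator | GENERATION, MASKING, KEEP | BOOLEAN | No
loop_generator | Loop generator | GENERATION, MASKING, KEEP | ANY | Yes
constant_xml | Constant XML generator | GENERATION, MASKING, KEEP | ANY | No
json_pointer_transformer | JSON Pointer Transformer | MASKING, KEEP | ANY | No
xpath_transformer | Xml XPath Transformer | MASKING, KEEP | ANY | No
void_generator | Void Generator | GENERATION, MASKING, KEEP | ANY | Yes
scripting_transformer | Scripting Transformer | GENERATION, MASKING, KEEP | ANY | Yes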
Categorical generator
Randomly samples values from a given key/value mapping of categories to weights. Categories and weights can be provided from various data sources or learned from the data.
The following example provides the categories and their weights inline:
transformations:
  - columns:
      - transaction_type
    params:
      type: categorical_generator
      categories:
        value_source: PROVIDED
        values:
          "sent": 0.5
          "received": 0.4
          "skipping_value": 0.0
          "null": 0.1
You can also specify a CSV file as the source of categories. The following example shows the complete configuration:
transformations:
  - columns:
      - transaction_type
    params:
      type: categorical_generator
      categories:
        value_source: CSV_FILE
        path: src/e2e/resources/data_with_header.csv
        null_values: ["null", ""]
        format:
          columns:
            column_accessor_type: NAME
            categories: "title"
            weights: "rank"
          encoding: "UTF-8"
          delimiter: ","
          trim: true
For simple scenarios, a minimal configuration that relies on the default values of the optional settings may be sufficient:
transformations:
  - columns:
      - transaction_type
    params:
      type: categorical_generator
      categories:
        value_source: CSV_FILE
        path: src/e2e/resources/data.csv
This transformation can be applied to several columns at once. The following example provides tuples of categories that always appear together across the columns:
transformations:
  - columns:
      - productcode
      - productname
    params:
      type: categorical_generator
      categories:
        value_source: MULTIPLE_PROVIDED
        null_values: ["nil"]
        category_values:
          - values:
              productcode: "P1"
              productname: "Product 1"
            weight: 0.5
          - values:
              productcode: "P2"
              productname: "Product 2"
            weight: 0.3
          - values:
              productcode: "P3"
              productname: "Product 3"
            weight: 0.2
          - values:
              productcode: "nil"
              productname: "nil"
            weight: 0.5
You can also specify a CSV file as the source of categories for multiple columns:
transformations:
  - columns:
      - productcode
      - productname
    params:
      type: categorical_generator
      categories:
        value_source: MULTIPLE_CSV_FILE
        path: src/e2e/resources/data_multi.csv
Example of an advanced configuration with multiple columns and multiple categories:
transformations:
  - columns:
      - productcode
      - productname
    params:
      type: categorical_generator
      categories:
        value_source: MULTIPLE_CSV_FILE
        path: src/e2e/resources/data_with_header_multi.csv
        null_values: ["null", ""]
        format:
          columns:
            column_accessor_type: NAME
            categories:
              productcode: "code"
              productname: "name"
            weights: "rank"
          encoding: "UTF-8"
          delimiter: ","
          trim: true
Note: This generator normalizes the weights to the probability interval, so they do not have to sum to 1. If the sum of the weights exceeds the capacity of the double format, use weights with a smaller scale factor.
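The normalization described in the note can be pictured with a small Python sketch. It assumes that normalizing means dividing each weight by the sum of all weights; the function names are hypothetical and this is not the tool's implementation.

```python
import random


def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1.0 (assumed meaning of 'normalize')."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {category: weight / total for category, weight in weights.items()}


def sample_category(weights, rng):
    """Draw one category with probability proportional to its weight."""
    categories = list(weights)
    return rng.choices(categories, weights=[weights[c] for c in categories], k=1)[0]


# The weights of the MULTIPLE_PROVIDED example sum to 1.5, not 1.0.
raw = {"P1": 0.5, "P2": 0.3, "P3": 0.2, "nil": 0.5}
probabilities = normalize_weights(raw)
```

A category with weight 0.0, such as "skipping_value" in the first example, is never drawn under this scheme.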
Properties
- type = categorical_generator
- categories: Categories source configuration.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Conditional generator
Uses one of two transformations/generators depending on the value of a given field of the parent table. For example, a conditional generator can apply different generators depending on the value of the "gender" column of the parent table.
Example:
transformations:
  - columns: [ "status" ]
    params:
      type: conditional_generator
      conditional_table: "public.delivery"
      conditional_column: "status"
      conditional_value: "DONE"
      if_true:
        type: constant_string
        value: "CLOSED"
      if_false:
        type: constant_string
        value: "OPEN"
Properties
- type = conditional_generator
- conditional_column: String. Parent column.
- conditional_table: optional String. Parent table.
- conditional_value: String. Value to compare with. If the value of the parent column is equal to conditional_value, the if_true generator is used; otherwise the if_false generator is used.
- if_false: Transformations.
- if_true: Transformations.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Continuous generator
Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.
Example:
transformations:
  - columns:
      - "amount"
    params:
      type: continuous_generator
      mean: 354.21
      std: 98.96
      min: 0.0
Properties
- type = continuous_generator
- mean: optional Number (double). Mean of the sampled distribution.
- std: optional Number (double). Standard deviation of the sampled distribution.
- min: optional Number (double). Minimum value.
- max: optional Number (double). Maximum value.
- numeric_type: Numeric type.
- round: Integer. If given, output data will be rounded to this number of digits.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
Quantile generator
Given a list of probabilities (hist) and bin edges (bin_edges), the output data is sampled from a mixture of uniform distributions, where each uniform distribution i is chosen with probability hist[i] and its edges are given by bin_edges[i] and bin_edges[i + 1]. If parameters are not given, they will be fitted from the original data.
Example:
transformations:
  - columns: ["amount"]
    params:
      type: quantile_generator
      hist: [0.5, 0.2, 0.3]
      bin_edges: [0.1, 0.15, 0.3, 0.45]
      numeric_type: DOUBLE
Properties
- type = quantile_generator
- hist: optional array of Number (double). Probabilities of each uniform distribution.
- bin_edges: optional array of Number (double). Bin edges of each uniform distribution.
- numeric_type: Numeric type.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
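The mixture-of-uniforms sampling described for the quantile generator can be sketched in Python as follows. This is only an illustration of the stated rule (bin i is chosen with probability hist[i], then a value is drawn uniformly between bin_edges[i] and bin_edges[i + 1]); the function name is hypothetical and the tool's actual implementation may differ.

```python
import random


def sample_quantile(hist, bin_edges, rng):
    """Pick bin i with probability hist[i], then sample uniformly inside that bin."""
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must contain exactly one more element than hist")
    i = rng.choices(range(len(hist)), weights=hist, k=1)[0]
    return rng.uniform(bin_edges[i], bin_edges[i + 1])


# Parameters taken from the quantile_generator example.
rng = random.Random(42)
samples = [
    sample_quantile([0.5, 0.2, 0.3], [0.1, 0.15, 0.3, 0.45], rng)
    for _ in range(1000)
]
```

With these parameters roughly half of the samples fall into the narrow first bin between 0.1 and 0.15.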
Copy parent generator
Copies values from the parent table. Can be used for de-normalization of the database, e.g., for copying the address or phone_number field from the customers table to the orders table.
Example:
transformations:
  - columns: [ "phone_number" ]
    params:
      type: copy_parent_generator
      parent_tables: [ "public.employees" ]
      parent_columns: [ "phone_number" ]
Properties
- type = copy_parent_generator
- parent_columns: array of String. Columns to copy the values from.
- parent_tables: array of String. Tables to copy the values from.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Foreign key generator
Fills columns with the primary key values of a random row of the parent table.
Note: Normally this generator is created implicitly wherever data generation for tables related by foreign keys is needed, so users should not need to set it up explicitly.
Properties
- type = foreign_key_generator
- distribution: Foreign key distribution.
- parent_data_mode: Parent data mode.
- referred_schema: optional String.
- referred_table: optional String.
- referred_fields: optional array of String.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Unique generator
This generator is intended for the case where primary key values are part of the foreign key.
Note: Normally this generator is created implicitly wherever data generation for tables related by foreign keys is needed, so users should not need to set it up explicitly.
Properties
- type = unique_generator
- transformations: optional array of Column transformation parameters.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Format preserving hashing
A hash transformation is applied to each character of a given text that belongs to a configured group, so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.
Examples:
Default configuration:
transformations:
  - columns: ["registration_number"]
    params:
      type: format_preserving_hashing
      groups:
        - selector:
            type: digits
          alphabets:
            - type: digits
        - selector:
            type: lower_letters
          alphabets:
            - type: lower_letters
        - selector:
            type: upper_letters
          alphabets:
            - type: upper_letters
Mask only the last 5 characters:
transformations:
  - columns: ["registration_number"]
    params:
      type: format_preserving_hashing
      filter:
        type: "last"
        n: 5
Mask only a substring, ignoring case:
transformations:
  - columns: ["registration_number"]
    params:
      type: format_preserving_hashing
      filter:
        type: substring
        substring: sub
        ignore_case: true
Mask only a set of characters, ignoring case:
transformations:
  - columns: ["registration_number"]
    params:
      type: format_preserving_hashing
      filter:
        type: characters
        characters: "abc"
        ignore_case: true
Mask characters selected by a regex with a custom alphabet:
transformations:
  - columns: ["phone_number"]
    params:
      type: format_preserving_hashing
      groups:
        - selector:
            type: regex
            pattern: "[123]"
          alphabets:
            - type: custom
              parts:
                - type: characters
                  characters: "456"
                - type: characters
                  characters: "789"
                - type: unicode_block
                  name: LATIN_EXTENDED_D
                - type: unicode_block
                  name: "Latin Extended-A"
                - type: unicode_range
                  from: 0x0D00
                  to: 0x0D7F
Properties
- type = format_preserving_hashing
- groups: array of Hashing group.
  Hashing groups to apply on top of the specified filter. Multiple groups can be configured. In that case, the groups are tried against each region of the filtered value in the order in which they are specified in the configuration. Once a group matches, its alphabet is used for the transformation and no other groups are tried for that region. This implies that the most specific hashing groups must be specified first. An unspecified parameter or null is equivalent to the following:
transformations:
  - columns: ["registration_number"]
    params:
      type: format_preserving_hashing
      groups:
        - selector:
            type: digits
          alphabets:
            - type: digits
        - selector:
            type: lower_letters
          alphabets:
            - type: lower_letters
        - selector:
            type: upper_letters
          alphabets:
            - type: upper_letters
        - selector:
            type: word_characters
          alphabets:
            - type: digits
            - type: lower_letters
            - type: upper_letters
- filter: Format preserving hashing filter.
Compatible modes: MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: No
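To make the idea of format preservation concrete, here is a minimal Python sketch. It assumes a keyed HMAC-SHA256 over the position and the whole input, and three fixed ASCII alphabets; the function name, the key handling, and the choice of hash are all assumptions made for illustration and do not describe the tool's actual algorithm.

```python
import hashlib
import hmac
import string

# Three simple alphabets, mirroring the digits / lower_letters / upper_letters groups.
ALPHABETS = (string.digits, string.ascii_lowercase, string.ascii_uppercase)


def hash_preserving_format(text, key):
    """Replace each character with one from the same alphabet, chosen by a keyed hash."""
    out = []
    for position, char in enumerate(text):
        alphabet = next((a for a in ALPHABETS if char in a), None)
        if alphabet is None:
            out.append(char)  # separators and other characters are left untouched
            continue
        message = f"{position}:{text}".encode("utf-8")
        digest = hmac.new(key, message, hashlib.sha256).digest()
        out.append(alphabet[int.from_bytes(digest[:8], "big") % len(alphabet)])
    return "".join(out)


masked = hash_preserving_format("AB-1234 xy", b"secret")
```

The output has the same length and the same character classes at each position as the input, and the same input with the same key always yields the same output.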
Formatted string generator
Generates a string column based on a given pattern. If the pattern is not given, random characters are generated with a length similar to that of the original column.
Example:
transformations:
  - columns:
      - "phone_number"
    params:
      type: formatted_string_generator
      pattern: "\\+44[0-9]{10}"
Properties
- type = formatted_string_generator
- pattern: optional String. Regular expression pattern used to sample data from.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: No
Integers sequence generator
Generates a sequence of unique integers, suitable for a unique ID column.
Properties
- type = int_sequence_generator
- start_from: optional Integer. Where to start the sequence from. Defaults to 0. If the generator is used on existing data, this should be set to the maximum of the existing values plus 1.
Example:
transformations:
  - columns:
      - "user_id"
    params:
      type: int_sequence_generator
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
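The rule for choosing start_from on existing data (maximum of the existing values plus 1) can be illustrated with a short Python sketch. The helper name is hypothetical; the sketch only shows why this choice avoids collisions with existing IDs.

```python
from itertools import count, islice


def next_start_from(existing_ids):
    """Return max(existing) + 1, or 0 (the documented default) when there is no data."""
    return max(existing_ids, default=-1) + 1


existing = [3, 17, 8]
sequence = count(next_start_from(existing))  # 18, 19, 20, ...
new_ids = list(islice(sequence, 3))
```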
String sequence generator
Generates a sequence of unique strings, suitable for a unique ID column. The values consist of uppercase alphabetic and numeric characters.
Example:
transformations:
  - columns: ["country_id"]
    params:
      type: string_sequence_generator
Properties
- type = string_sequence_generator
- length: optional Integer. Maximum length of the column. Extracted from the database DDL if not given.
- start_from: optional String. Where to start the sequence from. Defaults to an empty string. If the generator is used on existing data, this should be set to the maximum of the existing data with a shift.
- alphabets: array of Alphabet.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: No
Noising transformation
Adds Laplacian noise to the input column in order to protect privacy while keeping the output values similar to the originals.
Example:
transformations:
  - columns: ["product_price"]
    params:
      type: noising
      sensitivity: 23.47
      min: 0
Properties
- type = noising
- sensitivity: optional Number (double). Amount of noise to be added.
- min: optional Number (double). If there is a hard minimum, output values smaller than it are truncated to this value.
- max: optional Number (double). If there is a hard maximum, output values greater than it are truncated to this value.
Compatible modes: MASKING, KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
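A minimal Python sketch of Laplacian noising with truncation follows. It assumes that sensitivity is used directly as the scale of the Laplace distribution, which the text does not confirm; the function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_noise(value, sensitivity, rng, minimum=None, maximum=None):
    """Add Laplacian noise, then truncate the result to the optional hard bounds."""
    noisy = value + laplace_noise(sensitivity, rng)
    if minimum is not None:
        noisy = max(noisy, minimum)
    if maximum is not None:
        noisy = min(noisy, maximum)
    return noisy


rng = random.Random(1)
noisy_prices = [add_noise(5.0, 23.47, rng, minimum=0) for _ in range(500)]
```

Because the input 5.0 is close to the minimum and the noise scale is large, many outputs are truncated to exactly 0 in this example.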
Null generator
The output column is filled with null values.
Example:
transformations:
  - columns: ["empty_column"]
    params:
      type: null_generator
Properties
- type = null_generator
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Passthrough transformation
The output data is equal to the input; no transformation is applied.
Example:
transformations:
  - columns: ["customer_number", "plate"]
    params:
      type: passthrough
Properties
- type = passthrough
Compatible modes: MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Person generator
Generates personal fields (e.g., name, surname, title) and keeps them consistent across columns.
Available templates are:
- ${email}
- ${first_name}
- ${male_first_name}
- ${female_first_name}
- ${last_name}
- ${full_name}
- ${username}
- ${company}
- ${phone_national}
- ${phone_international}
- ${ssn}
Supported locales: [locale table not recoverable]
Example for several columns:
transformations:
  - columns: ["first_name", "last_name"]
    params:
      type: person_generator
      column_templates: ["${first_name}", "${last_name}"]
Example for a single column:
transformations:
  - columns: ["full_name"]
    params:
      type: person_generator
      column_templates: ["${first_name} ${last_name}"]
Properties
- type = person_generator
- column_templates: array of String. For each column, the template used to generate personal data.
- consistent_with_column: String. If given, the column that the generated values must be consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name. The "self" value means consistency with the source value.
- locale: optional String. Set this parameter to generate names from a different geographical area. Defaults to 'en-GB', which corresponds to British names.
- length_exceeded_mode: Value length exceeded mode.
- column_lengths: optional array of Integer. Maximum allowed lengths for the columns. Ignored when length_exceeded_mode is "IGNORE".
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: Yes
Address generator
Generates address fields (e.g., street, zip code) and keeps them consistent across columns. Available templates are:
- ${zip_code}
- ${country}
- ${city}
- ${street_name}
- ${house_number}
- ${flat_number}
- ${full_address}
- ${street_address}
- ${region}
- ${latitude}
- ${longitude}
- ${coordinates}
- ${time_zone}
Supported locales: [locale table not recoverable]
Example for several columns:
transformations:
  - columns: [ "street_name", "zip_code" ]
    params:
      type: address_generator
      column_templates: [ "${street_name}", "${zip_code}" ]
Example for a single column:
transformations:
  - columns: ["address"]
    params:
      type: address_generator
      column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]
Properties
- type = address_generator
- column_templates: array of String. For each column, the template used to generate address data.
- consistent_with_column: String. If given, the column that the generated values must be consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street. The "self" value means consistency with the source value.
- locale: optional String. Set this parameter to generate addresses from a different geographical area. Defaults to 'en-GB', which corresponds to addresses in Great Britain.
- length_exceeded_mode: Value length exceeded mode.
- column_lengths: optional array of Integer. Maximum allowed lengths for the columns. Ignored when length_exceeded_mode is "IGNORE".
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: Yes
Finance generator
Generates financial data.
Available templates:
- ${credit_card}
- ${bic}
- ${iban}
- ${nasdaq_ticker}
- ${nyse_ticker}
- ${stock_market}
- ${us_routing_number}
The template credit_card (without a card type qualifier) results in a random card type being picked.
The credit card type can be configured using the following templates:
- ${credit_card.visa}
- ${credit_card.mastercard}
- ${credit_card.discover}
- ${credit_card.american_express}
- ${credit_card.diners_club}
- ${credit_card.jcb}
- ${credit_card.switch}
- ${credit_card.solo}
- ${credit_card.dankort}
- ${credit_card.forbrugsforeningen}
- ${credit_card.laser}
Example:
transformations:
  - columns: [ "credit_card" ]
    params:
      type: finance_generator
      column_templates: [ "${credit_card.visa}" ]
Properties
- type = finance_generator
- column_templates: array of String. For each column, the template used to generate financial data.
- consistent_with_column: String. If given, the column that the generated values must be consistent with. The "self" value means consistency with the source value.
- length_exceeded_mode: Value length exceeded mode.
- column_lengths: optional array of Integer. Maximum allowed lengths for the columns. Ignored when length_exceeded_mode is "IGNORE".
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: Yes
Redaction masker
Some characters of the input string are replaced with the same masking character, producing partially masked text in the output.
Example:
transformations:
  - columns: ["credit_card"]
    params:
      type: redaction
      action: MASK
      which: FIRST
      count: 4
      mask_with: "#"
Properties
- type = redaction
- action: Action.
- which: Position.
- count: Integer. Number of characters to be masked or kept. Defaults to 4.
- mask_with: String. Character used to mask values. Defaults to *.
Compatible modes: MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: No
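The effect of the redaction parameters can be sketched in Python as below. Only action MASK and position FIRST appear in the example above; the KEEP action and the LAST position used here are assumptions inferred from the wording "masked or kept", and the function name is hypothetical.

```python
def redact(value, action="MASK", which="FIRST", count=4, mask_with="*"):
    """Mask (or keep) `count` characters at the start or end of `value`."""
    if action not in ("MASK", "KEEP") or which not in ("FIRST", "LAST"):
        raise ValueError("unsupported action or position")
    n = len(value)
    count = max(0, min(count, n))
    # Positions selected by `which` and `count`.
    selected = set(range(count)) if which == "FIRST" else set(range(n - count, n))
    # MASK hides the selected positions; KEEP hides everything else.
    masked = selected if action == "MASK" else set(range(n)) - selected
    return "".join(mask_with if i in masked else ch for i, ch in enumerate(value))


# Same settings as the configuration example: mask the first 4 characters with "#".
masked_card = redact("4111111111111111", action="MASK", which="FIRST", count=4, mask_with="#")
```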
Unique hashing
Applies a hash transformation to a given value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.
This transformation is applied to primary and foreign keys by default in MASKING mode.
Example:
transformations:
  - columns: ["card_id"]
    params:
      type: unique_hashing
Properties
- type = unique_hashing
- max_value: Number (double). Maximum value to generate; null means no limit.
- precision: Integer. Maximum precision to generate (e.g., if the value is 3, the maximum value is 999); null means no limit. If both max_value and precision are specified, the smaller of the two limits is applied.
Compatible modes: MASKING, KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
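How max_value and precision combine into a single upper limit can be shown with a few lines of Python. This sketch covers only the limit calculation stated in the property descriptions, not the hashing itself; the function name is hypothetical.

```python
def effective_max(max_value=None, precision=None):
    """Return the smaller of max_value and 10**precision - 1, or None if neither is set."""
    limits = []
    if max_value is not None:
        limits.append(max_value)
    if precision is not None:
        limits.append(10 ** precision - 1)  # e.g. precision 3 allows at most 999
    return min(limits) if limits else None
```

For example, precision 3 combined with max_value 5000 gives a limit of 999, while precision 3 combined with max_value 500 gives 500.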
Date unique hashing
Applies a hash transformation to a date-time value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.
This transformation is applied to primary and foreign keys by default in MASKING mode.
Example:
transformations:
  - columns: ["create_date"]
    params:
      type: date_time_unique_hashing
      min: 2000-01-01T12:00:00Z
      max: 2022-01-01T12:00:00Z
Properties
- type = date_time_unique_hashing
- min: optional String (date-time). Minimum value.
- max: optional String (date-time). Maximum value.
Compatible modes: MASKING, KEEP
Compatible column data types: DATE
Supports multiple columns: No
Date generator
Output data is sampled from a parameterized continuous distribution and transformed into dates. If parameters are not given, they will be extracted from the original data.
Example:
transformations:
  - columns:
      - "date_of_birth"
    params:
      type: date_generator
      mean: 2018-02-01T12:00:00Z
      std: 2d 4h 45m 12s 434ms
      min: 2000-01-01T12:00:00Z
      max: 2022-01-01T12:00:00Z
Properties
- type = date_generator
- mean: optional String (date-time). Average date of the sampled distribution.
- std: optional String. Standard deviation. The following formats are accepted:
  - ISO-8601 duration format, e.g., P1DT2H3M4.058S.
  - The concise format, e.g., 10s, 1h 30m or -(1h 30m).
  - Milliseconds without a specific unit, e.g., 12534.
- min: optional String (date-time). Minimum value.
- max: optional String (date-time). Maximum value.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: DATE
Supports multiple columns: No
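As an illustration of the concise duration format accepted by std, the following Python sketch parses values such as "2d 4h 45m 12s 434ms". The set of units (d, h, m, s, ms) is inferred from the examples on this page and is an assumption; the negative form -(1h 30m) and the ISO-8601 form are deliberately not handled, and the function name is hypothetical.

```python
import re
from datetime import timedelta

# Unit suffixes inferred from the examples; "ms" must be tried before "m" and "s".
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds", "ms": "milliseconds"}
_TOKEN = re.compile(r"(\d+)(ms|d|h|m|s)")


def parse_concise_duration(text):
    """Parse space-separated '<number><unit>' tokens into a timedelta."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty duration")
    parts = {}
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported duration token: {token!r}")
        amount, unit = match.groups()
        parts[_UNITS[unit]] = parts.get(_UNITS[unit], 0) + int(amount)
    return timedelta(**parts)


std = parse_concise_duration("2d 4h 45m 12s 434ms")
```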
UUID generator
The output column is filled with UUIDs.
Example:
transformations:
  - columns: ["unique_id"]
    params:
      type: uuid_generator
Properties
- type = uuid_generator
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: No
Constant numeric generator
Generates a single numeric value for the entire column.
Example:
transformations:
  - columns: [ "balance" ]
    params:
      type: constant_numeric
      value: 0.0
Example (range):
transformations:
  - columns: [ "balance" ]
    params:
      type: constant_numeric
      min: 0.0
      max: 10000.0
Properties
- type = constant_numeric
- value: optional Number. Numeric value to generate.
- min: optional Number. The lower boundary for the value (inclusive).
- max: optional Number. The upper boundary for the value (exclusive).
- numeric_type: Numeric type.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
Constant string generator
Generates a single string value for the entire column.
Example:
transformations:
  - columns: [ "status" ]
    params:
      type: constant_string
      value: "ACTIVE"
Properties
- type = constant_string
- value: optional String. String value to generate.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: TEXT
Supports multiple columns: No
Constant date generator
Generates a single date value for the entire column.
Example:
transformations:
  - columns: [ "creation_date" ]
    params:
      type: constant_date
      value: 2022-07-28T12:21:00Z
Example (range):
transformations:
  - columns: [ "creation_date" ]
    params:
      type: constant_date
      min: 2022-07-01T00:00:00Z
      max: 2022-07-31T23:59:59Z
Properties
- type = constant_date
- value: optional String (date-time). Date value to generate.
- min: optional String (date-time). The lower boundary for the value (inclusive).
- max: optional String (date-time). The upper boundary for the value (exclusive).
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: DATE
Supports multiple columns: No
Constant boolean generator
Generates a single boolean value for the entire column.
Example:
transformations:
  - columns: [ "is_active" ]
    params:
      type: constant_boolean
      value: true
Properties
- type = constant_boolean
- value: optional Boolean. Boolean value to generate.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: BOOLEAN
Supports multiple columns: No
Loop generator
Generates a sequence of elements in a loop that can be repeated. Data can be obtained from various sources or generated based on existing data. The following example provides the elements inline:
transformations:
  - columns:
      - transaction_type
    params:
      type: "loop_generator"
      repeatable: true
      source:
        value_source: "PROVIDED"
        values:
          - "sent"
          - "skipping_value"
          - "received"
          - null
This transformation can be applied to several columns at once. The following example provides tuples of values that always appear together across the columns:
transformations:
  - columns:
      - productcode
      - productname
    params:
      type: "loop_generator"
      repeatable: true
      source:
        value_source: "MULTIPLE_PROVIDED"
        values:
          - productcode: "P1"
            productname: "Product 1"
          - productcode: "P2"
            productname: "Product 2"
          - productcode: "P3"
            productname: "Product 3"
          - productcode: null
            productname: null
You can also specify a CSV file as the source of elements. The following example shows the complete configuration:
transformations:
  - columns:
      - productcode
      - productname
    params:
      type: "loop_generator"
      repeatable: true
      source:
        value_source: "CSV_FILE"
        path: src/e2e/resources/data_with_header_multi.csv
        null_values: null
        format:
          encoding: "UTF-8"
          delimiter: ","
          trim: true
          columns:
            column_accessor_type: NAME
            names: ["code", "name"]
Properties
- type = loop_generator
- repeatable: Boolean. If true, the list repeats itself in a loop when necessary. Otherwise, an exception is thrown when the record number exceeds the size of the list.
- source: Source of elements.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
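The behaviour of the repeatable flag can be sketched in Python as follows. The function name and the exception type are hypothetical; the sketch only mirrors the rule described in the properties (cycle through the list when repeatable, fail when the records outnumber the list otherwise).

```python
from itertools import cycle, islice


def loop_values(values, record_count, repeatable):
    """Return `record_count` values, cycling through `values` only when repeatable."""
    if not values:
        raise ValueError("the list of values must not be empty")
    if repeatable:
        return list(islice(cycle(values), record_count))
    if record_count > len(values):
        raise IndexError("record number exceeds the size of the list")
    return list(values[:record_count])


# Values from the PROVIDED example; None stands for the YAML null.
looped = loop_values(["sent", "skipping_value", "received", None], 6, repeatable=True)
```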
Constant XML generator
Generates a single XML value for the entire column.
Example:
transformations:
  - columns: [ "xml_column" ]
    params:
      type: constant_xml
      value: "<root>test</root>"
Properties
- type = constant_xml
- value: String. XML string value to generate.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: No
JSON Pointer Transformer
Transforms JSON value nodes indicated by JSON pointers. The rest of the values are kept as is.
Example:
transformations:
  - columns: ["productspec"]
    params:
      type: "json_pointer_transformer"
      specifications:
        - pointers: [ "/sku" ]
          transformation:
            type: "format_preserving_hashing"
        - pointers: [ "/tags/0" ]
          transformation:
            type: "format_preserving_hashing"
          ignore_errors: true
Properties
- type = json_pointer_transformer
- specifications: array of JSON Pointer Transformer Specification.
Compatible modes: MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: No
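The following Python sketch shows how a value addressed by a JSON Pointer (RFC 6901) could be transformed in place while the rest of the document is kept as is, including an ignore_errors switch. It is an illustration only: the function names are hypothetical, str.lower and str.upper stand in for a real transformation, and the tool's own handling may differ.

```python
import json


def _unescape(token):
    """Decode RFC 6901 escapes: '~1' is '/', '~0' is '~' (in that order)."""
    return token.replace("~1", "/").replace("~0", "~")


def transform_at_pointer(document, pointer, transform, ignore_errors=False):
    """Apply `transform` to the node at `pointer`; leave everything else unchanged."""
    tokens = [_unescape(t) for t in pointer.split("/")[1:]]
    try:
        if not pointer.startswith("/") or not tokens:
            raise KeyError(pointer)
        parent = document
        for token in tokens[:-1]:
            parent = parent[int(token)] if isinstance(parent, list) else parent[token]
        last = int(tokens[-1]) if isinstance(parent, list) else tokens[-1]
        parent[last] = transform(parent[last])
    except (KeyError, IndexError, ValueError, TypeError):
        # Missing node or incompatible type: skip silently only if asked to.
        if not ignore_errors:
            raise
    return document


doc = json.loads('{"sku": "AB-12", "tags": ["x1"], "price": 10}')
transform_at_pointer(doc, "/sku", str.lower)
transform_at_pointer(doc, "/tags/0", str.upper)
transform_at_pointer(doc, "/tags/5", str.upper, ignore_errors=True)  # no such node
```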
Xml XPath Transformer
Transforms XML value nodes indicated by XML XPath. The rest of the values are kept as is.
Example:
transformations:
  - columns: ["productspec"]
    params:
      type: "xpath_transformer"
      specifications:
        - queries: [ "/sku" ]
          transformation:
            type: "format_preserving_hashing"
Properties
- type = xpath_transformer
- specifications: array of Xml XPath Transformer Specification.
- encoding: String. Specifies the encoding format.
Compatible modes: MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: No
Void Generator
An auxiliary transformer that throws an error when called. It is used only when it is necessary to ignore the processing of the entire table.
Properties
- type = void_generator
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Scripting Transformer
The scripting transformer allows you to implement your own logic for both the GENERATION mode and the MASKING mode.
Currently only a JavaScript implementation is supported.
The script for the GENERATION mode must define a lambda function that returns a dictionary whose keys are column names and whose values are the desired values of the record.
If the transformer is applied to a single column, the value may be returned directly instead of a dictionary.
The script for the MASKING mode must define a lambda function with two arguments, ctx and originalRecord.
The following example shows how to use a custom script for multiple columns in MASKING mode:
transformations:
  - columns:
      - textdescription
      - htmldescription
    params:
      type: scripting_transformer
      language: "JAVASCRIPT"
      script:
        code: |
          /**
           * @typedef { Object.<string, *> | * } Result
           *
           * @param {GenerationContext} ctx
           * @param {Record} originalRecord
           * @returns {Result}
           */
          (ctx, originalRecord) => {
            const dict = originalRecord.asMap();
            const textDescriptionColumn = columns.get(0);
            const htmlDescriptionColumn = columns.get(1);
            const descriptionWithoutSpaces = dict.get(textDescriptionColumn).trim();
            return { [textDescriptionColumn]: descriptionWithoutSpaces, [htmlDescriptionColumn]: descriptionWithoutSpaces };
          }
The script for the GENERATION mode should define a lambda function with the single argument ctx.
The following example shows how to use a custom script for GENERATION:
transformations:
  - columns:
      - credit_card
    params:
      type: scripting_transformer
      language: "JAVASCRIPT"
      additional_properties:
        first_credit_card_digit: 4
      init_script:
        code: |
          /**
           * @returns {String}
           */
          function generateRandomCreditCardNumber() {
            let creditCardNumber = additionalProperties["first_credit_card_digit"]
            for (let i = 1; i < 16; i++) {
              const digit = Math.floor(Math.random() * 10);
              creditCardNumber += digit.toString();
            }
            return creditCardNumber;
          }
      script:
        code: |
          /**
           * @typedef { Object.<string, *> | * } Result
           *
           * @param {GenerationContext} ctx
           * @returns {Result}
           */
          (ctx) => generateRandomCreditCardNumber();
Properties
- type = scripting_transformer
- language: Scripting Language.
- script: Script.
- init_script: Script.
- additional_properties: ScriptingAdditionalProperties.
Compatible modes: GENERATION, MASKING, KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Categories source configuration
Used in: categories
optional Object.
The structure depends on the value_source property value. The examples on this page use the values PROVIDED, CSV_FILE, MULTIPLE_PROVIDED and MULTIPLE_CSV_FILE.
Numeric type
Used in: numeric_type (continuous_generator), numeric_type (quantile_generator), numeric_type (constant_numeric)
optional String.
Type of numbers used by a generator.
Enum values:
- INT
- LONG
- DOUBLE
- FLOAT
- BIG_DECIMAL
- BIG_INTEGER
- SHORT
- BYTE
- UNSIGNED_BYTE
- UNSIGNED_INTEGER
- UNSIGNED_LONG
- UNSIGNED_SHORT
Foreign key distribution
Used in: distribution
String.
Distributions:
- POISSON (default): Generates parent-child relations based on the Poisson distribution, where lambda represents the ratio of parent to child count.
- ROUND_ROBIN: Assigns children to parents using a round-robin algorithm.
- ORIGINAL: Preserves the original reference ratio. Note: This option may require more computational resources for recalculation of the model, and the generated distribution may not perfectly match the original one.
Enum values:
- POISSON
- ROUND_ROBIN
- ORIGINAL
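The ROUND_ROBIN option can be pictured with a short Python sketch in which each new child row takes the next parent key in turn. The function name is hypothetical and the sketch ignores details such as composite keys or the parent data mode.

```python
from itertools import cycle


def assign_round_robin(parent_keys, child_count):
    """Return one parent key per child row, cycling through the parents in order."""
    if not parent_keys:
        raise ValueError("at least one parent key is required")
    parents = cycle(parent_keys)
    return [next(parents) for _ in range(child_count)]


child_foreign_keys = assign_round_robin([10, 20, 30], 7)
```

With three parents and seven children, the first parent receives three children and the other two receive two each.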
Parent data mode
Used in: parent_data_mode
optional String.
What part of the parent data to consider for the child table processing. Default is ALL.
Enum values:
- NEW
- OLD
- ALL
Column transformation parameters
Used in: transformations
Object.
List of column names associated with Transformation parameters.
Properties
- columns: array of String. List of columns that are affected by this generator.
- params: Transformations.
Hashing group
Used in: groups
Object.
A pair consisting of a selector and a list of alphabets. The selector chooses characters from the input string; each alphabet is a set of characters used to replace the selected ones.
Properties
- selector: Hashing group selector.
- alphabets: array of Alphabet.
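A minimal sketch of a single hashing group, built only from the properties documented in this reference. The selector choice, the characters, and the code point range are illustrative assumptions rather than defaults, and the enclosing transformer is omitted.

```yaml
# Fragment only: one entry of the groups list.
groups:
  - selector:
      type: first        # select the first N characters of the input
      n: 4
    alphabets:
      - type: custom
        parts:
          - type: characters
            characters: "ABCDEF"
          - type: unicode_range
            from: 48     # 0x0030, the digit 0
            to: 57       # 0x0039, the digit 9
```

With these assumed values, the first four characters of the input would be replaced by characters drawn from the digits and the letters A to F.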
Format preserving hashing filter
Used in: filter
optional Object.
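Depending on the type property value, can be one of the following:
- first: First N characters
- last: Last N characters
- characters: Specified characters
- substring: Specified substring
- regex: Regex filter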
Alphabet
Object.
Depending on the type property value, can be one of several alphabet definitions, such as the custom alphabet (type = custom).
Value length exceeded mode
optional String.
Action to take when a value exceeds the column length.
Modes:
- IGNORE: error if the value exceeds the column length
- TRUNCATE (default): truncate the value to the field length
Enum values
- IGNORE
- TRUNCATE
Source of elements
Used in: source
optional Object.
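Depending on the value_source property value, can be one of the following:
- PROVIDED: elements provided from the configuration
- CSV_FILE: elements provided from a CSV file
- MULTIPLE_PROVIDED: elements for multiple columns provided from the configuration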
JSON Pointer Transformer Specification
Used in: specifications
Properties
- pointers: array of String. JSON Pointers (as specified by RFC 6901).
- transformation: Transformations.
- ignore_errors: Boolean. Controls the behaviour when no JSON node is found at the pointer or the node has a type incompatible with the specified transformer. If this setting is true, the found JSON node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.
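A minimal sketch of one specifications entry, using only the properties listed above. The pointer, the category values, and the choice of categorical_generator as the nested transformation are illustrative assumptions; the enclosing transformer is omitted.

```yaml
# Fragment only: one entry of the specifications list.
specifications:
  - pointers:
      - "/customer/status"     # hypothetical pointer, RFC 6901 syntax
    transformation:
      type: categorical_generator
      categories:
        value_source: PROVIDED
        values:
          "ACTIVE": 3
          "SUSPENDED": 1
    # true: leave a missing or incompatible node unchanged instead of raising an error
    ignore_errors: true
```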
XML XPath Transformer Specification
Used in: specifications
Properties
- queries: array of String. XPath expressions (as specified by https://www.w3.org/TR/xpath-31/).
- transformation: Transformations.
- ignore_errors: Boolean. Controls the behaviour when no XML node is found at the XPath or the node has a type incompatible with the specified transformer. If this setting is true, the found XML node, if any, will remain unchanged. If the setting is false, an error will be raised. Default is false.
Script
Used in: script, init_script
optional Object.
The script should define a lambda function to be called on every record or row, depending on the chosen method.
The init_script is executed once on start. It can be helpful for defining variables and functions that are available to the main script.
Properties
- code: optional String. Script code.
- file: optional String. Script file location. For a local file system, the path can be absolute or relative to the application process’s working directory (not to be confused with working directory). The script can be located on a local file system, AWS S3, or Google Storage.
To load scripts from S3, the property TDK_AWS_ENABLED==true should be set. More details can be found here.
The property TDK_GCP_ENABLED==true allows loading scripts from Google Storage. More details can be found here.
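As a rough illustration of the two ways to supply a script, the fragment below loads the init_script from a file and defines the main script inline. The file path is a hypothetical placeholder, and generateRandomCreditCardNumber is assumed to be defined by that init script, as in the scripting transformer example.

```yaml
# Sketch only; the path and the helper function are assumptions.
init_script:
  # Hypothetical local path, resolved against the process working directory.
  file: scripts/credit_card.js
script:
  code: |
    (ctx) => generateRandomCreditCardNumber();
```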
ScriptingAdditionalProperties
Used in: additional_properties
optional map of String keys to Object.
Properties
- value: Object. Additional properties to be used in the script and init_script. The dictionary variable additionalProperties is available from scripts.
Config categories source
The config source defines categories and weights directly within the configuration.
Properties
- value_source = PROVIDED
- type: Dictionary data type.
- values: Categories dictionary.
- null_values: optional array of String. The values that should be treated as NULL values. Default is ["null"].
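A minimal sketch of a provided categories source. The column name follows the categorical generator example; the category names and weights are illustrative assumptions.

```yaml
transformations:
  - columns:
      - transaction_type
    params:
      type: categorical_generator
      categories:
        value_source: PROVIDED
        type: STRING
        values:
          "PAYMENT": 70
          "REFUND": 25
          "CHARGEBACK": 0    # a weight of zero disables the category
        null_values:
          - "null"
```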
CSV file categories source
Properties
- value_source = CSV_FILE
- type: Dictionary data type.
- path: String. The path to the file on your file system.
- format: Parser configuration.
- null_values: optional array of String. The values that should be treated as NULL values. Default is ["null"].
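A minimal sketch of a CSV-backed categories source, using only the properties listed above. The file path and the extra null marker are hypothetical.

```yaml
# Fragment only: the categories block of a transformation.
categories:
  value_source: CSV_FILE
  type: STRING
  path: data/transaction_types.csv   # hypothetical path
  null_values:
    - "null"
    - "N/A"                          # assumed additional marker for NULL
```

With no format.columns setting, the file is expected to hold the category in the column with index 0 and the weight in the column with index 1, without a header row.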
Config categories source for multiple columns
The config source defines categories and weights for multiple columns directly within the configuration.
Properties
- value_source = MULTIPLE_PROVIDED
- column_types: Columns data types.
- null_values: optional array of String. The values that should be treated as NULL values. Default is ["null"].
- category_values: array of Config categories source.
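A tentative sketch of a multi-column provided source. The column names and values are hypothetical, and the values key inside each category_values entry is an assumption inferred from the "Used in: values" note on the categories dictionary for multiple columns; only weight is listed explicitly for these entries.

```yaml
# Fragment only; the structure of each category_values entry is partly assumed.
categories:
  value_source: MULTIPLE_PROVIDED
  column_types:
    country: STRING
    is_active: BOOLEAN
  category_values:
    - values:              # assumed key: one value per column
        country: "DE"
        is_active: "true"
      weight: 3
    - values:
        country: "FR"
        is_active: "false"
      weight: 1
```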
CSV file categories source for multiple columns
Properties
- value_source = MULTIPLE_CSV_FILE
- column_types: Columns data types.
- null_values: optional array of String. The values that should be treated as NULL values. Default is ["null"].
- path: String. The path to the file on your file system.
Hashing group selector
Used in: selector
Object.
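Depending on the type property value, can be one of the following:
- first: First N characters
- last: Last N characters
- characters: Specified characters
- substring: Specified substring
- regex: Regex filter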
First N characters
Mask only the first N characters of the input string.
Properties
- type = first
- n: Integer (int32).
Last N characters
Mask only the last N characters of the input string.
Properties
- type = last
- n: Integer (int32).
Specified characters
Mask only the specified characters of the input string.
Properties
- type = characters
- characters: String.
- ignore_case: Boolean.
Specified substring
Mask only the specified substring of the input string.
Properties
- type = substring
- substring: String.
- ignore_case: Boolean.
Regex filter
Mask only the characters matched by the regex.
Properties
- type = regex
- pattern: String.
- ignore_case: Boolean.
Custom alphabet
Custom alphabet which can consist of characters, Unicode blocks, and Unicode ranges. In total, it can contain from 1 to 2^16 characters.
Properties
- type = custom
- parts: optional array of Custom alphabet part.
Element’s source provided from a configuration
Properties
- value_source = PROVIDED
- column_type: Dictionary data type.
- values: array of optional String. The list of repeatable elements.
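A minimal sketch of an element source provided from the configuration. The values are illustrative assumptions, and the enclosing transformer is omitted.

```yaml
# Fragment only: the source block of a transformation.
source:
  value_source: PROVIDED
  column_type: STRING
  values:
    - "red"
    - "green"
    - "blue"
```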
Element’s source provided from a CSV file
Properties
- value_source = CSV_FILE
- path: String. The path to the file on your file system.
- format: Parser configuration.
- column_types: Columns data types.
- null_values: optional array of String. The values that should be treated as NULL values. Default is ["null"].
Multi columns element’s source provided from a configuration
Properties
- value_source = MULTIPLE_PROVIDED
- column_types: Columns data types.
- values: array of Elements dictionary for multiple columns.
Dictionary data type
Used in: type, column_type
optional String.
Data type of categories.
Modes:
- STRING (default): interpret values as strings
- BOOLEAN: interpret values as booleans
- NUMERIC: interpret values as doubles
Enum values
- STRING
- BOOLEAN
- NUMERIC
Categories dictionary
Used in: values
map of String keys to Number.
The map can store keys in one of three formats: STRING, BOOLEAN or NUMERIC. The weights must be equal to or greater than zero.
Example of formats:
----
"23": 123
"owl": 534
"true": 532
----
To disable one of the categories, set its weight value to zero.
Parser configuration
Used in: format
optional Object.
Properties
- encoding: String. The encoding of the file.
- delimiter: String. A custom delimiter.
- trim: Boolean. Removes leading and trailing spaces from each cell.
- columns: Column value extract method.
Columns data types
Used in: column_types
optional map of String keys to Dictionary data type.
Config categories source
Used in: category_values
The values for multiple columns of a single category.
Properties
- weight: Number.
Parser configuration for multiple columns
Used in: format
optional Object.
Properties
- encoding: String. The encoding of the file.
- delimiter: String. A custom delimiter.
- trim: Boolean. Removes leading and trailing spaces from each cell.
Word characters (equivalent of regex '\w+' as described at this link)
Custom alphabet part
Used in: parts
optional Object.
Depending on the type property value, can be one of the following:
- characters: Character set
- unicode_block: Unicode block
- unicode_range: Unicode range
Parser configuration
Used in: format
optional Object.
Properties
- encoding: String. The encoding of the file.
- delimiter: String. A custom delimiter.
- trim: Boolean. Removes leading and trailing spaces from each cell.
- columns: Column value extract method.
Elements dictionary for multiple columns
Used in: values
map of String keys to optional String.
Column value extract method
Used in: columns
optional Object.
There are multiple ways to access categories and weights from a CSV file. By default, when the columns parameter is absent, the generator expects a file with more than one column, where the column with index 0 represents the category and the column with index 1 represents the weight. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.
Depending on the column_accessor_type property value, can be one of the following:
- NAME: columns are selected by name
- INDEX: columns are selected by index
Categories dictionary for multiple columns
Used in: values
map of String keys to String.
Column value extract method for multiple columns
Used in: columns
optional Object.
There are multiple ways to access categories and weights from a CSV file. By default, when the columns parameter is absent, the generator expects a file with (table columns size + 1) columns, where the column with index 0 represents the first category and the last column represents the weight. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.
Depending on the column_accessor_type property value, can be one of the following:
- NAME: columns are selected by name
- INDEX: columns are selected by index
Character set
Custom alphabet which can consist of 1 to 2^16 characters. All printable characters from the Unicode Basic Multilingual Plane are supported.
Properties
- type = characters
- characters: String.
Unicode block
Unicode block by name. The name of the Unicode block is formatted as described in Java’s UnicodeBlock documentation. Examples: "BASIC_LATIN", "Basic Latin". Only blocks from the BMP (code points from 0x0000 to 0xFFFF) are supported. You can refer to the Unicode specification to find the range for a block of interest.
Properties
- type = unicode_block
- name: String.
Unicode range
Unicode range of characters, specified by the integer codes of the first and last characters of the range. You can use ranges from 0x0000 to 0xFFFF.
Properties
- type = unicode_range
- from: Integer (int32).
- to: Integer (int32).
Column value extract method
Used in: columns
optional Object.
There are several ways to access the categories and weights from a CSV file. By default, if the columns parameter is not specified, the generator will map the columns from the file to the transformation columns in order. In the default configuration, file headers are not accepted. However, if you need to work with specific columns, you can specify which columns will provide data using either the NAME or INDEX column parameter.
Depending on the column_accessor_type property value, can be one of the following:
- NAME: columns are selected by name
- INDEX: columns are selected by index
String column name value extract
Properties
- column_accessor_type = NAME
- categories: String.
- weights: String.
Column index value extract
Properties
- column_accessor_type = INDEX
- categories: Integer.
- weights: Integer.
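A minimal sketch of a parser configuration that selects the category and weight columns by index. The delimiter, the encoding, and the column positions are illustrative assumptions; indexes are assumed to be zero-based, in line with the default layout where index 0 holds the category.

```yaml
# Fragment only: the format block of a CSV_FILE source.
format:
  encoding: "UTF-8"
  delimiter: ";"
  trim: true
  columns:
    column_accessor_type: INDEX
    categories: 2    # third column of the file, assuming zero-based indexes
    weights: 3       # fourth column of the file
```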
String column name value extract for multiple columns
Properties
- column_accessor_type = NAME
- categories: Categories dictionary for multiple columns.
- weights: String.
Column index value extract for multiple columns
Properties
- column_accessor_type = INDEX
- categories: Categories dictionary for multiple columns.
- weights: Integer.
Categories dictionary for multiple columns
Used in: categories
map of String keys to String.
Categories dictionary for multiple columns
Used in: categories
map of String keys to Integer.