Transformations List
Generators List
Generators are compatible with all table modes.
categorical_generator
Randomly sample from a given set of categories and probabilities. Categories and probabilities can be provided or learned from the data. If either one is provided, the other is required as well.
Parameters:
categories: List<String>?
: List of categories to be sampled from. Supports string and boolean types.
probabilities: List<Double>?
: Probabilities for each category (must have the same size as categories).
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: STRING, BOOLEAN
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "transaction_type"
    params:
      type: "categorical_generator"
      categories:
        type: string
        values:
          - "SENT"
          - "RECEIVED"
      probabilities:
        - 0.6
        - 0.4
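The sampling rule described for categorical_generator can be illustrated outside the tool. The Python sketch below is not the product's implementation; `sample_categories` and its `seed` argument are hypothetical names used only to show weighted sampling with the standard library.

```python
import random


def sample_categories(categories, probabilities, n, seed=None):
    """Draw n values from `categories`, using `probabilities` as weights."""
    if len(categories) != len(probabilities):
        raise ValueError("categories and probabilities must have the same size")
    rng = random.Random(seed)
    return rng.choices(categories, weights=probabilities, k=n)


# Hypothetical usage with the categories from the example configuration.
values = sample_categories(["SENT", "RECEIVED"], [0.6, 0.4], n=10_000, seed=0)
share_sent = values.count("SENT") / len(values)
```

With enough rows, `share_sent` should land close to 0.6, which is a quick way to sanity-check a configured probability list.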
continuous_generator
Output data is sampled from a parameterized continuous distribution. If parameters are not given, they will be fitted from the original data.
Parameters:
mean: Double?
: Mean of the sampled distribution
std: Double?
: Standard deviation
min: Double?
: Minimum value
max: Double?
: Maximum value
round: Int = 0
: If given, output data will be rounded to this number of digits
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "amount"
    params:
      type: "continuous_generator"
      mean: 354.21
      std: 98.96
      min: 0.0
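To make the roles of mean, std, min, max and round concrete, here is a minimal Python sketch. The documentation does not say which distribution family is used, so the normal distribution here is an assumption, as is the choice to clamp out-of-range values to the bounds. `sample_continuous` and its argument names are hypothetical.

```python
import random


def sample_continuous(mean, std, n, min_value=None, max_value=None,
                      round_digits=None, seed=None):
    """Sample n values from a normal distribution (assumed), then clamp and round."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.gauss(mean, std)
        if min_value is not None:
            x = max(x, min_value)  # clamp to the lower bound (assumption)
        if max_value is not None:
            x = min(x, max_value)  # clamp to the upper bound (assumption)
        if round_digits is not None:
            x = round(x, round_digits)
        out.append(x)
    return out


# Hypothetical usage with the values from the example configuration.
amounts = sample_continuous(354.21, 98.96, n=5_000, min_value=0.0,
                            round_digits=2, seed=0)
```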
quantile_generator
Given a list of probabilities and bin edges, the output data is sampled from a mixture of uniform distributions, where each uniform distribution i is chosen with probability hist[i] and its edges are given by bin_edges[i] and bin_edges[i + 1]. If parameters are not given, they will be fitted from the original data.
Parameters:
hist: List<Double>?
: Probabilities of each uniform distribution.
bin_edges: List<Double>?
: Bin edges of each uniform distribution.
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "amount"
    params:
      type: "quantile_generator"
      hist: [0.3, 0.45, 0.25]
      bin_edges: [-2.1, 0.0, 3.4, 5.6]
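The mixture-of-uniforms rule can be sketched in a few lines of Python. This is an illustration of the description only, not the tool's code; `sample_quantile` is a hypothetical name. It assumes `hist` holds one probability per bin, so `bin_edges` has exactly one more element than `hist`.

```python
import random


def sample_quantile(hist, bin_edges, n, seed=None):
    """Pick bin i with probability hist[i], then sample uniformly inside it."""
    if len(bin_edges) != len(hist) + 1:
        raise ValueError("bin_edges must have one more element than hist")
    rng = random.Random(seed)
    bins = rng.choices(range(len(hist)), weights=hist, k=n)
    return [rng.uniform(bin_edges[i], bin_edges[i + 1]) for i in bins]


# Three bins: [-2.1, 0.0], [0.0, 3.4] and [3.4, 5.6].
samples = sample_quantile([0.3, 0.45, 0.25], [-2.1, 0.0, 3.4, 5.6],
                          n=10_000, seed=0)
share_first_bin = sum(1 for x in samples if x < 0.0) / len(samples)
```

Roughly 30% of the samples should fall in the first bin, matching `hist[0]`.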
date_generator
Output data is sampled from a parameterized continuous distribution and transformed into dates. If parameters are not given, they will be extracted from the original data.
Parameters:
mean: LocalDateTime?
: Average date of the sampled distribution
std: Int?
: Standard deviation in milliseconds
min: LocalDateTime?
: Minimum value
max: LocalDateTime?
: Maximum value
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: DATE
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "date_of_birth"
    params:
      type: "date_generator"
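The following Python sketch shows one way a date can be drawn around a mean with a standard deviation expressed in milliseconds. The normal distribution and the clamping to min and max are assumptions, since the documentation does not specify them; `sample_dates` and all values below are hypothetical.

```python
import random
from datetime import datetime, timedelta


def sample_dates(mean, std_ms, n, min_date=None, max_date=None, seed=None):
    """Sample n datetimes around `mean`; std_ms is a spread in milliseconds."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        d = mean + timedelta(milliseconds=rng.gauss(0.0, std_ms))
        if min_date is not None and d < min_date:
            d = min_date  # clamp to the lower bound (assumption)
        if max_date is not None and d > max_date:
            d = max_date  # clamp to the upper bound (assumption)
        out.append(d)
    return out


# Hypothetical parameters: a spread of about five years around mid-1985.
FIVE_YEARS_MS = 5 * 365 * 24 * 3600 * 1000
birth_dates = sample_dates(datetime(1985, 6, 15), FIVE_YEARS_MS, n=1_000,
                           min_date=datetime(1950, 1, 1),
                           max_date=datetime(2005, 12, 31), seed=0)
```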
formatted_string_generator
Generate a string column based on a given pattern. If the pattern is not given, random characters are generated with a length similar to that of the original column.
Parameters:
pattern: String?
: Regular expression pattern used to sample data from
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: STRING
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "phone_number"
    params:
      type: "formatted_string_generator"
      pattern: "\\+44[0-9]{10}"
int_sequence_generator
Generate a sequence of unique integers, suitable for a unique ID column.
Parameters:
start_from: Int = 0
: Where to start the sequence from. Defaults to 0. If the generator is used on existing data, this should be set to the maximum of the existing data plus 1.
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: INTEGER
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "user_id"
    params:
      type: "int_sequence_generator"
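The advice on start_from can be shown with a short Python sketch. `int_sequence` and `existing_ids` are hypothetical names; the point is only that starting at the existing maximum plus 1 avoids collisions with IDs already present.

```python
from itertools import count, islice


def int_sequence(start_from=0):
    """Yield start_from, start_from + 1, ... without repeating a value."""
    return count(start_from)


# IDs already present in the table (hypothetical data).
existing_ids = [3, 7, 12]

# Start after the current maximum so new IDs cannot collide with old ones.
new_ids = list(islice(int_sequence(start_from=max(existing_ids) + 1), 5))
# new_ids == [13, 14, 15, 16, 17]
```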
string_sequence_generator
Generate a sequence of unique strings made of uppercase letters and digits, suitable for a unique ID column.
Parameters:
length: Int?
: Maximum length of the column, extracted from the database DDL if not given
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: STRING
Supports multiple columns: No
Example:
column_params:
  - columns:
      - "country_id"
    params:
      type: "string_sequence_generator"
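One simple way to obtain unique strings of uppercase letters and digits is to write a counter in base 36. The Python sketch below does that; it is an illustration under assumptions (fixed-width, zero-padded output, and the hypothetical name `string_id`), not a description of how the tool builds its sequence.

```python
import string

# 36 symbols: digits first, then uppercase letters.
ALPHABET = string.digits + string.ascii_uppercase


def string_id(index, length):
    """Encode a non-negative counter as a fixed-width base-36 string."""
    if index < 0 or index >= len(ALPHABET) ** length:
        raise ValueError("index does not fit in the requested length")
    chars = []
    for _ in range(length):
        index, remainder = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


# string_id(0, 3) == "000", string_id(35, 3) == "00Z", string_id(36, 3) == "010"
first_ids = [string_id(i, 3) for i in range(100)]
```

Because each counter value maps to a different string, uniqueness follows from the counter itself; a length of 3 already allows 46,656 distinct IDs.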
null_generator
The output column is filled with null values.
No parameters.
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Example:
column_params:
  - columns:
      - "empty_column"
    params:
      type: "null_generator"
constant
Generate a single numeric value for the entire column.
Parameters:
value: Number?
: Numeric value to generate
Compatible modes: GENERATION MASKING
Compatible column data types: NUMERIC
Supports multiple columns: No
Example:
column_params:
  - columns: ["balance"]
    params:
      type: "constant"
      value: 0.0
person_generator
Generate personal fields (e.g. name, surname, title) and keep them consistent across columns.
Parameters:
column_templates: List<String>
: For each column, the template used to generate personal data.
consistent_with_column: String?
: If given, the column that generated values are kept consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same name.
locale: String = 'en-GB'
: Locale used to generate names from different geographical areas. Defaults to 'en-GB', which corresponds to British names.
Available templates are:
${email}
${first_name}
${male_first_name}
${female_first_name}
${last_name}
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: STRING
Supports multiple columns: Yes
Example for several columns:
column_params:
  - columns: ["first_name", "last_name"]
    params:
      type: "person_generator"
      column_templates: ["${first_name}", "${last_name}"]
Example for a single column:
column_params:
  - columns: ["full_name"]
    params:
      type: "person_generator"
      column_templates: ["${first_name} ${last_name}"]
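The interaction between column_templates and consistent_with_column can be sketched in Python, whose `string.Template` happens to use the same `${name}` placeholder syntax. Everything here is hypothetical (the name lists, `generate_people`, the row format); it only illustrates filling one template per output column and reusing the same person for rows that share a key.

```python
import random
from string import Template

# Tiny hypothetical name pools; a real locale would supply far more values.
FIRST_NAMES = ["Alice", "Oliver", "Amelia", "George"]
LAST_NAMES = ["Smith", "Jones", "Taylor", "Brown"]


def generate_people(rows, column_templates, consistent_with_column=None, seed=None):
    """Return one list of rendered templates per input row."""
    rng = random.Random(seed)
    cache = {}  # key value -> person already generated for that key
    output = []
    for row in rows:
        key = row.get(consistent_with_column) if consistent_with_column else None
        if key is not None and key in cache:
            person = cache[key]
        else:
            person = {"first_name": rng.choice(FIRST_NAMES),
                      "last_name": rng.choice(LAST_NAMES)}
            if key is not None:
                cache[key] = person
        output.append([Template(t).substitute(person) for t in column_templates])
    return output


rows = [{"user_id": 1}, {"user_id": 2}, {"user_id": 1}]
people = generate_people(rows, ["${first_name}", "${last_name}"],
                         consistent_with_column="user_id", seed=0)
# people[0] == people[2], because both rows share user_id 1
```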
address_generator
Generate address fields (e.g. street, zip code) and keep them consistent across columns.
Parameters:
column_templates: List<String>
: For each column, the template used to generate address data.
consistent_with_column: String?
: If given, the column that generated values are kept consistent with. For example, if consistent_with_column="user_id", all people with the same user_id will have the same street.
locale: String = 'en-GB'
: Locale used to generate addresses from different geographical areas. Defaults to 'en-GB', which corresponds to British addresses.
Available templates are:
${zip_code}
${country}
${city}
${street_name}
${house_number}
${flat_number}
Compatible modes: GENERATION MASKING KEEP
Compatible column data types: STRING
Supports multiple columns: Yes
Example for several columns:
column_params:
  - columns: ["street_name", "zip_code"]
    params:
      type: "address_generator"
      column_templates: ["${street_name}", "${zip_code}"]
Example for a single column:
column_params:
  - columns: ["address"]
    params:
      type: "address_generator"
      column_templates: ["${country}, ${city}, ${street_name}, ${house_number}, ${flat_number}, ${zip_code}"]
Masking Transformers
Masking transformers can only be applied in KEEP and MASKING modes. When these transformers are applied, the output contains a one-to-one transformation of each input row.
passthrough
The output data is equal to the input, no transformation is applied.
No parameters.
Compatible modes: MASKING KEEP
Compatible column data types: ANY
Supports multiple columns: Yes
Example:
column_params:
  - columns: ["customer_number", "plate"]
    params:
      type: "passthrough"
format_preserving_hashing
A hash transformation is applied to each alphanumeric character in a given text so that the output preserves the format but contains different characters. This transformation is secure and non-reversible.
No parameters.
Compatible modes: MASKING KEEP
Compatible column data types: STRING
Supports multiple columns: No
Example:
column_params:
  - columns: ["registration_number"]
    params:
      type: "format_preserving_hashing"
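To illustrate what "preserves the format" means, the Python sketch below replaces each ASCII digit with a digit, each uppercase letter with an uppercase letter, and each lowercase letter with a lowercase letter, leaving every other character untouched. The use of HMAC-SHA256, the `key` argument and the name `format_preserving_hash` are assumptions made for the example; the documentation does not describe the actual algorithm.

```python
import hashlib
import hmac
import string

# Character classes whose members are replaced within the same class.
CLASSES = (string.digits, string.ascii_uppercase, string.ascii_lowercase)


def format_preserving_hash(text, key):
    """Replace each ASCII alphanumeric character, keeping its character class."""
    out = []
    for position, ch in enumerate(text):
        alphabet = next((a for a in CLASSES if ch in a), None)
        if alphabet is None:
            out.append(ch)  # separators and other symbols are kept as they are
            continue
        # Keyed digest of the position and the whole input (illustrative choice).
        message = f"{position}:{text}".encode("utf-8")
        digest = hmac.new(key, message, hashlib.sha256).digest()
        out.append(alphabet[digest[0] % len(alphabet)])
    return "".join(out)


masked = format_preserving_hash("AB-1234 cd", b"hypothetical-secret-key")
```

The result has the same length and layout as the input and is deterministic for a given key. In this simplified version an individual character can occasionally map to itself.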
noising
Add Laplacian noise to the input column to protect privacy while keeping output values similar to the input.
Parameters:
sensitivity: Float
: Amount of noise to be added
min: Float?
: If there is a hard minimum, output values below it are truncated to it
max: Float?
: If there is a hard maximum, output values above it are truncated to it
Compatible modes: MASKING KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
Example:
column_params:
  - columns: ["product_price"]
    params:
      type: "noising"
      sensitivity: 23.47
      min: 0
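A minimal Python sketch of Laplacian noising with truncation follows. Treating sensitivity as the scale of the Laplace distribution is an assumption, since the documentation only calls it the "amount of noise"; `add_laplace_noise` and the sample prices are hypothetical.

```python
import random


def add_laplace_noise(values, sensitivity, min_value=None, max_value=None, seed=None):
    """Add zero-mean Laplace noise to each value, then truncate to the bounds."""
    rng = random.Random(seed)
    out = []
    for v in values:
        # The difference of two Exp(1) draws follows a Laplace(0, 1) distribution.
        noise = sensitivity * (rng.expovariate(1.0) - rng.expovariate(1.0))
        x = v + noise
        if min_value is not None:
            x = max(x, min_value)
        if max_value is not None:
            x = min(x, max_value)
        out.append(x)
    return out


prices = [19.99, 250.0, 3.5, 1200.0]
noisy_prices = add_laplace_noise(prices, sensitivity=23.47, min_value=0, seed=0)
```

Each output row corresponds to exactly one input row, and small inputs such as 3.5 can be truncated to the minimum when the drawn noise is negative.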
redaction
Some characters in the input string are replaced with the same masking character, producing partially masked text in the output.
Parameters:
action: str = "KEEP"
: Whether to KEEP or MASK the characters selected by which. Defaults to KEEP.
which: str = "LAST"
: Which characters (LAST or FIRST) to mask or keep, depending on action.
count: int = 4
: Number of characters to be masked or kept. Defaults to 4.
mask_with: char = '*'
: Character used to mask values. Defaults to '*'.
Compatible modes: MASKING KEEP
Compatible column data types: NUMERIC
Supports multiple columns: No
Example:
column_params:
  - columns: ["credit_card"]
    params:
      type: "redaction"
      action: "MASK"
      which: "FIRST"
      count: 4
      mask_with: "#"
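How action, which and count combine is easiest to see in code. The Python sketch below mirrors the four parameters from the list above; the function name `redact` and the sample card number are hypothetical, and this is an illustration of the documented behavior rather than the tool's source.

```python
def redact(text, action="KEEP", which="LAST", count=4, mask_with="*"):
    """Mask or keep the first or last `count` characters of `text`."""
    if action not in ("KEEP", "MASK") or which not in ("FIRST", "LAST"):
        raise ValueError("action must be KEEP or MASK, which must be FIRST or LAST")
    text = str(text)
    count = max(0, min(count, len(text)))
    # Split the text into a head and a tail around the selected characters.
    split = count if which == "FIRST" else len(text) - count
    head, tail = text[:split], text[split:]
    selected_is_head = which == "FIRST"
    # The head is masked when it is selected for MASK or not selected for KEEP.
    mask_head = (action == "MASK") == selected_is_head
    if mask_head:
        return mask_with * len(head) + tail
    return head + mask_with * len(tail)


masked_first = redact("4111111111111111", action="MASK", which="FIRST",
                      count=4, mask_with="#")   # "####111111111111"
kept_last = redact("4111111111111111")          # "************1111"
```

The defaults (KEEP, LAST, 4) give the familiar "last four digits visible" layout.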
unique_id_hashing
Apply a hash transformation to a given value so that the output is encrypted but structural coherence is preserved (the same input hashed with the same key always produces the same output). Output values are unique.
This transformation is applied to primary and foreign keys by default in MASKING mode.
No parameters.
Compatible modes: MASKING KEEP
Compatible column data types: INTEGER
Supports multiple columns: No
Example:
column_params:
  - columns: ["card_id"]
    params:
      type: "unique_id_hashing"
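The two properties stated for unique_id_hashing, determinism for a given key and uniqueness of outputs, can be obtained with a keyed permutation. The Python sketch below uses a small Feistel network over 64-bit integers as one possible construction; this is an assumption made for illustration, not the algorithm the tool is documented to use, and `hash_id` and the key are hypothetical.

```python
import hashlib
import hmac


def _round_value(key, round_no, half):
    """Keyed 32-bit round function built from HMAC-SHA256."""
    message = bytes([round_no]) + half.to_bytes(4, "big")
    return int.from_bytes(hmac.new(key, message, hashlib.sha256).digest()[:4], "big")


def hash_id(value, key, rounds=4):
    """Map a 64-bit non-negative integer to another one, one-to-one for a fixed key."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value must fit in 64 bits")
    left, right = value >> 32, value & 0xFFFFFFFF
    for round_no in range(rounds):
        # Each Feistel round is invertible, so the whole mapping is a bijection.
        left, right = right, left ^ _round_value(key, round_no, right)
    return (left << 32) | right


key = b"hypothetical-secret-key"
hashed_ids = [hash_id(i, key) for i in range(1_000)]
```

Because the mapping is a bijection, a primary key and the foreign keys that reference it stay aligned as long as the same key is used for both columns.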