Generation
Instead of masking the original data, we can generate the data from scratch. It might be useful if we want to produce more data than we have in the source database. For example, if we want to perform load testing. Let’s generate some data:
default_config:
mode: GENERATION
target_ratio: 2
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"
We can check results using this SQL:
select * from (
select 'customers' tableName, count(*) numRows from customers UNION
select 'employees' tableName, count(*) numRows from employees UNION
select 'orderdetails' tableName, count(*) numRows from orderdetails UNION
select 'orders' tableName, count(*) numRows from orders UNION
select 'payments' tableName, count(*) numRows from payments UNION
select 'productlines' tableName, count(*) numRows from productlines UNION
select 'products' tableName, count(*) numRows from products
) s order by 1;
Here we can check all relations works well:
SELECT
orderNumber,
orderDate,
customerName,
orderLineNumber,
productName,
quantityOrdered,
priceEach
FROM
orders
INNER JOIN orderdetails
USING (orderNumber)
INNER JOIN products
USING (productCode)
INNER JOIN customers
USING (customerNumber)
ORDER BY
orderNumber,
orderLineNumber;
Again, we can override the default table setting in tables
.
In this example, we are setting a different target ratio for specific tables:
default_config:
mode: GENERATION
target_ratio: 2
tables:
- table_name_with_schema: "public.products"
mode: "GENERATION"
target_ratio: 2
transformations:
- columns: [ "productname" ]
params:
type: "formatted_string_generator"
pattern: "[A-Z]{16}"
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"
Some generators are "fake" ones like formatted_string_generator below, which produces values independently of the original data. But some use original data to fit a model and create values following underlying distribution.
|
Here is an example of using categorical_generator
with provided probabilities:
default_config:
mode: GENERATION
target_ratio: 2
tables:
- table_name_with_schema: "public.orders"
mode: "GENERATION"
transformations:
- columns: [ "status" ]
params:
type: "categorical_generator"
categories:
type: string
values:
- "Shipped"
- "Cancelled"
probabilities:
- 0.8
- 0.2
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"
You omit categories
and probabilities
, and they will be learned from the original data.