Generation

Instead of masking the original data, we can generate the data from scratch. It might be useful if we want to produce more data than we have in the source database. For example, if we want to perform load testing. Let’s generate some data:

default_config:
  mode: GENERATION
  target_ratio: 2

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"

We can check results using this SQL:

select * from (
  select 'customers' tableName, count(*) numRows from customers UNION
  select 'employees' tableName, count(*) numRows from employees UNION
  select 'orderdetails' tableName, count(*) numRows from orderdetails UNION
  select 'orders' tableName, count(*) numRows from orders UNION
  select 'payments' tableName, count(*) numRows from payments UNION
  select 'productlines' tableName, count(*) numRows from productlines UNION
  select 'products' tableName, count(*) numRows from products
) s order by 1;

Here we can check all relations works well:

SELECT
    orderNumber,
    orderDate,
    customerName,
    orderLineNumber,
    productName,
    quantityOrdered,
    priceEach
FROM
    orders
        INNER JOIN orderdetails
                   USING (orderNumber)
        INNER JOIN products
                   USING (productCode)
        INNER JOIN customers
                   USING (customerNumber)
ORDER BY
    orderNumber,
    orderLineNumber;

Again, we can override the default table setting in tables. In this example, we are setting a different target ratio for specific tables:

default_config:
  mode: GENERATION
  target_ratio: 2

tables:
  - table_name_with_schema: "public.products"
    mode: "GENERATION"
    target_ratio: 2
    transformations:
      - columns: [ "productname" ]
        params:
          type: "formatted_string_generator"
          pattern: "[A-Z]{16}"

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"
Some generators are "fake" ones like formatted_string_generator below, which produces values independently of the original data. But some use original data to fit a model and create values following underlying distribution.

Here is an example of using categorical_generator with provided probabilities:

default_config:
  mode: GENERATION
  target_ratio: 2

tables:
  - table_name_with_schema: "public.orders"
    mode: "GENERATION"
    transformations:
      - columns: [ "status" ]
        params:
          type: "categorical_generator"
          categories:
            type: string
            values:
              - "Shipped"
              - "Cancelled"
          probabilities:
            - 0.8
            - 0.2

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"

You omit categories and probabilities, and they will be learned from the original data.