Generation
Software prerequisites
To run the following examples on your machine, you should first:

- Install the TDK CLI.
- Download the tdk-tutorial-databases repository:

  ```shell
  git clone https://github.com/synthesized-io/tdk-tutorial-databases
  ```

- Start the PostgreSQL instances and pgAdmin using the following commands:

  ```shell
  cd tdk-tutorial-databases/
  docker-compose run databases
  ```

  pgAdmin will be available at http://localhost:8888 with the password postgres.
In order to run TDK with the configurations provided in this tutorial, save them to a config.yaml file and execute the following command:

```shell
tdk \
  --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
  --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
  --config-file config.yaml
```
Instead of masking the original data, we can generate new data. This approach can be useful when we need to produce more data than is available in the source database, for example for load testing. Let's generate some data:
```yaml
default_config:
  mode: GENERATION
  target_ratio: 2
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
```
We can check the results with the following SQL query:
```sql
select 'actor' as "table", count(*) as numrows from actor
union
select 'address' as "table", count(*) as numrows from address
union
select 'category' as "table", count(*) as numrows from category
union
select 'city' as "table", count(*) as numrows from city
union
select 'country' as "table", count(*) as numrows from country
union
select 'customer' as "table", count(*) as numrows from customer
union
select 'film' as "table", count(*) as numrows from film
union
select 'film_actor' as "table", count(*) as numrows from film_actor
union
select 'film_category' as "table", count(*) as numrows from film_category
union
select 'inventory' as "table", count(*) as numrows from inventory
union
select 'language' as "table", count(*) as numrows from language
union
select 'payment' as "table", count(*) as numrows from payment
union
select 'rental' as "table", count(*) as numrows from rental
union
select 'staff' as "table", count(*) as numrows from staff
union
select 'store' as "table", count(*) as numrows from store
```
The result on the source database will look approximately like this (a subset of the tables shown):
| tablename | numrows |
|---|---|
| actor | 200 |
| address | 603 |
| category | 16 |
| city | 600 |
| film | 1000 |
| film_actor | 5462 |
| payment | 16049 |
| rental | 16044 |
After the generation, we can see that the number of records has roughly doubled in each of the tables:
| tablename | numrows |
|---|---|
| actor | 400 |
| address | 1206 |
| category | 32 |
| city | 1200 |
| film | 2000 |
| film_actor | 10771 |
| payment | 32098 |
| rental | 32088 |
Here we can check that all relations in the customer→address→city→country chain are preserved. The result of this query on the output database should be twice as large as the result on the input database:
```sql
select count(*)
from customer
inner join address on customer.address_id = address.address_id
inner join city on address.city_id = city.city_id
inner join country on city.country_id = country.country_id
```
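To check referential integrity more directly, you can also look for orphaned foreign keys in the output database. This is a sketch for one link of the chain; an empty result means no customer points at a missing address:

```sql
-- A sketch: customers whose address_id references a non-existent address.
select customer.customer_id
from customer
left join address on customer.address_id = address.address_id
where address.address_id is null;
```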
Again, we can override the default table settings in the `tables` section. In this example, we are setting a different target ratio for specific tables:
```yaml
default_config:
  mode: GENERATION
  target_ratio: 2
tables:
  - table_name_with_schema: "public.film"
    mode: GENERATION
    target_ratio: 3
    transformations:
      - columns: [ "title" ]
        params:
          type: "formatted_string_generator"
          pattern: "[A-Z]{16}"
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
```
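After running TDK with this configuration, you can spot-check the generated titles against the configured pattern. A convenience sketch; the count should be zero:

```sql
-- A sketch: titles in the output database that do NOT match the
-- [A-Z]{16} pattern configured above; expect this to return 0.
select count(*)
from film
where title !~ '^[A-Z]{16}$';
```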
Some generators are "fake" ones like `formatted_string_generator` above, which produce values independently of the original data. Others use the original data to fit a model and create values that follow the underlying distribution.

Here is an example of using `categorical_generator` with provided probabilities:
```yaml
default_config:
  mode: GENERATION
  target_ratio: 2
tables:
  - table_name_with_schema: "public.film"
    mode: GENERATION
    target_ratio: 3
    transformations:
      - columns: [ "rating" ]
        params:
          type: "categorical_generator"
          categories:
            value_source: PROVIDED
            values:
              "PG-13": 223
              "NC-17": 21
              "R": 195
              "PG": 194
              "G": 178
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
```
You can omit `categories` and the probabilities, and they will be learned from the original data.
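In either case, you can inspect the distribution of ratings that ends up in the output database and compare it with the weights provided above. A convenience sketch, not part of TDK:

```sql
-- A sketch: frequency and percentage share of each rating
-- in the generated film table of the output database.
select rating,
       count(*) as numrows,
       round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
from film
group by rating
order by numrows desc;
```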