Generation

Software prerequisites

To run the following examples on your machine, you should first:

  • Clone the tutorial repository:

git clone https://github.com/synthesized-io/tdk-tutorial-databases

  • Start the PostgreSQL instances and pgAdmin using the following commands:

cd tdk-tutorial-databases/
docker-compose run databases

pgAdmin will be available at http://localhost:8888 with the password postgres.

In order to run TDK with the configurations provided in this tutorial, save each configuration to a config.yaml file and execute the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config.yaml

Instead of masking the original data, we can generate new data. This approach might be useful if we need to produce more data than what is available in the source database, for example, for load testing. Let’s proceed to generate some data:

default_config:
  mode: GENERATION
  target_ratio: 2

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
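The effect of target_ratio is simple arithmetic: each output table should hold roughly target_ratio times the source row count. A minimal sketch, using the source sizes from Table 1 below (note that generated counts for child tables may deviate slightly from the exact ratio):

```python
# Sketch: with mode GENERATION and target_ratio = 2, each table in the
# output database should hold roughly twice the source row count.
# Source counts below are copied from Table 1.
source_counts = {
    "actor": 200, "address": 603, "category": 16, "city": 600,
    "film": 1000, "film_actor": 5462, "payment": 16049, "rental": 16044,
}
target_ratio = 2
expected = {table: rows * target_ratio for table, rows in source_counts.items()}
print(expected["actor"])    # 400
print(expected["payment"])  # 32098
```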

We can check the results using the following SQL:

select 'actor' as "table", count(1) as numrows from actor
union
select 'address', count(1) from address
union
select 'category', count(1) from category
union
select 'city', count(1) from city
union
select 'country', count(1) from country
union
select 'customer', count(1) from customer
union
select 'film', count(1) from film
union
select 'film_actor', count(1) from film_actor
union
select 'film_category', count(1) from film_category
union
select 'inventory', count(1) from inventory
union
select 'language', count(1) from language
union
select 'payment', count(1) from payment
union
select 'rental', count(1) from rental
union
select 'staff', count(1) from staff
union
select 'store', count(1) from store

The result on the source database will look approximately like this:

Table 1. Table sizes in the source database

tablename    numrows
actor            200
address          603
category          16
city             600
film            1000
film_actor      5462
payment        16049
rental         16044

After the generation, we can see that the number of records has roughly doubled in each table:

Table 2. Table sizes in the generated database

tablename    numrows
actor            400
address         1206
category          32
city            1200
film            2000
film_actor     10771
payment        32098
rental         32088

Here we can check that all relations in the customer→address→city→country chain are preserved. The result of this query on the output database should be twice as big as the result on the input database:

select count(*) from customer
  inner join address on customer.address_id = address.address_id
  inner join city on address.city_id = city.city_id
  inner join country on city.country_id = country.country_id

Again, we can override the default settings for individual tables in the tables section. In this example, we set a different target ratio for a specific table and generate its title column with a pattern:

default_config:
  mode: GENERATION
  target_ratio: 2

tables:
  - table_name_with_schema: "public.film"
    mode: GENERATION
    target_ratio: 3
    transformations:
      - columns: [ "title" ]
        params:
          type: "formatted_string_generator"
          pattern: "[A-Z]{16}"

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED

Some generators, such as formatted_string_generator above, are "fake" ones: they produce values independently of the original data. Others fit a model to the original data and generate values that follow its underlying distribution.
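As an illustration only, the kind of value the pattern "[A-Z]{16}" yields can be mimicked with Python's standard library. This is not TDK's implementation, just a sketch of the output shape:

```python
import random
import string

# Sketch: mimic the formatted_string_generator pattern "[A-Z]{16}",
# i.e. 16 random uppercase ASCII letters, independent of source data.
def fake_title() -> str:
    return "".join(random.choices(string.ascii_uppercase, k=16))

title = fake_title()
print(title)  # e.g. "QWJXKCPLRMZNVBGH"
```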

Here is an example of using categorical_generator with provided probabilities:

default_config:
  mode: GENERATION
  target_ratio: 2

tables:
  - table_name_with_schema: "public.film"
    mode: GENERATION
    target_ratio: 3
    transformations:
      - columns: [ "rating" ]
        params:
          type: "categorical_generator"
          categories:
            type: STRING
            value_source: PROVIDED
            values:
              "PG-13": 223
              "NC-17": 21
              "R": 195
              "PG": 194
              "G": 178

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
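The numbers next to each rating act as relative weights that define the sampling distribution (an assumption based on the example above). A rough Python sketch of equivalent sampling:

```python
import random

# Sketch: treat the provided values as relative weights and sample
# ratings proportionally (assumption about how PROVIDED values are used).
weights = {"PG-13": 223, "NC-17": 21, "R": 195, "PG": 194, "G": 178}
total = sum(weights.values())
probabilities = {rating: count / total for rating, count in weights.items()}

# PG-13 is the most likely rating: 223 / 811 ~= 0.275
sample = random.choices(list(weights), weights=list(weights.values()), k=5)
```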

If you omit the categories and probabilities, they will be learned from the original data.