Subsetting

Software prerequisites

To run the following examples on your machine, you should first:

git clone https://github.com/synthesized-io/tdk-tutorial-databases
  • Start PostgreSQL instances and pgAdmin using the following commands:

cd tdk-tutorial-databases/
docker-compose run databases

pgAdmin will be available at http://localhost:8888 with the password postgres.

In order to run TDK with the configurations provided in this tutorial, save them to config.yaml file and execute the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config_masking.tdk.yaml

Sometimes we might only need a small subset of the production database. Here’s how we can do it:

default_config:
  mode: KEEP
  target_ratio: 0.5

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

We can use the same query we used to check the generation to verify the number of records in the tables of the resulting database:

select
    'actor' as "table",
    count(*) as numrows
from actor
union
select
    'address' as "table",
    count(*) as numrows
from address
union
select
    'category' as "table",
    count(*) as numrows
from category
union
select
    'city' as "table",
    count(*) as numrows
from city
union
select
    'country' as "table",
    count(*) as numrows
from country
union
select
    'customer' as "table",
    count(*) as numrows
from customer
union
select
    'film' as "table",
    count(*) as numrows
from film
union
select
    'film_actor' as "table",
    count(*) as numrows
from film_actor
union
select
    'film_category' as "table",
    count(*) as numrows
from film_category
union
select
    'inventory' as "table",
    count(*) as numrows
from inventory
union
select
    'language' as "table",
    count(*) as numrows
from language
union
select
    'payment' as "table",
    count(*) as numrows
from payment
union
select
    'rental' as "table",
    count(*) as numrows
from rental
union
select
    'staff' as "table",
    count(*) as numrows
from staff
union
select
    'store' as "table",
    count(*) as numrows
from store

As we can see, the number of records is reduced by half. However, for some cases, subsetting yields a lower or even zero number of records as we have to satisfy foreign key constraints:

Table 1. Table sizes in the subsetted database
tablename numrows

actor

100

addres

131

category

8

city

280

film

0

film_actor

0

payment

8025

rental

0

By using target_ratio only for specific tables, we can reduce the size of only the largest ones:

default_config:
  mode: KEEP

tables:

  - table_name_with_schema: "public.rental"
    target_ratio: 0.1

  - table_name_with_schema: "public.payment"
    target_ratio: 0.1

  - table_name_with_schema: "public.film"
    transformations:
      - columns: ["fulltext"]
        params:
          type: formatted_string_generator
          pattern: "\\+44[0-9]{10}"
      - columns: ["rating"]
        params:
          type: "categorical_generator"
          categories:
            value_source: PROVIDED
            values:
              "PG-13": 223
              "NC-17": 21
              "R": 195
              "PG": 194
              "G": 178


table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
safety_mode: RELAXED
Table 2. Table sizes in the database subsetted with an improved configuration
tablename numrows

actor

200

address

603

category

16

city

600

film

1000

film_actor

5462

payment

1605

rental

1605