Subsetting

Software prerequisites

To run the following examples on your machine, you should first:

Install TDK CLI
Download tdk-tutorial-databases repository

git clone https://github.com/synthesized-io/tdk-tutorial-databases

Start PostgreSQL instances and pgAdmin using the following commands:

cd tdk-tutorial-databases/
docker-compose run databases

pgAdmin will be available at http://localhost:8888 with the password postgres.

In order to run TDK with the configurations provided in this tutorial, save them to config.yaml file and execute the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config_masking.tdk.yaml

Sometimes we might only need a small subset of the production database. Here’s how we can do it:

default_config:
  mode: KEEP
  target_ratio: 0.5

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

We can use the same query we used to check the generation to verify the number of records in the tables of the resulting database:

select 'actor' as "table", count(1) as numrows from actor
union
select 'address', count(1) from address
union
select 'category', count(1) from category
union
select 'city', count(1) from city
union
select 'country', count(1) from country
union
select 'customer', count(1) from customer
union
select 'film', count(1) from film
union
select 'film_actor', count(1) from film_actor
union
select 'film_category', count(1) from film_category
union
select 'inventory', count(1) from inventory
union
select 'language', count(1) from language
union
select 'payment', count(1) from payment
union
select 'rental', count(1) from rental
union
select 'staff', count(1) from staff
union
select 'store', count(1) from store

As we can see, the number of records is reduced by half. However, for some cases, subsetting yields a lower or even zero number of records as we have to satisfy foreign key constraints:

Table 1. Table sizes in the subsetted database
tablename	numrows
actor	100
addres	131
category	8
city	280
film	0
film_actor	0
payment	8025
rental	0

By using target_ratio only for specific tables, we can reduce the size of only the largest ones:

default_config:
  mode: KEEP

tables:

  - table_name_with_schema: "public.rental"
    target_ratio: 0.1

  - table_name_with_schema: "public.payment"
    target_ratio: 0.1

  - table_name_with_schema: "public.film"
    transformations:
      - columns: ["fulltext"]
        params:
          type: formatted_string_generator
          pattern: "\\+44[0-9]{10}"
      - columns: ["rating"]
        params:
          type: "categorical_generator"
          categories:
            value_source: PROVIDED
            values:
              "PG-13": 223
              "NC-17": 21
              "R": 195
              "PG": 194
              "G": 178


table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
safety_mode: RELAXED

Table 2. Table sizes in the database subsetted with an improved configuration
tablename	numrows
actor	200
address	603
category	16
city	600
film	1000
film_actor	5462
payment	1605
rental	1605