Subsetting

Software prerequisites

To run the following examples on your machine, you should first:

git clone https://github.com/synthesized-io/tdk-tutorial-databases
  • Start PostgreSQL instances and pgAdmin using the following commands:

cd tdk-tutorial-databases/
docker-compose run databases

pgAdmin will be available at http://localhost:8888 with the password postgres.

In order to run TDK with the configurations provided in this tutorial, save them to config.yaml file and execute the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config_masking.tdk.yaml

Sometimes we might only need a small subset of the production database. Here’s how we can do it:

default_config:
  mode: KEEP
  target_ratio: 0.5

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

We can use the same query we used to check the generation to verify the number of records in the tables of the resulting database:

select 'actor' as "table", count(1) as numrows from actor
union
select 'address', count(1) from address
union
select 'category', count(1) from category
union
select 'city', count(1) from city
union
select 'country', count(1) from country
union
select 'customer', count(1) from customer
union
select 'film', count(1) from film
union
select 'film_actor', count(1) from film_actor
union
select 'film_category', count(1) from film_category
union
select 'inventory', count(1) from inventory
union
select 'language', count(1) from language
union
select 'payment', count(1) from payment
union
select 'rental', count(1) from rental
union
select 'staff', count(1) from staff
union
select 'store', count(1) from store

As we can see, the number of records is reduced by half. However, for some cases, subsetting yields a lower or even zero number of records as we have to satisfy foreign key constraints:

Table 1. Table sizes in the subsetted database
tablename numrows

actor

100

addres

131

category

8

city

280

film

0

film_actor

0

payment

8025

rental

0

By using target_ratio only for specific tables, we can reduce the size of only the largest ones:

default_config:
  mode: KEEP

tables:

  - table_name_with_schema: "public.rental"
    target_ratio: 0.1

  - table_name_with_schema: "public.payment"
    target_ratio: 0.1

  - table_name_with_schema: "public.film"
    transformations:
      - columns: ["fulltext"]
        params:
          type: formatted_string_generator
          pattern: "\\+44[0-9]{10}"
      - columns: ["rating"]
        params:
          type: "categorical_generator"
          categories:
            value_source: PROVIDED
            values:
              "PG-13": 223
              "NC-17": 21
              "R": 195
              "PG": 194
              "G": 178


table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
safety_mode: RELAXED
Table 2. Table sizes in the database subsetted with an improved configuration
tablename numrows

actor

200

address

603

category

16

city

600

film

1000

film_actor

5462

payment

1605

rental

1605