Subsetting
Software prerequisites
To run the following examples on your machine, you should first:
-
Install TDK CLI
-
Download tdk-tutorial-databases repository
git clone https://github.com/synthesized-io/tdk-tutorial-databases
-
Start PostgreSQL instances and pgAdmin using the following commands:
cd tdk-tutorial-databases/ docker-compose run databases
pgAdmin will be available at http://localhost:8888 with the password postgres
.
In order to run TDK with the configurations provided in this tutorial, save them to config.yaml
file and execute the following command:
tdk \
--input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
--output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
--config-file config_masking.tdk.yaml
Sometimes we might only need a small subset of the production database. Here’s how we can do it:
default_config:
mode: KEEP
target_ratio: 0.5
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
We can use the same query we used to check the generation to verify the number of records in the tables of the resulting database:
select
'actor' as "table",
count(*) as numrows
from actor
union
select
'address' as "table",
count(*) as numrows
from address
union
select
'category' as "table",
count(*) as numrows
from category
union
select
'city' as "table",
count(*) as numrows
from city
union
select
'country' as "table",
count(*) as numrows
from country
union
select
'customer' as "table",
count(*) as numrows
from customer
union
select
'film' as "table",
count(*) as numrows
from film
union
select
'film_actor' as "table",
count(*) as numrows
from film_actor
union
select
'film_category' as "table",
count(*) as numrows
from film_category
union
select
'inventory' as "table",
count(*) as numrows
from inventory
union
select
'language' as "table",
count(*) as numrows
from language
union
select
'payment' as "table",
count(*) as numrows
from payment
union
select
'rental' as "table",
count(*) as numrows
from rental
union
select
'staff' as "table",
count(*) as numrows
from staff
union
select
'store' as "table",
count(*) as numrows
from store
As we can see, the number of records is reduced by half. However, for some cases, subsetting yields a lower or even zero number of records as we have to satisfy foreign key constraints:
tablename | numrows |
---|---|
actor |
100 |
addres |
131 |
category |
8 |
city |
280 |
film |
0 |
film_actor |
0 |
payment |
8025 |
rental |
0 |
By using target_ratio
only for specific tables, we can reduce the size of only the largest ones:
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.rental"
target_ratio: 0.1
- table_name_with_schema: "public.payment"
target_ratio: 0.1
- table_name_with_schema: "public.film"
transformations:
- columns: ["fulltext"]
params:
type: formatted_string_generator
pattern: "\\+44[0-9]{10}"
- columns: ["rating"]
params:
type: "categorical_generator"
categories:
value_source: PROVIDED
values:
"PG-13": 223
"NC-17": 21
"R": 195
"PG": 194
"G": 178
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
safety_mode: RELAXED
tablename | numrows |
---|---|
actor |
200 |
address |
603 |
category |
16 |
city |
600 |
film |
1000 |
film_actor |
5462 |
payment |
1605 |
rental |
1605 |