Subsetting
Software prerequisites
To run the following examples on your machine, you should first:
-
Install TDK CLI
-
Download tdk-tutorial-databases repository
git clone https://github.com/synthesized-io/tdk-tutorial-databases
-
Start PostgreSQL instances and pgAdmin using the following commands:
cd tdk-tutorial-databases/ docker-compose run databases
pgAdmin will be available at http://localhost:8888 with the password postgres
.
In order to run TDK with the configurations provided in this tutorial, save them to config.yaml
file and execute the following command:
tdk \
--input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
--output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
--config-file config_masking.tdk.yaml
Sometimes we might only need a small subset of the production database. Here’s how we can do it:
default_config:
mode: KEEP
target_ratio: 0.5
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
We can use the same query we used to check the generation to verify the number of records in the tables of the resulting database:
select 'actor' as "table", count(1) as numrows from actor
union
select 'address', count(1) from address
union
select 'category', count(1) from category
union
select 'city', count(1) from city
union
select 'country', count(1) from country
union
select 'customer', count(1) from customer
union
select 'film', count(1) from film
union
select 'film_actor', count(1) from film_actor
union
select 'film_category', count(1) from film_category
union
select 'inventory', count(1) from inventory
union
select 'language', count(1) from language
union
select 'payment', count(1) from payment
union
select 'rental', count(1) from rental
union
select 'staff', count(1) from staff
union
select 'store', count(1) from store
As we can see, the number of records is reduced by half. However, for some cases, subsetting yields a lower or even zero number of records as we have to satisfy foreign key constraints:
tablename | numrows |
---|---|
actor |
100 |
addres |
131 |
category |
8 |
city |
280 |
film |
0 |
film_actor |
0 |
payment |
8025 |
rental |
0 |
By using target_ratio
only for specific tables, we can reduce the size of only the largest ones:
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.rental"
target_ratio: 0.1
- table_name_with_schema: "public.payment"
target_ratio: 0.1
- table_name_with_schema: "public.film"
transformations:
- columns: ["fulltext"]
params:
type: formatted_string_generator
pattern: "\\+44[0-9]{10}"
- columns: ["rating"]
params:
type: "categorical_generator"
categories:
value_source: PROVIDED
values:
"PG-13": 223
"NC-17": 21
"R": 195
"PG": 194
"G": 178
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
safety_mode: RELAXED
tablename | numrows |
---|---|
actor |
200 |
address |
603 |
category |
16 |
city |
600 |
film |
1000 |
film_actor |
5462 |
payment |
1605 |
rental |
1605 |