Data Filtering
Software prerequisites
To run the following examples on your machine, you should first:
-
Install TDK CLI
-
Download tdk-tutorial-databases repository
git clone https://github.com/synthesized-io/tdk-tutorial-databases
-
Start PostgreSQL instances and pgAdmin using the following commands:
cd tdk-tutorial-databases/ docker-compose run databases
pgAdmin will be available at http://localhost:8888 with the password postgres
.
In order to run TDK with the configurations provided in this tutorial, save them to config.yaml
file and execute the following command:
tdk \
--input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
--output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
--config-file config_masking.tdk.yaml
By default, TDK reads entire tables from a source database. However, we can add a custom data filter for specific tables. For the data filter, we can use any valid SQL constructs, including DBMS-specific ones, such as logical operators, functions, subqueries, and so on. Any SQL expression that can be used as the search condition in a WHERE
clause.
Let’s treat real business cases where data filtering might be useful, using the Pagila sample database. For simplicity, we will focus on three related tables rental
→ inventory
→ film
:
Filtering by film ID
There are many cases where we may want to hide data belonging to specific entities during transformations. For example, we may want to hide all information about inventories and rentals, except for the "Alabama Devil" film film_id = 9
. The source database contains information about the inventory
and rental
. Only a small fraction of the thousands of records in these tables relate to the "Alabama Devil" film:
SELECT * FROM public.inventory WHERE film_id = 9
inventory_id | film_id | store_id | last_update |
---|---|---|---|
41 |
9 |
1 |
2022-02-15 10:09:17+00 |
42 |
9 |
1 |
2022-02-15 10:09:17+00 |
43 |
9 |
1 |
2022-02-15 10:09:17+00 |
44 |
9 |
2 |
2022-02-15 10:09:17+00 |
45 |
9 |
2 |
2022-02-15 10:09:17+00 |
SELECT * FROM public.rental
WHERE inventory_id in (
SELECT inventory_id FROM public.inventory WHERE film_id = 9
)
rental_id | rental_date | inventory_id | customer_id | return_date | staff_id | last_update |
---|---|---|---|---|---|---|
5745 |
2022-07-10 11:10:11+00 |
41 |
305 |
2022-07-19 05:56:11+00 |
1 |
2022-02-16 02:30:53+00 |
10036 |
2022-07-31 17:47:20+00 |
41 |
491 |
2022-08-03 21:53:20+00 |
1 |
2022-02-16 02:30:53+00 |
15802 |
2022-08-23 13:26:51+00 |
41 |
158 |
2022-08-29 15:28:51+00 |
2 |
2022-02-16 02:30:53+00 |
3880 |
2022-07-06 17:32:49+00 |
42 |
507 |
2022-07-07 19:46:49+00 |
2 |
2022-02-16 02:30:53+00 |
8837 |
2022-07-29 21:49:00+00 |
42 |
372 |
2022-08-07 20:56:00+00 |
2 |
2022-02-16 02:30:53+00 |
15639 |
2022-08-23 07:03:25+00 |
42 |
214 |
2022-08-24 09:21:25+00 |
2 |
2022-02-16 02:30:53+00 |
7835 |
2022-07-28 07:49:39+00 |
43 |
547 |
2022-08-02 06:16:39+00 |
2 |
2022-02-16 02:30:53+00 |
15776 |
2022-08-23 12:26:01+00 |
43 |
514 |
2022-08-29 17:17:01+00 |
1 |
2022-02-16 02:30:53+00 |
9442 |
2022-07-30 20:44:31+00 |
44 |
75 |
2022-08-04 00:29:31+00 |
1 |
2022-02-16 02:30:53+00 |
12096 |
2022-08-17 21:32:50+00 |
44 |
463 |
2022-08-25 02:33:50+00 |
1 |
2022-02-16 02:30:53+00 |
7288 |
2022-07-27 11:24:59+00 |
45 |
226 |
2022-08-02 14:52:59+00 |
2 |
2022-02-16 02:30:53+00 |
12027 |
2022-08-17 19:01:12+00 |
45 |
285 |
2022-08-26 20:08:12+00 |
2 |
2022-02-16 02:30:53+00 |
In this case, we can simply set the filter
option for the film
table to film_id = 9
, which will eventually filter the film
table itself and all related tables inventory
including transitively related rental
.
Example of configuration:
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.film"
filter: film_id = 9
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
As a result, we will transform data about films, inventories and rentals for only one film "Alabama Devil". And we will get this result in the target database:
SELECT * FROM public.film
film_id | title | description | release_year | language_id | original_language_id | rental_duration | rental_rate | length | replacement_cost | rating | last_update | special_features | fulltext |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 |
ALABAMA DEVIL |
A Thoughtful Panorama of a Database Administrator And a Mad Scientist who must Outgun a Mad Scientist in A Jet Boat |
2006 |
1 |
3 |
2.99 |
114 |
21.99 |
PG-13 |
2022-09-10 19:46:03.905795+00 |
{Trailers,Deleted Scenes} |
'administr':9 'alabama':1 'boat':23 'databas':8 'devil':2 'jet':22 'mad':12,18 'must':15 'outgun':16 'panorama':5 'scientist':13,19 'thought':4 |
Filtering by date
There is another popular case during transformation – filtering data in source tables by date. For example, we only want to transform rentals registered before 2022.07.01. And we can set the appropriate filter
option for the rental
table in the TDK config file:
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.rental"
filter: rental_date < '2022-07-01'
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
Information about rentals in the source database:
SELECT min(rental_date), max(rental_date), count(*)
FROM public.rental;
min | max | count |
---|---|---|
2022-02-14 15:16:03+00 |
2022-08-23 21:50:12+00 |
16044 |
Information about rentals after transformation in the target database:
min | max | count |
---|---|---|
2022-02-14 17:16:03+00 |
2022-06-22 00:48:59+00 |
3649 |
Global filtering
In some cases we need to apply filtering globally, to several tables in a database schema at once or to an entire database. And TDK deliberately doesn’t support any kind of global filtering, to avoid the problems associated with the ambiguity of applying the filter to different tables. In other words, we have no a special global filter "hanging in the air" (for example, in the default_config
section) that automagically applies to some tables and doesn’t apply to others. However, we support standard YAML constructions like anchors and aliases in the configuration file, that helped define same filtering option for several tables at once.
For example, we have a last_update
column in all the tables, and we want to filter all data in desired tables that was updated before January 1st, 2022. We can define a filter for any one table, mark them as YAML anchor (&global_last_update_filter
), and bind that value to other tables (using aliases *https://xxx[global_last_update_filter]
):
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.film"
filter: &global_last_update_filter "last_update < '2022-01-01'"
- table_name_with_schema: "public.film_actor"
filter: *global_last_update_filter
- table_name_with_schema: "public.film_category"
filter: *global_last_update_filter
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
Filtering and subsetting
Data filtering can also work in conjunction with subsetting. Imagine that we want to transform only half of the rentals of "Alabama Devil" film. To do this, it is enough to set the filter
and target_ratio
options for the rentals
table:
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.rental"
filter: inventory_id between 41 and 45
target_ratio: 0.5
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED
In source database we have information about rentals of various films, including 12 rentals of "Alabama devil":
SELECT count(*)
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
WHERE film.title = 'ALABAMA DEVIL';
count |
12 |
But in the target database after transformation, we have only half of the rentals of this film and there are no orders for other films:
count |
12 |