Data Filtering

Software prerequisites

To run the following examples on your machine, you should first:

git clone https://github.com/synthesized-io/tdk-tutorial-databases
  • Start PostgreSQL instances and pgAdmin using the following commands:

cd tdk-tutorial-databases/
docker-compose run databases

pgAdmin will be available at http://localhost:8888 with the password postgres.

In order to run TDK with the configurations provided in this tutorial, save them to config.yaml file and execute the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config_masking.tdk.yaml

By default, TDK reads entire tables from a source database. However, we can add a custom data filter for specific tables. For the data filter, we can use any valid SQL constructs, including DBMS-specific ones, such as logical operators, functions, subqueries, and so on. Any SQL expression that can be used as the search condition in a WHERE clause.

Let’s treat real business cases where data filtering might be useful, using the Pagila sample database. For simplicity, we will focus on three related tables rentalinventoryfilm:

Diagram

Filtering by film ID

There are many cases where we may want to hide data belonging to specific entities during transformations. For example, we may want to hide all information about inventories and rentals, except for the "Alabama Devil" film film_id = 9. The source database contains information about the inventory and rental. Only a small fraction of the thousands of records in these tables relate to the "Alabama Devil" film:

SELECT * FROM public.inventory WHERE film_id = 9
inventory_id film_id store_id last_update

41

9

1

2022-02-15 10:09:17+00

42

9

1

2022-02-15 10:09:17+00

43

9

1

2022-02-15 10:09:17+00

44

9

2

2022-02-15 10:09:17+00

45

9

2

2022-02-15 10:09:17+00

SELECT * FROM public.rental
         WHERE inventory_id in (
            SELECT inventory_id FROM public.inventory WHERE film_id = 9
         )
rental_id rental_date inventory_id customer_id return_date staff_id last_update

5745

2022-07-10 11:10:11+00

41

305

2022-07-19 05:56:11+00

1

2022-02-16 02:30:53+00

10036

2022-07-31 17:47:20+00

41

491

2022-08-03 21:53:20+00

1

2022-02-16 02:30:53+00

15802

2022-08-23 13:26:51+00

41

158

2022-08-29 15:28:51+00

2

2022-02-16 02:30:53+00

3880

2022-07-06 17:32:49+00

42

507

2022-07-07 19:46:49+00

2

2022-02-16 02:30:53+00

8837

2022-07-29 21:49:00+00

42

372

2022-08-07 20:56:00+00

2

2022-02-16 02:30:53+00

15639

2022-08-23 07:03:25+00

42

214

2022-08-24 09:21:25+00

2

2022-02-16 02:30:53+00

7835

2022-07-28 07:49:39+00

43

547

2022-08-02 06:16:39+00

2

2022-02-16 02:30:53+00

15776

2022-08-23 12:26:01+00

43

514

2022-08-29 17:17:01+00

1

2022-02-16 02:30:53+00

9442

2022-07-30 20:44:31+00

44

75

2022-08-04 00:29:31+00

1

2022-02-16 02:30:53+00

12096

2022-08-17 21:32:50+00

44

463

2022-08-25 02:33:50+00

1

2022-02-16 02:30:53+00

7288

2022-07-27 11:24:59+00

45

226

2022-08-02 14:52:59+00

2

2022-02-16 02:30:53+00

12027

2022-08-17 19:01:12+00

45

285

2022-08-26 20:08:12+00

2

2022-02-16 02:30:53+00

In this case, we can simply set the filter option for the film table to film_id = 9, which will eventually filter the film table itself and all related tables inventory including transitively related rental.

Example of configuration:

default_config:
  mode: KEEP

tables:
  - table_name_with_schema: "public.film"
    filter: film_id = 9

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED

As a result, we will transform data about films, inventories and rentals for only one film "Alabama Devil". And we will get this result in the target database:

SELECT * FROM public.film
film_id title description release_year language_id original_language_id rental_duration rental_rate length replacement_cost rating last_update special_features fulltext

9

ALABAMA DEVIL

A Thoughtful Panorama of a Database Administrator And a Mad Scientist who must Outgun a Mad Scientist in A Jet Boat

2006

1

3

2.99

114

21.99

PG-13

2022-09-10 19:46:03.905795+00

{Trailers,Deleted Scenes}

'administr':9 'alabama':1 'boat':23 'databas':8 'devil':2 'jet':22 'mad':12,18 'must':15 'outgun':16 'panorama':5 'scientist':13,19 'thought':4

Filtering by date

There is another popular case during transformation – filtering data in source tables by date. For example, we only want to transform rentals registered before 2022.07.01. And we can set the appropriate filter option for the rental table in the TDK config file:

default_config:
  mode: KEEP

tables:
  - table_name_with_schema: "public.rental"
    filter: rental_date < '2022-07-01'

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED

Information about rentals in the source database:

SELECT min(rental_date), max(rental_date), count(*)
	FROM public.rental;
min max count

2022-02-14 15:16:03+00

2022-08-23 21:50:12+00

16044

Information about rentals after transformation in the target database:

min max count

2022-02-14 17:16:03+00

2022-06-22 00:48:59+00

3649

Global filtering

In some cases we need to apply filtering globally, to several tables in a database schema at once or to an entire database. And TDK deliberately doesn’t support any kind of global filtering, to avoid the problems associated with the ambiguity of applying the filter to different tables. In other words, we have no a special global filter "hanging in the air" (for example, in the default_config section) that automagically applies to some tables and doesn’t apply to others. However, we support standard YAML constructions like anchors and aliases in the configuration file, that helped define same filtering option for several tables at once.

For example, we have a last_update column in all the tables, and we want to filter all data in desired tables that was updated before January 1st, 2022. We can define a filter for any one table, mark them as YAML anchor (&global_last_update_filter), and bind that value to other tables (using aliases *https://xxx[global_last_update_filter]):

default_config:
  mode: KEEP

tables:
  - table_name_with_schema: "public.film"
    filter: &global_last_update_filter "last_update < '2022-01-01'"
  - table_name_with_schema: "public.film_actor"
    filter: *global_last_update_filter
  - table_name_with_schema: "public.film_category"
    filter: *global_last_update_filter

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED

Filtering and subsetting

Data filtering can also work in conjunction with subsetting. Imagine that we want to transform only half of the rentals of "Alabama Devil" film. To do this, it is enough to set the filter and target_ratio options for the rentals table:

default_config:
  mode: KEEP

tables:

  - table_name_with_schema: "public.rental"
    filter: inventory_id between 41 and 45
    target_ratio: 0.5

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
cycle_resolution_strategy: DELETE_NOT_REQUIRED

In source database we have information about rentals of various films, including 12 rentals of "Alabama devil":

SELECT count(*)
	FROM rental
	INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
	INNER JOIN film ON inventory.film_id = film.film_id
	WHERE film.title = 'ALABAMA DEVIL';

count

12

But in the target database after transformation, we have only half of the rentals of this film and there are no orders for other films:

count

12