Data Filtering
By default, TDK reads entire tables (all rows) from a source DB. But we can add a custom data filter for specific tables.
For data filter we can use any valid SQL-constructs (including DBMS-specific ones), such as logical operators, functions, subqueries and so on (any SQL expression that can be used as the search condition in WHERE
clause).
Let’s treat real business cases where data filtering might be useful, using the Northwind sample database.
For simplicity, we will focus on three related tables - Shippers
→ Orders
→ Order Details
:

Filtering by organization ID
There are a lot of cases where we want to hide some data belonging to the specific organizations during the transformation.
Imagine that we want to hide all information about shippers and their orders, except for the "Federal Shipping" company (with ID=3).
The source DB contains information about 3 shippers (Shippers
table), their orders (Orders
table) and details (Order Details
table).
Detailed statistics on shippers and orders are presented in the table below.
ShipperID |
CompanyName |
orders_count |
order_details_count |
1 |
Speedy Express |
249 |
646 |
2 |
United Package |
326 |
864 |
3 |
Federal Shipping |
255 |
645 |
Corresponding SQL query
SELECT
ShipperID,
CompanyName,
(SELECT COUNT(1)
FROM Orders o
WHERE o.ShipVia = s.ShipperID) AS orders_count,
(SELECT COUNT(1)
FROM Orders o
LEFT JOIN [Order Details] d ON o.OrderID = d.OrderID
WHERE o.ShipVia = s.ShipperID) AS order_details_count
FROM Shippers s
In this case, we can simply set the filter
option for the Shippers
table to ShipperID = 3
, which will eventually filter the Shippers
table itself and all related tables (Orders
) including transitively related (Order Details
).
Example of configuration:
default_config:
mode: "KEEP"
tables:
- table_name_with_schema: "dbo.Shippers"
filter: ShipperID = 3
table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"
As a result, we will transform data about shippers and orders for only one company - "Federal Shipping". And we will get this result in the target DB:
ShipperID |
CompanyName |
orders_count |
order_details_count |
3 |
Federal Shipping |
255 |
645 |
Filtering by date
There is another popular case during transformation - filtering data in source tables by date. For example - we only want to transform orders registered before 1997.08.08. And we can set the appropriate filter
option for the Orders
table in the TDK config file:
default_config:
mode: "KEEP"
tables:
- table_name_with_schema: "dbo.Orders"
filter: OrderDate < CONVERT(datetime, '1997-08-08')
table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"
Information about orders in the source DB:
min_order_date | max_order_date | orders_count | order_details_count |
---|---|---|---|
1996-07-04 |
1998-05-06 |
830 |
2155 |
Information about orders after transformation in the target DB:
min_order_date | max_order_date | orders_count | order_details_count |
---|---|---|---|
1996-07-04 |
1997-08-07 |
377 |
997 |
Corresponding SQL query
SELECT
MIN(o.OrderDate) as min_order_date,
MAX(o.OrderDate) as max_order_date,
COUNT(1) as orders_count,
(
SELECT
COUNT(1)
FROM
[Order Details]) as order_details_count
FROM
orders o
Global filtering
In some cases we need to apply filtering globally, to several tables in a database schema at once or to an entire database. And TDK deliberately doesn’t support any kind of global filtering, to avoid the problems associated with the ambiguity of applying the filter to different tables. In other words, we have no a special global filter "hanging in the air" (for example, in the default_config
section) that automagically applies to some tables and doesn’t apply to others. However, we support standard YAML constructions like anchors and aliases in the configuration file, that helped define same filtering option for several tables at once.
For example, we have a region
column in several tables (Suppliers
, Employees
, Customers
), and we want to filter all data in these tables by WA
region. We can define a filter for any one table, mark them as YAML anchor (&global_region_filter
), and bind that value to other tables (using aliases *global_region_filter
):
default_config:
mode: "KEEP"
tables:
- table_name_with_schema: "dbo.Suppliers"
filter: &global_region_filter "region = 'WA'"
- table_name_with_schema: "dbo.Employees"
filter: *global_region_filter
- table_name_with_schema: "dbo.Customers"
filter: *global_region_filter
table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"
Filtering and subsetting
Data filtering can also work in conjunction with subsetting. Imagine that we want to transform only half of the orders from the USA. To do this, it is enough to set the filter
and target_ratio
options for the Orders
table:
default_config:
mode: "KEEP"
tables:
- table_name_with_schema: "dbo.Orders"
filter: ShipCountry = 'USA'
target_ratio: 0.5
table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"
In source DB we have information about orders from different countries (including 122 orders from the USA):
ShipCountry |
orders_count |
order_details_count |
Other countries |
708 |
1803 |
USA |
122 |
352 |
But in the target database after transformation, we have only half of the orders from the USA (and the corresponding rows from the Order Details
table) and there are no orders from other countries:
ShipCountry |
orders_count |
order_details_count |
USA |
61 |
193 |
Corresponding SQL query
SELECT
IIF(ShipCountry = 'USA', ShipCountry, 'Other countries') as ShipCountry,
COUNT(DISTINCT o.OrderID) AS orders_count,
COUNT(1) AS order_details_count
FROM
[Order Details] d
JOIN Orders o ON
o.OrderID = d.OrderID
GROUP BY
IIF(ShipCountry = 'USA', ShipCountry, 'Other countries')