Data Filtering

By default, TDK reads entire tables (all rows) from a source DB. But we can add a custom data filter for specific tables. For data filter we can use any valid SQL-constructs (including DBMS-specific ones), such as logical operators, functions, subqueries and so on (any SQL expression that can be used as the search condition in WHERE clause).

Let’s treat real business cases where data filtering might be useful, using the Northwind sample database.

For simplicity, we will focus on three related tables - ShippersOrdersOrder Details:

northwind filtering

Filtering by organization ID

There are a lot of cases where we want to hide some data belonging to the specific organizations during the transformation.

Imagine that we want to hide all information about shippers and their orders, except for the "Federal Shipping" company (with ID=3). The source DB contains information about 3 shippers (Shippers table), their orders (Orders table) and details (Order Details table).

Detailed statistics on shippers and orders are presented in the table below.

ShipperID

CompanyName

orders_count

order_details_count

1

Speedy Express

249

646

2

United Package

326

864

3

Federal Shipping

255

645

Corresponding SQL query
SELECT
    ShipperID,
    CompanyName,
    (SELECT COUNT(1)
     FROM Orders o
     WHERE o.ShipVia = s.ShipperID) AS orders_count,
    (SELECT COUNT(1)
     FROM Orders o
     LEFT JOIN [Order Details] d ON o.OrderID = d.OrderID
    WHERE o.ShipVia = s.ShipperID) AS order_details_count
FROM Shippers s

In this case, we can simply set the filter option for the Shippers table to ShipperID = 3, which will eventually filter the Shippers table itself and all related tables (Orders) including transitively related (Order Details).

Example of configuration:

default_config:
  mode: "KEEP"

tables:
  - table_name_with_schema: "dbo.Shippers"
    filter: ShipperID = 3

table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"

As a result, we will transform data about shippers and orders for only one company - "Federal Shipping". And we will get this result in the target DB:

ShipperID

CompanyName

orders_count

order_details_count

3

Federal Shipping

255

645

Filtering by date

There is another popular case during transformation - filtering data in source tables by date. For example - we only want to transform orders registered before 1997.08.08. And we can set the appropriate filter option for the Orders table in the TDK config file:

default_config:
  mode: "KEEP"

tables:
  - table_name_with_schema: "dbo.Orders"
    filter: OrderDate < CONVERT(datetime, '1997-08-08')

table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"

Information about orders in the source DB:

min_order_date max_order_date orders_count order_details_count

1996-07-04

1998-05-06

830

2155

Information about orders after transformation in the target DB:

min_order_date max_order_date orders_count order_details_count

1996-07-04

1997-08-07

377

997

Corresponding SQL query
SELECT
    MIN(o.OrderDate) as min_order_date,
    MAX(o.OrderDate) as max_order_date,
    COUNT(1) as orders_count,
    (
    SELECT
        COUNT(1)
    FROM
        [Order Details]) as order_details_count
FROM
    orders o

Global filtering

In some cases we need to apply filtering globally, to several tables in a database schema at once or to an entire database. And TDK deliberately doesn’t support any kind of global filtering, to avoid the problems associated with the ambiguity of applying the filter to different tables. In other words, we have no a special global filter "hanging in the air" (for example, in the default_config section) that automagically applies to some tables and doesn’t apply to others. However, we support standard YAML constructions like anchors and aliases in the configuration file, that helped define same filtering option for several tables at once.

For example, we have a region column in several tables (Suppliers, Employees, Customers), and we want to filter all data in these tables by WA region. We can define a filter for any one table, mark them as YAML anchor (&global_region_filter), and bind that value to other tables (using aliases *global_region_filter):

default_config:
  mode: "KEEP"

tables:
  - table_name_with_schema: "dbo.Suppliers"
    filter: &global_region_filter "region = 'WA'"
  - table_name_with_schema: "dbo.Employees"
    filter: *global_region_filter
  - table_name_with_schema: "dbo.Customers"
    filter: *global_region_filter

table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"

Filtering and subsetting

Data filtering can also work in conjunction with subsetting. Imagine that we want to transform only half of the orders from the USA. To do this, it is enough to set the filter and target_ratio options for the Orders table:

default_config:
  mode: "KEEP"

tables:
  - table_name_with_schema: "dbo.Orders"
    filter: ShipCountry = 'USA'
    target_ratio: 0.5

table_truncation_mode: "TRUNCATE"
schema_creation_mode: "CREATE_IF_NOT_EXISTS"
cycle_resolution_strategy: "DELETE_NOT_REQUIRED"

In source DB we have information about orders from different countries (including 122 orders from the USA):

ShipCountry

orders_count

order_details_count

Other countries

708

1803

USA

122

352

But in the target database after transformation, we have only half of the orders from the USA (and the corresponding rows from the Order Details table) and there are no orders from other countries:

ShipCountry

orders_count

order_details_count

USA

61

193

Corresponding SQL query
SELECT
	IIF(ShipCountry = 'USA', ShipCountry, 'Other countries') as ShipCountry,
	COUNT(DISTINCT o.OrderID) AS orders_count,
	COUNT(1) AS order_details_count
FROM
	[Order Details] d
JOIN Orders o ON
	o.OrderID = d.OrderID
GROUP BY
	IIF(ShipCountry = 'USA', ShipCountry, 'Other countries')