[experimental] Incremental Masking
This feature is experimental and may change in future releases.
Incremental masking is a feature that allows you to apply masking rules to data in stages, rather than all at once. This is particularly useful when dealing with large tables that need to be kept up-to-date without having to reprocess the entire dataset.
If you want to jump right in, follow the Incremental Masking Quickstart guide.
Configuring Incremental Masking
To configure incremental masking, set the `checkpoint` with `mode: incremental` in the configuration for each table you want to mask incrementally:
```yaml
checkpoint:
  mode: incremental
  column_name: <column_name>
  source:
    type: last_value
```
TDK filters rows where `<column_name>` is greater than the last processed value and updates that value for subsequent runs. Use a monotonic, indexed, non-nullable column, e.g., a `created_at` timestamp or an auto-incrementing `id`.
Using non-monotonic columns such as `updated_at` can lead to performance overhead or duplication. Using a nullable column will skip rows with `NULL` values, potentially causing data loss. Using a non-indexed column can lead to performance issues.
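For instance, a checkpoint on an indexed, auto-incrementing `id` column might be sketched as follows (the table name `public.orders` is purely illustrative):

```yaml
tables:
  - table_name_with_schema: "public.orders"   # hypothetical table
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: id        # monotonic, indexed, NOT NULL
      source:
        type: last_value
```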
The following table summarizes the available checkpoint modes, their behaviors, and constraints:
Mode | Description | Constraints |
---|---|---|
`incremental` | Processes only new rows based on the checkpoint column | Requires a monotonically increasing column (timestamp, auto-incrementing ID) |
`refresh` | Reprocesses the entire table in each run | Default mode; no special constraints |
`ignore_table` | Skips processing the table entirely after the first run | No special constraints |
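As a sketch, the three modes can be combined in a single configuration (the table names `public.events` and `public.lookup` are hypothetical):

```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table       # skip unchanged tables after the first run
tables:
  - table_name_with_schema: "public.events"   # hypothetical
    mode: MASKING
    checkpoint:
      mode: incremental      # process only new rows
      column_name: created_at
      source:
        type: last_value
  - table_name_with_schema: "public.lookup"   # hypothetical
    mode: MASKING
    checkpoint:
      mode: refresh          # reprocess fully on each run
```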
Additionally, `table_truncation_mode: IGNORE` must be set in the configuration file to ensure that the table is not truncated before each run. During incremental runs, avoid the `CREATE` / `DROP_AND_CREATE` schema creation modes to prevent data loss. For the first run, you can set `schema_creation_mode: CREATE_IF_NOT_EXISTS`, which can be safely changed to `schema_creation_mode: DO_NOT_CREATE` for subsequent runs.
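Putting those settings together, a first run might use the following top-level values, with `schema_creation_mode` switched to `DO_NOT_CREATE` once the target schema exists:

```yaml
# First run only: create the schema if it is missing, never truncate
table_truncation_mode: IGNORE
schema_creation_mode: CREATE_IF_NOT_EXISTS
```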
You can also add a `checkpoint` section under `default_config` to apply it to all tables without a table-specific checkpoint configuration. We highly recommend setting `checkpoint` with `mode: ignore_table` in the `default_config` section so that infrequently updated tables are not reprocessed on every run: all tables are transformed during the first run, while tables that do not override the `checkpoint` section are ignored on subsequent runs. The `checkpoint` with `mode: refresh` option is the default and causes TDK to reprocess the entire table in each run, which may be inefficient for large datasets; use it only if necessary.
Example Configuration
In the example below, we will mask a pair of tables called `customer` and `payment`, where the `payment` table has a foreign key reference to the `customer` table and a `payment_date` column that will be used as the checkpoint for incremental masking.
payment_id | customer_id | amount | payment_date |
---|---|---|---|
16053 | 269 | 0.99 | 2025-07-02 07:29:53.301 |
16054 | 269 | 4.99 | 2025-07-02 23:51:40.813 |
16055 | 269 | 2.99 | 2025-07-20 19:54:02.174 |
16056 | 270 | 1.99 | 2025-07-23 05:49:30.663 |
16057 | 270 | 4.99 | 2025-07-24 13:27:17.752 |
With the configuration below, in the first run, TDK will process both the `payment` and `customer` tables, remembering the last processed value in the `payment_date` column.
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
For a subsequent run, TDK will only process the rows in the `payment` table that have `payment_date` greater than the maximum `payment_date` processed in the previous run. TDK will ignore the `customer` table and all other tables without a specified `checkpoint` section. This allows us to keep the masked data up-to-date without having to reprocess all the data.
Suppose we run the transformation once again, after adding three new rows to the `payment` table and one new row with `customer_id` = `281` to the `customer` table.
payment_id | customer_id | amount | payment_date |
---|---|---|---|
16053 | 269 | 0.99 | 2025-07-02 07:29:53.301 |
16054 | 269 | 4.99 | 2025-07-02 23:51:40.813 |
16055 | 269 | 2.99 | 2025-07-20 19:54:02.174 |
16056 | 270 | 1.99 | 2025-07-23 05:49:30.663 |
16057 | 270 | 4.99 | 2025-07-24 13:27:17.752 |
16058 | 262 | 8.99 | 2025-08-12 10:19:47.019 |
16059 | 253 | 0.99 | 2025-08-19 17:32:36.628 |
16060 | 281 | 0.99 | 2025-08-20 02:45:42.908 |
In this case, TDK will only process the two new rows with `payment_id` `16058` and `16059`, as they have `payment_date` values greater than the maximum `payment_date` processed in the previous run. The `customer` table will be ignored because its `checkpoint.mode` is set to `ignore_table`. The row with `payment_id` `16060` will not be processed because it references the row in the `customer` table with `customer_id` `281`, which was ignored during this run.
What if I want to update other tables?
If other tables are updated, you can still use incremental masking. With the previous configuration, TDK will only process the rows in the `payment` table that have a `payment_date` greater than the maximum `payment_date` processed in the previous run, and it will ignore the `customer` table as specified by the `checkpoint.mode: ignore_table` option. As a result, any new rows in the `payment` table that reference new rows in the `customer` table are skipped.
To address this, you can set the checkpoint to `mode: incremental` for the `customer` table, similar to the `payment` table:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
  - table_name_with_schema: "public.customer"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: id
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
In this case, TDK will track both the maximum `payment_date` in the `payment` table and the maximum `id` in the `customer` table. If new rows are added to the `customer` table, TDK will process them in the next run, ensuring that all related data is masked correctly.
If the `customer` table does not have a monotonically increasing column, you can use the `checkpoint.mode: refresh` option for the `customer` table, which causes TDK to reprocess the entire table in each run while keeping incremental processing for the `payment` table. This is useful if the `customer` table is updated frequently and you want to ensure that all data is masked correctly.
The `checkpoint.mode: refresh` option will cause TDK to reprocess the entire table in each run, which may not be efficient for large datasets. Use it only if necessary.
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
  - table_name_with_schema: "public.customer"
    mode: MASKING
    target_ratio: 1.0
    checkpoint:
      mode: refresh
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
What if I want to start from scratch?
If you want to start from scratch and reprocess all the data in the tables, set the `DO_NOT_CONTINUE_FROM_CHECKPOINT` flag together with either `table_truncation_mode: TRUNCATE` or `schema_creation_mode: DROP_AND_CREATE` in the configuration file. This will cause TDK to ignore the checkpoint, discard the existing data in the output tables, and remember the maximum values in the checkpoint columns.
First, execute a clean TDK run using the following settings:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DROP_AND_CREATE
flags:
  - DO_NOT_CONTINUE_FROM_CHECKPOINT
```
To continue with the incremental runs, remove the `DO_NOT_CONTINUE_FROM_CHECKPOINT` flag and use the same configuration as before:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
What if I want to manage the checkpoint values myself?
If you want more control over the checkpoint values, you can use the `constant` source type in the checkpoint configuration. This allows you to specify a fixed value, `start_from`, as the starting point for incremental processing. In this case, TDK will not remember the last processed value, and you will need to update the value manually for each run.
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: constant
        start_from: "2025-08-08T00:00"
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
In this example, TDK will process rows in the `payment` table where `payment_date` is greater than `2025-08-08 00:00:00`. You will need to update the `start_from` field in the configuration file before each run to ensure that only new rows are processed.
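For example, before the next run you would manually advance `start_from` past the last timestamp already processed (the value below is illustrative):

```yaml
checkpoint:
  mode: incremental
  column_name: payment_date
  source:
    type: constant
    start_from: "2025-08-21T00:00"   # updated by hand after each run
```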
On subsetting
It is possible to combine incrementality with subsetting. The most straightforward way is to use the `filter` field in the table configuration to limit the rows processed.
If you want to maintain a subset of the data while using incremental masking, you can set up a filter that works in conjunction with the incrementality. For example, you can filter rows based on a date range and use the checkpoint to process only new rows within that range.
The following configuration will maintain a subset of the `payment` table where `payment_date` is greater than `2024-01-01` and process only new rows added since the last run:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    filter: payment_date >= timestamp '2024-01-01T00:00'
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
It is also possible to use subsetting with `target_ratio` in combination with incremental masking. However, this approach is more complex and might not be efficient for large datasets.
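A sketch of such a combination, assuming an illustrative `target_ratio` value of `0.5`:

```yaml
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    target_ratio: 0.5          # keep roughly half of the rows (illustrative)
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
```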