[experimental] Incremental Masking
This feature is experimental and may change in future releases.
Incremental masking is a feature that allows you to apply masking rules to data in stages, rather than all at once. This is particularly useful when dealing with large tables that need to be kept up-to-date without having to reprocess the entire dataset.
If you want to jump right in, follow the Incremental Masking Quickstart guide.
Configuring Incremental Masking
To configure incremental masking, set the `checkpoint` with `mode: incremental` in the configuration for each table you want to mask incrementally:
```yaml
checkpoint:
  mode: incremental
  column_name: <column_name>
  source:
    type: last_value
```
TDK filters rows where `<column_name>` is greater than the last processed value and updates that value for subsequent runs. Use a monotonic, indexed, non-nullable column, e.g., a `created_at` timestamp or an auto-incrementing `id`.
Using non-monotonic columns such as `updated_at` can lead to performance overhead or duplication. Using a nullable column will skip rows with `NULL` values, potentially causing data loss. Using a non-indexed column can lead to performance issues.
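For instance, a checkpoint on an indexed, auto-incrementing `id` column might be sketched as follows (the table name `public.orders` is purely illustrative):

```yaml
tables:
  - table_name_with_schema: "public.orders"   # hypothetical table
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: id        # monotonic, indexed, NOT NULL
      source:
        type: last_value
```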
The following table summarizes the available checkpoint modes, their behaviors, and constraints:
Mode | Description | Constraints |
---|---|---|
`incremental` | Processes only new rows based on the checkpoint column | Requires a monotonically increasing column (timestamp, auto-incrementing ID) |
`refresh` | Reprocesses the entire table in each run | Default mode; no special constraints |
`ignore_table` | Skips processing the table entirely after the first run | No special constraints |
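As a sketch, the three modes can be combined in a single configuration (the table names `public.events` and `public.lookup` are hypothetical):

```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table       # skip unchanged tables after the first run
tables:
  - table_name_with_schema: "public.events"   # hypothetical
    mode: MASKING
    checkpoint:
      mode: incremental      # process only new rows
      column_name: created_at
      source:
        type: last_value
  - table_name_with_schema: "public.lookup"   # hypothetical
    mode: MASKING
    checkpoint:
      mode: refresh          # reprocess fully on each run
```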
Additionally, `table_truncation_mode: IGNORE` must be set in the configuration file to ensure that the table is not truncated before each run. During incremental runs, avoid the `CREATE` / `DROP_AND_CREATE` schema creation modes to prevent data loss. For the first run, you can set `schema_creation_mode: CREATE_IF_NOT_EXISTS`, which can be safely changed to `schema_creation_mode: DO_NOT_CREATE` for subsequent runs.
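Putting those settings together, a first run might use the following top-level values, with `schema_creation_mode` switched to `DO_NOT_CREATE` once the target schema exists:

```yaml
# First run only: create the schema if it is missing, never truncate
table_truncation_mode: IGNORE
schema_creation_mode: CREATE_IF_NOT_EXISTS
```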
You can also add a `checkpoint` section under `default_config` to apply it to all tables without a table-specific checkpoint configuration. We highly recommend setting `checkpoint` with `mode: ignore_table` in the `default_config` section so that infrequently updated tables are not reprocessed on every run: all tables are transformed during the first run, while tables that do not override the `checkpoint` section are ignored on subsequent runs. The `checkpoint` with `mode: refresh` option is the default and causes TDK to reprocess the entire table in each run, which may be inefficient for large datasets; use it only if necessary.
Example Configuration
In the example below, we will mask a pair of tables called `customer` and `payment`, where the `payment` table has a foreign key reference to the `customer` table and a `payment_date` column that will be used as the checkpoint for incremental masking.
payment_id | customer_id | amount | payment_date |
---|---|---|---|
16053 | 269 | 0.99 | 2025-07-02 07:29:53.301 |
16054 | 269 | 4.99 | 2025-07-02 23:51:40.813 |
16055 | 269 | 2.99 | 2025-07-20 19:54:02.174 |
16056 | 270 | 1.99 | 2025-07-23 05:49:30.663 |
16057 | 270 | 4.99 | 2025-07-24 13:27:17.752 |
With the configuration below, in the first run, TDK will process both the `payment` and `customer` tables, remembering the last processed value in the `payment_date` column.
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
For a subsequent run, TDK will only process the rows in the `payment` table that have `payment_date` greater than the maximum `payment_date` processed in the previous run. TDK will ignore the `customer` table and all other tables without a specified `checkpoint` section. This allows us to keep the masked data up-to-date without having to reprocess all the data.
Suppose we run the transformation once again, after adding three new rows to the `payment` table and one new row with `customer_id` = `281` to the `customer` table.
payment_id | customer_id | amount | payment_date |
---|---|---|---|
16053 | 269 | 0.99 | 2025-07-02 07:29:53.301 |
16054 | 269 | 4.99 | 2025-07-02 23:51:40.813 |
16055 | 269 | 2.99 | 2025-07-20 19:54:02.174 |
16056 | 270 | 1.99 | 2025-07-23 05:49:30.663 |
16057 | 270 | 4.99 | 2025-07-24 13:27:17.752 |
16058 | 262 | 8.99 | 2025-08-12 10:19:47.019 |
16059 | 253 | 0.99 | 2025-08-19 17:32:36.628 |
16060 | 281 | 0.99 | 2025-08-20 02:45:42.908 |
In this case, TDK will only process the two new rows with `payment_id` `16058` and `16059`, as they have `payment_date` values greater than the maximum `payment_date` processed in the previous run. The `customer` table will be ignored because its `checkpoint.mode` is set to `ignore_table`. The row with `payment_id` `16060` will not be processed because it references the row in the `customer` table with `customer_id` `281`, which was ignored during this run.
What if I want to update other tables?
If other tables are updated, you can still use incremental masking. With the previous configuration, TDK will only process the rows in the `payment` table that have a `payment_date` greater than the maximum `payment_date` processed in the previous run, and it will ignore the `customer` table as specified by the `checkpoint.mode: ignore_table` option. As a result, any new rows in the `payment` table that reference new rows in the `customer` table are skipped.
To address this, you can set the checkpoint to `mode: incremental` for the `customer` table, similar to the `payment` table:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
  - table_name_with_schema: "public.customer"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: id
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
In this case, TDK will track both the maximum `payment_date` in the `payment` table and the maximum `id` in the `customer` table. If new rows are added to the `customer` table, TDK will process them in the next run, ensuring that all related data is masked correctly.
If the `customer` table does not have a monotonically increasing column, you can use the `checkpoint.mode: refresh` option for the `customer` table, which causes TDK to reprocess the entire table in each run while keeping incremental processing for the `payment` table. This is useful if the `customer` table is updated frequently and you want to ensure that all data is masked correctly.
The `checkpoint.mode: refresh` option will cause TDK to reprocess the entire table in each run, which may not be efficient for large datasets. Use it only if necessary.
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
  - table_name_with_schema: "public.customer"
    mode: MASKING
    target_ratio: 1.0
    checkpoint:
      mode: refresh
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
What if I want to start from scratch?
If you want to start from scratch and reprocess all the data in the tables, set the `DO_NOT_CONTINUE_FROM_CHECKPOINT` flag together with either `table_truncation_mode: TRUNCATE` or `schema_creation_mode: DROP_AND_CREATE` in the configuration file. This will cause TDK to ignore the checkpoint, discard the existing data in the output tables, and remember the maximum values in the checkpoint columns.
First, execute a clean TDK run using the following settings:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DROP_AND_CREATE
flags:
  - DO_NOT_CONTINUE_FROM_CHECKPOINT
```
To continue with the incremental runs, remove the `DO_NOT_CONTINUE_FROM_CHECKPOINT` flag and use the same configuration as before:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
What if I want to manage the checkpoint values myself?
If you want more control over the checkpoint values, you can use the `constant` source type in the checkpoint configuration. This allows you to specify a fixed value, `start_from`, as the starting point for incremental processing. In this case, TDK will not remember the last processed value, and you will need to update the value manually for each run.
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: constant
        start_from: "2025-08-08T00:00"
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
In this example, TDK will process rows in the `payment` table where `payment_date` is greater than `2025-08-08 00:00:00`. You will need to update the `start_from` field in the configuration file before each run to ensure that only new rows are processed.
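For example, before the next run you would manually advance `start_from` past the last timestamp already processed (the value below is illustrative):

```yaml
checkpoint:
  mode: incremental
  column_name: payment_date
  source:
    type: constant
    start_from: "2025-08-21T00:00"   # updated by hand after each run
```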
On subsetting
It is possible to combine incrementality with subsetting. The most straightforward way is to use the `filter` field in the table configuration to limit the rows processed.
If you want to maintain a subset of the data while using incremental masking, you can set up a filter that works in conjunction with the incrementality. For example, you can filter rows based on a date range and use the checkpoint to process only new rows within that range.
The following configuration will maintain a subset of the `payment` table where `payment_date` is greater than `2024-01-01` and process only new rows added since the last run:
```yaml
default_config:
  mode: MASKING
  checkpoint:
    mode: ignore_table
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    filter: payment_date >= timestamp '2024-01-01T00:00'
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
table_truncation_mode: IGNORE
schema_creation_mode: DO_NOT_CREATE
```
It is also possible to use subsetting with `target_ratio` in combination with incremental masking. However, this approach is more complex and might not be efficient for large datasets.
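A sketch of such a combination, assuming an illustrative `target_ratio` value of `0.5`:

```yaml
tables:
  - table_name_with_schema: "public.payment"
    mode: MASKING
    target_ratio: 0.5          # keep roughly half of the rows (illustrative)
    checkpoint:
      mode: incremental
      column_name: payment_date
      source:
        type: last_value
```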