How do I create a YAML configuration for my database?
You can run TDK without manually defining transformations for all tables and columns by providing only a default configuration. For masking:

    default_config:
      mode: MASKING
      target_ratio: 1
      safety_mode: RELAXED

For generation:

    default_config:
      mode: GENERATION
      target_ratio: 2
      safety_mode: RELAXED

TDK automatically applies appropriate transformations based on mode, column types, and source data.
If you want more control over the transformations, you can print a default configuration by running the CLI:

    java -jar tdk.jar \
      --input-url jdbc:postgresql://host:port --input-username user \
      --output-url jdbc:postgresql://host:port --output-username user \
      dry-run --ec /path/to/create/configuration.yaml

The resulting configuration will contain transformations for all tables and columns in your schema. You can then use this configuration to run the TDK workflow.
How can I ensure the TDK process doesn’t take the source database down?
TDK does not block other reads and writes to your source database. You can also control the number of parallel connections to the source database:
1) Create an application.properties file in the same directory as the TDK jar.
2) Add the following lines:
Note: The default number is
It’s recommended to use TDK against a replica database, rather than a production instance, to minimize the risk of any impact on your source database.
How can I skip some tables from the TDK workflow?
To skip tables in the TDK generation or masking process, set the target_ratio of each such table to 0. For example:

    default_config:
      mode: GENERATION
      target_ratio: 2
    tables:
      - table_name_with_schema: "demo.productlines"
        target_ratio: 0
      - table_name_with_schema: "demo.products"
        target_ratio: 0

As a result, the productlines and products tables will be empty in the output database.
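
Since skipping applies to both generation and masking, the same per-table override can be combined with a masking default. The following is a minimal sketch that reuses only the keys shown in the examples above; the table name demo.audit_log is hypothetical and stands in for any table you want to leave out.

```yaml
default_config:
  mode: MASKING
  target_ratio: 1
  safety_mode: RELAXED
tables:
  # Hypothetical table name, used only for illustration.
  - table_name_with_schema: "demo.audit_log"
    # A ratio of 0 skips the table, so it is left empty in the output database.
    target_ratio: 0
```

Tables that are not listed keep the settings from default_config.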
How does TDK preserve referential integrity?
TDK automatically preserves relationships between tables based on foreign keys (FKs) defined in the database schema.
You can also define additional FKs in the YAML configuration file if they are missing from the database schema. For example, if order.user_id is a foreign key referencing user.id but is not defined in the database schema, the following configuration can be provided:

    default_config:
      mode: "MASKING"
      target_ratio: 0.5
    metadata:
      tables:
        - table_name_with_schema: "public.order"
          foreign_keys:
            fk_user_order:
              referred_schema: "public"
              referred_table: "user"
              columns:
                - column: "user_id"
                  referred_column: "id"

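
Because columns is a list, a foreign key spanning several columns can presumably be declared by adding one entry per column pair. The sketch below rests on that assumption, which the text above does not confirm, and on metadata being a top-level key. The table names public.shipment and public.order_line, the key name fk_shipment_order_line, and the column names are all hypothetical.

```yaml
default_config:
  mode: "MASKING"
  target_ratio: 0.5
metadata:
  tables:
    # Hypothetical child table holding the undeclared composite foreign key.
    - table_name_with_schema: "public.shipment"
      foreign_keys:
        fk_shipment_order_line:
          referred_schema: "public"
          referred_table: "order_line"
          columns:
            # Assumption: one entry per column pair of the composite key.
            - column: "order_id"
              referred_column: "order_id"
            - column: "line_no"
              referred_column: "line_no"
```

Generating a default configuration with a dry run and checking how TDK writes out existing composite keys is a reasonable way to verify this layout before relying on it.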
How does TDK work with a live database?
TDK is designed to handle inserts/deletes/updates happening to the source database during workflow execution, ensuring that these changes do not affect the consistency of the resulting target database.
It’s important to note that working with a live database can lead to subsetting, which is a more time-consuming operation.
Are TDK generation results repeatable?
Yes, provided the same inputs are used: the same source database, configuration file, and TDK version.
Does TDK overwrite existing data or add to it?
It’s recommended to use an empty schema in the target database. You can use truncation mode in your configuration file, and TDK will truncate the target schema before inserting new data:
TDK does not take existing data in the target database into account, so if you try to add more data on top of what is already there, TDK may fail during insertion due to unique constraint violations.
How do different parameters affect TDK performance?
The number of tables that can be processed in parallel depends on the complexity of the relationships (i.e., the presence and the number of foreign keys) between them. The fewer relationships, the more tables can be processed in parallel, and the faster the execution.
The amount of data affects performance linearly; if the data is doubled, the execution time will also be doubled.
Network latency between TDK and the databases significantly affects performance. Optimal performance is achieved when the TDK and the database instances are close to each other – for example, when they are within the same data center or region of a cloud provider such as GCP or AWS.
Subsetting is the most complex scenario, and it takes longer to execute than MASKING or GENERATION.
The following parameters are available to optimize performance inside TDK:
TDK_WORKINGDIRECTORY_ENABLED enables the use of the local file system to improve TDK throughput at the cost of disk space.
TDK_DB_MAXIMUM-POOL-SIZE controls the connection pool size and the level of parallelism for data transfer.