Improving Performance

System requirements

The performance of Synthesized TDK depends heavily on the machine it is hosted on, which directly affects how long data processing takes. We recommend the following minimum configuration:

  • CPU: 8 cores

  • RAM: 32GB

  • Storage: SSD with at least 100GB of free space

  • Network: 10Gbps bandwidth

These requirements provide a good baseline. However, depending on your specific data volume and database complexity, you may need to adjust these values.

Optimizing network latency

Synthesized reads from a source database, processes the data, and writes it to a target database. Network latency between the TDK and the databases directly affects throughput.

For optimal performance:

  • Deploy TDK and database instances within the same data center or cloud region (e.g., the same AWS or GCP region).

  • If that’s not possible, deploy Agents near the databases to reduce latency.

TDK Agents are not AI agents, and they do not transmit any data from your databases to Governor. All processing remains local to the Agent, preserving data privacy.

Enabling a working directory

To reduce in-memory processing and improve overall performance, Synthesized can use the local file system to store intermediate data on disk instead of holding it in memory.

Working directories can be set for both the Synthesized backend and its agents via environment variables, e.g. in the docker-compose file. Point the system at an empty folder where it can store files; a combined example follows the steps below.

Docker

  1. Ensure the working directory is enabled in the environment section under backend:

        TDK_WORKINGDIRECTORY_ENABLED=true
        TDK_WORKINGDIRECTORY_PATH=/app/rocksdb
  2. To enable working directories in agents, add a similar section under agent:

        AGENT_WORKING_DIRECTORY_PATH=/app/rocksdb
  3. In Docker deployments we recommend setting up a volume for working directories:

        volumes:
          - [your path to a working directory]:/app/rocksdb

    This should be done in both the backend and agent sections.

  4. Ensure that this folder is accessible to Docker by running the command:

    chmod -R 777 <RocksDB host directory path>
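
Putting these steps together, a minimal docker-compose sketch might look like the following. The image names and host paths are placeholders; only the environment variables and volume mappings mirror the steps above:

    services:
      backend:
        image: synthesizedio/tdk          # placeholder image name
        environment:
          TDK_WORKINGDIRECTORY_ENABLED: "true"
          TDK_WORKINGDIRECTORY_PATH: /app/rocksdb
        volumes:
          - /data/tdk-backend-workdir:/app/rocksdb   # placeholder host path
      agent:
        image: synthesizedio/tdk-agent    # placeholder image name
        environment:
          AGENT_WORKING_DIRECTORY_PATH: /app/rocksdb
        volumes:
          - /data/tdk-agent-workdir:/app/rocksdb     # placeholder host path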

Controlling database parallelization

TDK supports concurrent connections to databases to improve throughput. Set the pool size via:

    TDK_DB_MAXIMUM-POOL-SIZE=20

Adjust this based on your database’s capabilities and available system resources.
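
In a Docker deployment, this setting sits alongside the other backend environment variables. A minimal sketch, reusing the placeholder compose layout above:

    services:
      backend:
        environment:
          TDK_DB_MAXIMUM-POOL-SIZE: "20"   # tune to your database's capabilities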

Reducing processed data volume

If you operate a large system, it is critical to understand how much data it contains and how much of it you actually need for testing. Pragmatic reductions save disk space and reduce processing time.

Data reduction techniques

Ignoring Schemas

Many systems rely on a limited number of schemas within a shared database. Setting schema filters ensures that Synthesized only processes relevant schemas.
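
As a sketch of the idea only: a workflow config could scope processing to the schemas you care about. The schema_filter key and its shape below are hypothetical, so check the configuration reference for your TDK version for the exact option:

    # Hypothetical option name -- verify against your TDK version's reference.
    schema_filter:
      include:
        - "sales"
        - "billing"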

Ignoring Tables

An easy performance win is to ignore low-utility tables, such as audit logs, that are not part of the system under test. These tables are often among the largest in the system and add little value to the test environment. Setting target_ratio: 0 for a table skips it entirely.
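
For example, an audit-log table can be skipped entirely with a per-table target_ratio of 0. The table name and the table_name_with_schema key below follow a typical TDK workflow config; adjust them to your setup:

    tables:
      - table_name_with_schema: "public.audit_log"   # illustrative table name
        target_ratio: 0                              # skip this table entirely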

Subsetting

Subsetting selects a representative sample of the data using the target_ratio option. A target_ratio of 0.1 will attempt to keep 10% of the data. Importantly, the subset maintains full referential integrity, so the sampled rows remain consistent with one another.

Target ratios are targets rather than rules. The retained data will never exceed the target ratio, but it may fall below it due to foreign key constraints: subsetting a parent table may not leave enough rows in its child tables to meet their own target ratios.
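
A minimal subsetting sketch along the same lines, keeping roughly 10% of one table (the table name is illustrative):

    tables:
      - table_name_with_schema: "public.orders"   # illustrative table name
        target_ratio: 0.1   # aim to keep ~10% of rows, FK constraints permitting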

Filtering

Filtering restricts which data is processed, allowing you to select only rows that match specific conditions (e.g. time periods, geographies, products).
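
A sketch of a per-table filter; the filter key and its WHERE-style condition are assumptions here, so verify the exact syntax against your TDK version's documentation:

    tables:
      - table_name_with_schema: "public.orders"   # illustrative table name
        filter: "country = 'GB'"                  # assumed WHERE-style condition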

Subsetting is not proportional

Subsetting a database is faster than processing the entire database, but it is slower on a per-row basis, so processing time does not scale linearly with target_ratio. Increasing a target_ratio from 0.01 to 0.1 will not take 10x as long to process. Bear this in mind when trialling the system, as naive multiplication leads to pessimistic predictions of overall system performance.

Subsetting is designed to significantly reduce the amount of data being processed, so it works best with small ratios. It can be more efficient to use a target_ratio of 1 than a target_ratio slightly below 1.

Prefer filtering on large datasets

Because subsetting is a sampled process, it reads individual rows scattered across the disk. For extremely large tables, this random access can significantly increase processing time.

If filtering can be used to select a representative sample (last X months, entries since X), this can be significantly faster than sampling.
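
Continuing the hedged filter sketch above, a time-window condition on a date column might look like:

    tables:
      - table_name_with_schema: "public.events"    # illustrative table name
        filter: "created_at >= '2024-01-01'"       # assumed WHERE-style condition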

Avoid filters on non-indexed columns

Synthesized filters translate to WHERE clauses at the database level. As such, general database best practices that apply to WHERE clauses also apply to filters.

Performance can degrade significantly if you filter on non-indexed columns in large tables. Columns with a B-tree index (the default in most databases) can be filtered efficiently, even for ranges of values.

Optimizing effective configuration

Configuration in Synthesized is managed at two levels:

  • User configuration - parameters you set explicitly

  • Effective configuration - parameters filled in automatically when no explicit value is provided

Effective configuration is more than a list of static defaults. Synthesized scans the source database to infer settings that best match your data. This reduces manual setup and keeps settings aligned as the data evolves. The trade-off is that effective configuration can increase processing time. In particular, numerical columns are queried to find statistical parameters, while categorical fields are queried to determine distributions.

If effective configuration becomes a bottleneck, include more explicit parameters in your configuration. You can copy these values directly from the Database Schema tab:

  1. Open the Database Schema tab.

  2. Select a column.

  3. Copy the displayed effective configuration into your workflow config.
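
As an illustration only, explicit parameters copied into a workflow config might look like the following. The generator type and parameter values here are placeholders; the real ones come from the Database Schema tab:

    tables:
      - table_name_with_schema: "public.customer"   # illustrative table name
        transformations:
          - columns: ["age"]
            params:
              type: "int_generator"   # placeholder; copy the real type from the tab
              min: 18                 # placeholder statistical parameters
              max: 95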