Synthesized Platform FAQ

How do I create a YAML configuration for my database?

You can run Synthesized Platform without manually defining transformations for all tables and columns:

default_config:
  mode: MASKING
  target_ratio: 1
safety_mode: RELAXED

Or

default_config:
  mode: GENERATION
  target_ratio: 2
safety_mode: RELAXED

Synthesized Platform automatically applies appropriate transformations based on mode, column types, and source data.

If you want more control over the transformations, you can print a default configuration by running the CLI dry-run command:

tdk \
    --input-url jdbc:postgresql://host:port --input-username user \
    --output-url jdbc:postgresql://host:port --output-username user \
    dry-run --ec /path/to/create/configuration.yaml

The resulting configuration will contain transformations for all tables and columns in your schema. You can then use this configuration to run the Synthesized Platform workflow.

How can I ensure the Synthesized Platform process doesn’t take the source database down?

Synthesized Platform does not block other reads and writes to your source database. You can also control the number of parallel connections to the source database:

1) Create an application.properties file in the same directory as the Synthesized Platform jar.

2) Add the following line:

tdk.db.maximum-pool-size=10

Note: The default pool size is 20.

It’s recommended to use Synthesized Platform against a replica database, rather than a production instance, to minimize the risk of any impact on your source database.

How can I exclude some tables from the Synthesized Platform workflow?

To exclude tables from the Synthesized Platform generation or masking process, set their target_ratio to 0. For example:

default_config:
  mode: GENERATION
  target_ratio: 2
tables:
  - table_name_with_schema: "demo.productlines"
    target_ratio: 0
  - table_name_with_schema: "demo.products"
    target_ratio: 0

As a result, the productlines and products tables will be empty in the output database.

How does Synthesized Platform preserve referential integrity?

Synthesized Platform automatically preserves relationships between tables based on foreign keys (FKs) defined in the database schema.

You can also define additional FKs in the YAML configuration file if they are missing from the database schema. For example, if order.user_id is a foreign key that refers to user.id but is not defined in the database schema, the following configuration can be provided:

default_config:
  mode: "MASKING"
  target_ratio: 0.5
metadata:
  tables:
    - table_name_with_schema: "public.order"
      foreign_keys:
        fk_user_order:
          referred_schema: "public"
          referred_table: "user"
          columns:
            - column: "user_id"
              referred_column: "id"

How does Synthesized Platform work with a live database?

Synthesized Platform is designed to handle inserts/deletes/updates happening to the source database during workflow execution, ensuring that these changes do not affect the consistency of the resulting target database.

Note that working with a live database can lead to subsetting, which is a more time-consuming operation.

Are Synthesized Platform generation results repeatable?

Yes, provided the same inputs are used: the same source database, configuration file, and Synthesized Platform version.

Does Synthesized Platform overwrite existing data or add to it?

It’s recommended to use an empty schema in the target database. You can use truncation mode in your configuration file, and Synthesized Platform will truncate the target schema before inserting new data:

table_truncation_mode: TRUNCATE

Synthesized Platform does not take existing data in the target database into account, so if you try to add more data on top of the existing rows, Synthesized Platform may fail during insertion due to unique constraint violations.
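The failure described above is ordinary database behaviour and can be reproduced without Synthesized Platform. The following is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The users table and the insert_rows helper are hypothetical and exist only for this illustration, and the DELETE statement merely stands in for truncating the target table.

```python
import sqlite3


def insert_rows(conn, rows):
    """Insert rows into the hypothetical users table and report the outcome."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"failed: {exc}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# First load into an empty table succeeds.
first = insert_rows(conn, [(1, "alice"), (2, "bob")])

# Second load reuses an existing key and violates the primary key constraint.
second = insert_rows(conn, [(1, "carol")])

# Emptying the table first (analogous to truncation) lets the load succeed.
conn.execute("DELETE FROM users")
third = insert_rows(conn, [(1, "carol")])

print(first, second, third, sep="\n")
```

Starting from an empty target schema, or enabling truncation as shown above, avoids this class of error.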

How do different parameters affect Synthesized Platform performance?

  • The number of tables that can be processed in parallel depends on the complexity of the relationships (i.e., the presence and the number of foreign keys) between them. The fewer relationships, the more tables can be processed in parallel, and the faster the execution.

  • The amount of data affects performance linearly; if the data is doubled, the execution time will also be doubled.

  • Network latency between Synthesized Platform and the databases significantly affects performance. Optimal performance is achieved when Synthesized Platform and the database instances are close to each other, for example within the same data center or the same region of a cloud provider such as GCP or AWS.

  • Subsetting is the most complex scenario, and it takes longer to execute than MASKING or GENERATION.
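To illustrate why fewer foreign keys allow more parallelism, here is a minimal sketch that groups tables into batches whose members do not depend on each other. It is not Synthesized Platform's actual scheduler. The parallel_batches function and the table names are hypothetical, and the sketch assumes that a table can be processed only after every table it references.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_batches(foreign_keys):
    """Group tables into batches; tables within one batch are independent.

    foreign_keys maps each table to the set of tables it references.
    """
    sorter = TopologicalSorter(foreign_keys)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tables whose references are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical schema with a chain of foreign keys: three sequential batches.
with_fks = {
    "orders": {"users", "products"},
    "products": {"productlines"},
    "users": set(),
    "productlines": set(),
}

# The same tables without any foreign keys: a single batch.
without_fks = {table: set() for table in with_fks}

print(parallel_batches(with_fks))
print(parallel_batches(without_fks))
```

Under these assumptions, the schema with foreign keys needs three sequential batches, while the schema without them can be handled in one.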

The following parameters are available to optimize performance inside Synthesized Platform:

TDK_WORKINGDIRECTORY_ENABLED enables the use of the local file system to improve Synthesized Platform throughput at the cost of disk space:

TDK_WORKINGDIRECTORY_ENABLED=true
TDK_WORKINGDIRECTORY_PATH=/home/tdk/working-directory

To enable the working directory in agents, set the following agent property:

AGENT_WORKING_DIRECTORY_PATH=/home/tdk/working-directory

TDK_DB_MAXIMUM-POOL-SIZE controls the connection pool size and the level of parallelism for data transfer:

TDK_DB_MAXIMUM-POOL-SIZE=20

Are there any specific hardware requirements that need to be met for optimal performance?

The hardware requirements for optimal performance depend on the database size. As a general recommendation, we suggest the following configuration:

  • CPU: 8 cores

  • RAM: 32GB

  • Storage: SSD, 100GB free space

  • Network: 10Gbps bandwidth

Please note that these are general guidelines. Depending on the specific use-case and database, adjustments might be necessary.

What is the difference between MASKING and GENERATION?

MASKING and GENERATION are two very different modes, each with its own advantages and limitations. The following points summarize the differences:

  • MASKING mode transforms the values of the source database.

  • MASKING mode with target_ratio: 1 masks the entire source database.

  • MASKING mode with target_ratio < 1 masks a subset of the source database.

  • MASKING mode does not work with target_ratio > 1.

  • GENERATION mode learns the statistical properties of the source data and generates completely new data.

  • GENERATION mode with target_ratio: 1 generates the database with the same number of rows as those in the source database.

  • GENERATION can use any target_ratio; more or less data can be generated.

  • GENERATION can be used on an empty schema, and data will be generated based on column types and referential integrity (foreign and primary keys) defined in the database.

  • Users can provide custom GENERATION or MASKING rules to both modes.

  • MASKING and GENERATION modes can be combined in one workflow. Some tables can be masked while others can be generated. Synthesized Platform automatically preserves referential integrity between them based on the foreign and primary keys.
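The target_ratio rules listed above can be summarized in a minimal sketch. The expected_rows function is hypothetical and is not part of Synthesized Platform. It assumes that the nominal output size is the source row count multiplied by target_ratio, rounded to the nearest integer. The number of rows the platform actually produces may differ.

```python
def expected_rows(mode, source_rows, target_ratio):
    """Return the nominal output row count for a table (illustrative only)."""
    if mode not in ("MASKING", "GENERATION"):
        raise ValueError(f"unknown mode: {mode}")
    if target_ratio < 0:
        raise ValueError("target_ratio must not be negative")
    if mode == "MASKING" and target_ratio > 1:
        # MASKING transforms existing rows, so it cannot produce more rows
        # than the source contains.
        raise ValueError("MASKING does not work with target_ratio > 1")
    # Rounding is an assumption made for this sketch.
    return round(source_rows * target_ratio)


print(expected_rows("MASKING", 1000, 1))       # whole table masked
print(expected_rows("MASKING", 1000, 0.5))     # masked subset
print(expected_rows("GENERATION", 1000, 2))    # more rows than the source
print(expected_rows("GENERATION", 1000, 0))    # table left empty
```

A target_ratio of 0 yields no rows in either mode, which matches the way tables are excluded from a workflow.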

When synthesizing data, is there any limitation on the number of columns / tables?

There is no limitation on the number of columns / tables being synthesized as part of the data generation process.

Is there an option to just use canonical data schema (only structure) to produce a dataset?

Yes, Synthesized Platform can generate data from the schema alone when run in GENERATION mode.

What is the maximum database size handled? What are the infrastructure considerations to scale performance with size?

Synthesized is commonly used to process databases containing terabytes of data.

Is it possible to subset the data using specific test conditions, like using SQL query-like filtering?

Database subsetting is available with the software. The documentation is available here.

How does the software remember the data profiling and sensitive data identification for the next run?

Workflows are saved in the system and retain their sensitive data annotations for the next run.

Is it possible to apply data synthesis only to selected columns and use other column data as it is?

Yes, it is possible to apply data synthesis only to selected columns and keep the data in the other columns as is.

Is there a capability within the tool to perform data profiling (query and understand data)?

To perform data synthesis, Synthesized Platform performs data profiling as part of the workflow.

Can you define custom relationships amongst the tables? If so, does the tool have the capability to persist user-defined relationships?

Synthesized understands relationships between tables automatically. It also has the capability to persist user-defined relationships.

Please provide a list of supported databases.

Synthesized supports all relational databases. Available out-of-the-box with no additional configuration: PostgreSQL, MySQL, MariaDB, Oracle, MSSQL, SQLite, DB2. Additional relational database support is provided for the following databases: Aurora MySQL Edition, Aurora PostgreSQL Edition, Azure SQL Data Warehouse (Azure Synapse Analytics), Derby, Firebird, H2, HANA, HSQLDB, Informix, Ingres, Microsoft Access, Redshift, Sybase Adaptive Server Enterprise, Sybase SQL Anywhere, Teradata, Vertica.

What types of logs and reports are available post-masking and post-generation execution to aid in identifying and sampling the results?

The detailed workflow configuration and execution plan are available both before and after the execution. This includes a list of all transformations and their parameters for all columns and tables in the processed schemas.

Extensive logs include detailed information about the workflow execution, starting from connecting to the databases, all performed transformations, and insertion logs, with all warnings and errors that occurred.

Regarding database password security while executing the workflows via CLI commands, how can we encrypt the DB credentials in the CLI commands or through API channels?

Synthesized Platform supports integrations with various secret managers, and more options can be added on request. The HashiCorp Vault integration tutorial shows how such an integration can be used with the Synthesized Platform CLI.

Does the Synthesized capability support encryption masking with various encryption methods, and does it also allow reverse masking through decryption transformations?

The primary masking transformer, format_preserving_hashing, uses the FF1 encryption algorithm (Section 5.1, FF1).

Profiling and PII Discovery Tool Integration: Since Synthesized does not have in-built PII scanning functionality, can it be integrated with open-source Data profiling and Personally Identifiable Information (PII) discovery Solutions?

Yes, it’s possible. We have experience integrating with various solutions such as BigID. In such integration, insights from Data Profiling and PII Discovery can be used to auto-tune the workflow configuration. Synthesized Platform can be integrated with open-source Data Profiling and PII Discovery solutions in a similar way.

Can Synthesized be integrated with any open-source tools which can compare schemas and detect schema changes between the source and target databases? Does Synthesized assist in comparing data between the source and target databases, ensuring consistency and accuracy?

Yes, Synthesized Platform supports integration with Flyway, and the demo is available here. Flyway supports Drift Detection. Synthesized Platform focuses on data masking and generation and can be easily integrated with other tools for comparing the source and target databases.

Does Synthesized Platform support SaaS-based Databases and Applications like Oracle ERP, Salesforce, etc.?

Synthesized Platform is designed to process complex enterprise schemas with extremely flexible YAML configuration.

Does Synthesized support masking or generating Blob Data types and other special Data types?

Blob and other special data types require additional user configuration, which is possible through flexible value mapping.

Is it possible to pause or abort any Synthesized Platform Execution (Masking, Generation) once we start the execution? What execution controls are available during Synthesized Platform execution?

Yes, after starting the workflow with the Synthesized UI, the 'Cancel run' button is available to stop execution at any step. In the CLI, the process can simply be interrupted.

With a higher number of columns and tables, do we need GPU support?

No, GPU support is not required for a higher number of columns or tables. The software is optimized to run on CPUs.