Overview

DB-Synthesizer is the Java SDK of the Synthesized DataOps platform that allows the user to work with entire relational databases.

When to Use this Package

DB-Synthesizer can work with relational databases that have complex structures, such as constraints and tables that reference other tables. It can generate, in minutes, large databases that replicate the structure and preserve the high-level data distributions of another database. This package prioritizes scalability and is especially useful when working with a large number of tables.

Use Cases

This software is intended mainly for software development and testing. Relational databases are the central point of many applications, and access to data can be crucial for developing, maintaining, and testing data-centric applications. However, obtaining access to production data can be difficult and time-consuming.

DB-Synthesizer provides a secure, privacy-preserving, and tailored version of production data that can be used for many purposes, including (i) creating a privacy-compliant replica of production for development and testing, and (ii) generating large amounts of data for performance testing.

High Quality Data Generation

Some low-level statistical properties of the data, such as column correlations or predictive utility, might not be preserved. When this information is needed in the output data and scalability to multiple tables is not the priority, the Synthesized SDK is recommended. The SDK is able to learn and generate high-quality data products while preserving and even improving data quality without compromising data privacy.

Synthesizing a Database

Once connected to a database, DB-Synthesizer extracts the model, learns the necessary information from the database, and generates a synthetic copy that is written to the destination database. To run this action, set the argument -a synthesize.

While the new database preserves the high-level information of the source database, it is free of sensitive information and preserves data privacy, as no original data is present in the generated dataset.

The following information is preserved from the source database to the target database:

  • Table and column names. The schema of all tables and columns will be copied from the source database

  • Data types. All columns in the destination database will have the same data type as in the source database

  • DDL. The DDL of the source database (including constraints, procedures, views, sequences, etc.) will be present in the synthesized database as it was in the source database

  • Referential Integrity. Primary and foreign keys will be copied, ensuring that referential integrity is preserved so that the user can query data with join statements and obtain similar results. A configuration file with additional relationships can be provided

  • Key Cardinality. Foreign key distributions are generated as close as possible to the source database to ensure similar cardinality

  • Column marginal distributions. Column marginal distributions are approximated by probability distributions, and values are sampled from them so that they are similar to those in the original database
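
To make the idea of approximating and sampling a column marginal distribution more concrete, the following is a minimal, self-contained Java sketch. It is an illustration only and does not show DB-Synthesizer's implementation or API: the class name MarginalSketch, its fit and sample methods, and the example status values are all hypothetical. It assumes a single non-sensitive categorical column whose distribution is approximated by simple value frequencies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative only: hypothetical names, not DB-Synthesizer's implementation or API. */
final class MarginalSketch {

    /** Approximates the marginal distribution of a column by counting each value. */
    static Map<String, Integer> fit(List<String> sourceColumn) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : sourceColumn) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    /** Draws n values, each with probability proportional to its fitted count. */
    static List<String> sample(Map<String, Integer> counts, int n, Random rng) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("empty source column");
        }
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            int r = rng.nextInt(total); // uniform in [0, total)
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                r -= entry.getValue();
                if (r < 0) { // r fell inside this value's share of the total
                    out.add(entry.getKey());
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical source column with three distinct status values.
        List<String> source = List.of("ACTIVE", "ACTIVE", "ACTIVE", "CLOSED", "CLOSED", "PENDING");
        Map<String, Integer> counts = fit(source);
        List<String> synthetic = sample(counts, 10_000, new Random(42));
        System.out.println("source counts:    " + counts);
        System.out.println("synthetic counts: " + fit(synthetic));
    }
}
```

In this sketch the synthetic column contains only values seen in the source, and their relative frequencies approach those of the source as the sample grows. For sensitive or continuous columns, a real tool would need a different approach, which this example does not attempt to show.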