Overview

Synthesized is the Python SDK of the Synthesized DataOps platform.

The functionality of the ML core can be viewed as three stages of the synthesis process:

  1. Analysis - creating a description/understanding from a given dataset.

  2. Augmentation - modifying a description with another description.

  3. Curation - creating a dataset from a given description/understanding.
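The three stages above can be illustrated with a toy sketch. It does not use the Synthesized API: the `GaussianDescription` class and the `analyse`, `augment` and `curate` functions are hypothetical names chosen for illustration, and the "description" is just a mean and standard deviation of one numeric column, which is far simpler than a learned model.

```python
import random
import statistics
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GaussianDescription:
    """Hypothetical stand-in for a description of one numeric column."""
    mean: float
    stdev: float


def analyse(values):
    """Analysis: create a description from a given dataset."""
    return GaussianDescription(statistics.mean(values), statistics.stdev(values))


def augment(description, other):
    """Augmentation: modify a description with another description.

    In this toy version, only the mean is taken from the other description.
    """
    return replace(description, mean=other.mean)


def curate(description, n, seed=0):
    """Curation: create a dataset from a given description."""
    rng = random.Random(seed)
    return [rng.gauss(description.mean, description.stdev) for _ in range(n)]


ages = [23.0, 35.0, 41.0, 29.0, 52.0, 38.0]
description = analyse(ages)
shifted = augment(description, GaussianDescription(mean=60.0, stdev=1.0))
synthetic = curate(shifted, n=1000)
```

Here `synthetic` keeps the spread of the original ages but is centred on the mean supplied by the second description.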

These stages apply at three levels, each corresponding to a different scale of data:

  1. DataSeries - A single column of data.

  2. DataFrame - A single table or multiple columns of data.

  3. DataBase - A single database or multiple data frames/tables of data.
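As a rough illustration of how the three levels nest, the sketch below models them with plain Python containers. The type aliases `Series`, `Frame` and `Database` and the helper `column_lengths_match` are hypothetical names used only here; they are not SDK classes.

```python
from typing import Dict, List, Union

Value = Union[int, float, str, None]
Series = List[Value]          # a single column of data
Frame = Dict[str, Series]     # a table: column name -> column
Database = Dict[str, Frame]   # a database: table name -> table

ages: Series = [34, 51, 27]

customers: Frame = {"customer_id": [1, 2, 3], "age": ages}
orders: Frame = {"order_id": [10, 11, 12], "customer_id": [1, 1, 3]}

shop: Database = {"customers": customers, "orders": orders}


def column_lengths_match(frame: Frame) -> bool:
    """A table is well formed only if all of its columns have the same length."""
    return len({len(column) for column in frame.values()}) <= 1
```

Each level is built from the one below it, so an operation defined on a single column can in principle be lifted to a table and then to a set of tables.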

The diagram below outlines these processes, all of which operate on data and the information it represents.

Figure: Overview of Synthesized's SDK

Use Cases

Here are just a few of the things that Synthesized does well:

  • Generate arbitrary amounts of high-quality, privacy-preserving synthetic data.

  • Rebalance and reshape existing data through intelligent conditional sampling.

  • Improve the quality of original data by imputing missing values.
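To make the last two use cases concrete, here is a deliberately naive sketch of imputation and rebalancing using only the standard library. It is not how Synthesized implements either feature: median imputation and resampling of existing rows stand in for model-based generation, and the `rebalance` helper and the sample records are invented for illustration.

```python
import random
import statistics
from collections import Counter

rows = [
    {"age": 34, "churned": False},
    {"age": None, "churned": False},
    {"age": 51, "churned": False},
    {"age": 27, "churned": False},
    {"age": 45, "churned": True},
    {"age": None, "churned": True},
]

# Imputation: replace missing ages with the median of the observed ages.
observed_ages = [row["age"] for row in rows if row["age"] is not None]
median_age = statistics.median(observed_ages)
imputed = [
    {**row, "age": median_age if row["age"] is None else row["age"]}
    for row in rows
]


def rebalance(records, key, per_class, seed=0):
    """Draw the same number of records (with replacement) for each value of `key`."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    balanced = []
    for members in groups.values():
        balanced.extend(rng.choices(members, k=per_class))
    return balanced


balanced = rebalance(imputed, key="churned", per_class=4)
class_counts = Counter(row["churned"] for row in balanced)
```

After rebalancing, both classes contain four records each. A generative approach would instead sample new rows conditioned on the class rather than repeating existing ones.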

The main use case of the Synthesized SDK is to generate new data products from synthetic data that is privacy-preserving by design. By learning a model of the original data, Synthesized can generate high-quality, self-service data that retains as much of the utility of the original data as possible.

The synthetic data generation is able to incorporate:

Synthesizing DataBases

For very large databases with complicated primary-foreign key relationships, our DataBase Synthesizer SDK is the tool to use. Written in Java, the DB Synthesizer SDK focuses on maintaining properties such as referential integrity and key cardinality whilst generating privacy-preserving replicas of production data for development and testing purposes.
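The two properties named above can be checked with a few lines of Python. This is an illustrative sketch only, not part of the Java DB Synthesizer SDK: the tables are made-up examples, and `orphaned_keys` and `key_cardinality` are hypothetical helper names.

```python
from collections import Counter

# Parent table (primary key: customer_id) and child table (foreign key: customer_id).
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]


def orphaned_keys(child_rows, parent_rows, key):
    """Referential integrity: foreign keys in the child with no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


def key_cardinality(child_rows, parent_rows, key):
    """Key cardinality: number of child rows that reference each parent key."""
    counts = Counter(row[key] for row in child_rows)
    return {row[key]: counts.get(row[key], 0) for row in parent_rows}


orphans = orphaned_keys(orders, customers, "customer_id")
cardinality = key_cardinality(orders, customers, "customer_id")
```

A replica preserves referential integrity when `orphans` is empty, and preserves key cardinality when the distribution of values in `cardinality` resembles that of the production data.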

Note

Documentation for the DB Synthesizer SDK is available here: DB-Synthesizer Documentation.