The Synthesized Python package generates high-quality structured synthetic data.
Its functionality can be viewed as three stages of the synthesis process:
Analysis - creating a description/understanding from a given dataset.
Augmentation - modifying a description with another description.
Curation - creating a dataset from a given description/understanding.
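To make the three stages concrete, here is a minimal conceptual sketch using only the Python standard library. It is not the Synthesized API: `ColumnDescription`, `analyse`, `augment` and `curate` are hypothetical names chosen for illustration, and the "description" is assumed to be just a mean and standard deviation of one numeric column.

```python
import random
import statistics
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnDescription:
    """Hypothetical summary of one numeric column (not a Synthesized class)."""
    mean: float
    stdev: float


def analyse(values):
    """Analysis: build a description from a given dataset."""
    return ColumnDescription(statistics.mean(values), statistics.stdev(values))


def augment(description, other):
    """Augmentation: modify a description with another (here, take its mean)."""
    return replace(description, mean=other.mean)


def curate(description, n, seed=0):
    """Curation: create a dataset from a given description."""
    rng = random.Random(seed)
    return [rng.gauss(description.mean, description.stdev) for _ in range(n)]


ages = [23, 35, 41, 29, 52, 38]
described = analyse(ages)
# Shift the centre of the distribution while keeping the learned spread.
shifted = augment(described, ColumnDescription(mean=60.0, stdev=0.0))
synthetic = curate(shifted, n=1000)
```

A real description would capture far more than two statistics; the point here is only the flow from data to description and back to data.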
Each stage can be applied at three levels, which correspond to different scales of data:
DataSeries - a single column of data.
DataFrame - a single table or multiple columns of data.
DataBase - a single database or multiple data frames/tables of data.
These processes are outlined by the diagram below and are always concerned with working with data and the information it represents.
Here are just a few of the things that Synthesized does well:
Generate arbitrary amounts of high-quality, privacy-preserving synthetic data.
Rebalance and reshape existing data through intelligent conditional sampling.
Improve the quality of original data by imputing missing values.
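As a rough illustration of the last two items, the sketch below rebalances a tiny dataset and fills a missing value using only the standard library. It is a deliberately naive stand-in, assuming mean imputation and per-group resampling with replacement; it does not show how Synthesized implements conditional sampling or imputation, and `impute_mean` and `rebalance` are hypothetical helper names.

```python
import random
from collections import Counter

rows = [
    {"label": "fraud", "amount": 910.0},
    {"label": "ok", "amount": 20.0},
    {"label": "ok", "amount": None},  # missing value to impute
    {"label": "ok", "amount": 40.0},
]


def impute_mean(rows, column):
    """Fill missing values in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = sum(observed) / len(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def rebalance(rows, column, n_per_value, seed=0):
    """Draw n_per_value rows per distinct value of `column`, with replacement."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[column], []).append(r)
    return [rng.choice(group) for group in groups.values() for _ in range(n_per_value)]


complete = impute_mean(rows, "amount")
balanced = rebalance(complete, "label", n_per_value=3)
counts = Counter(r["label"] for r in balanced)  # 3 "fraud" rows and 3 "ok" rows
```

A learned model would instead impute and sample conditionally on the other columns, rather than copying existing rows or using a global mean.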
The main use case of the Synthesized SDK is to build new data products from synthetic data that is privacy-preserving by design. By learning an intelligent model of the original data, Synthesized can generate high-quality, self-service data that captures as much of the utility of the original data as possible.
Synthetic data generation can incorporate:
Realistic PII and sensitive data, such as names, bank account details and addresses, using entity annotations.
Strict business rules and logic, using rule specification.
Realistic custom-formatted strings, such as social security numbers.
Custom scenarios for testing purposes, using conditional sampling.
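Two of the capabilities listed above, formatted strings and business rules, can be sketched in plain Python. This is an illustration under stated assumptions, not the Synthesized rule or annotation API: `fake_ssn` and `generate_accounts` are hypothetical helpers, the NNN-NN-NNNN layout is assumed for social security numbers, and the rule "overdraft limit never exceeds balance" is invented for the example.

```python
import random
import re

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def fake_ssn(rng):
    """Return a random string in NNN-NN-NNNN layout (not a validly issued number)."""
    return f"{rng.randrange(1000):03d}-{rng.randrange(100):02d}-{rng.randrange(10000):04d}"


def generate_accounts(n, seed=0):
    """Generate rows that satisfy the assumed rule: overdraft_limit <= balance."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        balance = round(rng.uniform(0, 10_000), 2)
        # Sampling the limit from [0, balance] enforces the rule by construction.
        overdraft = round(rng.uniform(0, balance), 2)
        rows.append({"ssn": fake_ssn(rng), "balance": balance, "overdraft_limit": overdraft})
    return rows


accounts = generate_accounts(5)
```

Enforcing a rule by construction, as here, guarantees every generated row is valid; the alternative of generating freely and rejecting invalid rows is simpler to write but can waste samples when the rule is strict.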
For very large databases with complicated primary-foreign key relationships, our DataBase Synthesizer SDK is the tool to use. Written in Java, the DB Synthesizer SDK focuses on maintaining properties such as referential integrity and key cardinality whilst generating privacy-preserving replicas of production data for development and testing purposes.
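The two properties named above can be stated precisely with a small check. The sketch below is written in Python for consistency with the rest of this page and is not part of the Java DB Synthesizer SDK; the tables and the helpers `orphaned_keys` and `children_per_parent` are hypothetical.

```python
from collections import Counter

customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]


def orphaned_keys(child_rows, parent_rows, key):
    """Referential integrity: foreign-key values with no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


def children_per_parent(child_rows, parent_rows, key):
    """Key cardinality: how many child rows reference each parent key."""
    counts = Counter(row[key] for row in child_rows)
    return {row[key]: counts.get(row[key], 0) for row in parent_rows}


orphans = orphaned_keys(orders, customers, "customer_id")
cardinality = children_per_parent(orders, customers, "customer_id")
```

A synthetic replica preserves referential integrity when `orphans` is empty, and preserves key cardinality when the distribution of values in `cardinality` resembles that of the production data.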
Documentation for the DB Synthesizer SDK is available in the DB-Synthesizer Documentation.