The SDK generates high-quality, privacy-preserving datasets for machine learning and data science use cases. It’s available on PyPI for a free 30-day trial. Install the SDK now!
Version 3.1 released!
Additional Spark dtypes, automatic enumeration, performance improvements, and more!
A complete list of changes is available in the changelog.
New in v3.0: Spark Integration
Version 3.0 provides native integration with pyspark for tabular synthesis.
New in v3.0: Native Pandas and Spark masking
Version 3.0 provides expanded masking capabilities implemented natively for pandas and pyspark dataframes.
Internally, once data has been loaded into the Python SDK from a data source, Synthesized handles it in three steps:
Annotate and preprocess data: the software automatically detects data formats and types, and it handles missing data and erroneous values.
Build a mathematical generative model of data: the software builds a generative representation, a mathematical equation that encapsulates what the properties of the data should look like. Internally, this equation lets the user take pure noise (a sequence of standard normal random variables) and transform it into output data that has the properties of the original data.
Synthesize a new dataset from the generative model: once the generative model is trained, it can be used to generate new samples of data on demand. The software also supports data manipulation, which can rebalance some of the variables so that the output data has the desired properties.
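The three steps above can be illustrated with a toy sketch in plain Python. This is not the SDK's API or its actual model: the function names (preprocess, fit, synthesize) are hypothetical, and a simple two-column Gaussian model stands in for the generative model. It shows only the general idea of fitting a model and then transforming standard normal noise into data with similar statistics.

```python
import math
import random
import statistics


def preprocess(rows):
    """Step 1 (stand-in): keep only rows whose values parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append(tuple(float(v) for v in row))
        except (TypeError, ValueError):
            continue  # drop rows with missing (None) or erroneous entries
    return clean


def fit(rows):
    """Step 2 (stand-in): fit column means and the Cholesky factor of the 2x2 covariance."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    # Lower-triangular factor L such that L @ L.T equals the covariance matrix.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(syy - l21 ** 2, 0.0))
    return (mx, my), (l11, l21, l22)


def synthesize(model, n, seed=0):
    """Step 3 (stand-in): map standard normal noise through the fitted model."""
    rng = random.Random(seed)
    (mx, my), (l11, l21, l22) = model
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)  # pure noise
        samples.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return samples


# Hypothetical input with one missing and one erroneous value.
original = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), ("n/a", 1.0), (4.0, 8.1)]
clean = preprocess(original)
model = fit(clean)
synthetic = synthesize(model, 1000)
```

The synthetic rows are new draws rather than copies of the original rows, yet their means and the correlation between the two columns track those of the cleaned input. A production generative model is far richer than a Gaussian, but the noise-to-data transformation has the same overall shape.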
Explore the Python API for some of the core components of the SDK.
View comprehensive tutorials and download example IPython notebooks that guide you through various scenarios.