Performance Benchmarks

We provide execution timings for a collection of transformations, performed with the Synthesized product, that together define a data pipeline for a Dev & Test environment.

The collection of transformations can be used for testing software applications and services, but also for machine learning and analytics use cases.

Concrete benchmarks depend on three key variables:

  • Source data (number of columns, number of rows, tables, databases)

  • Type of transformation applied to the data for a given application (software testing, testing of machine learning models)

  • The software used to execute the transformations

There are also minor dependencies on the type of data source (database vs. flat file vs. data warehouse) and on data types (float, JSON, etc.), but we consider these effects insignificant for the purpose of this section.

Typical hardware requirements to execute transformations

The choice of hardware largely depends on:

  1. The typical type of data used with the software

  2. The most frequent application of the software

Workload                                                          Recommended hardware
Single table, independent attributes, 30 columns, 100k records    2 vCPU/4 GB RAM
Single table, dependent attributes, 30 columns, 100k records      2 vCPU/4 GB RAM
Single table, 40 columns, 10 million records                      4 vCPU/8 GB RAM
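
As a quick sanity check before running a pipeline, the host's resources can be compared against these recommendations. The sketch below is illustrative only: it assumes the third-party psutil package is installed, and the thresholds correspond to the 10-million-record workload in the table above.

```python
import os

import psutil  # third-party; assumed to be installed

# Recommended minimums for the "40 columns, 10 million records" workload above.
MIN_VCPUS = 4
MIN_RAM_GB = 8

vcpus = os.cpu_count() or 0
ram_gb = psutil.virtual_memory().total / 1024 ** 3

print(f"Detected {vcpus} vCPUs and {ram_gb:.1f} GB RAM")
if vcpus < MIN_VCPUS or ram_gb < MIN_RAM_GB:
    print("Warning: below the recommended hardware for this workload; "
          "expect longer run times or memory pressure.")
```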

Performance benchmarks

This benchmark covers generating high-quality fake data for 1-2 (possibly linked) tables. It is useful for filling in missing values, backfilling historical data, and generating privacy-compliant data for analytics or machine learning tasks. The synthetic data is statistically similar to the original, so many tasks that rely on these statistical signals can be performed on the synthetic data instead.
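
As an illustration of what "statistically similar" means in practice, the following sketch compares per-column distributions of an original and a synthetic DataFrame. The DataFrame names and the choice of measures (a two-sample Kolmogorov-Smirnov statistic for continuous columns, total variation distance for categorical ones) are our own assumptions for this example and are not part of the product.

```python
import pandas as pd
from scipy.stats import ks_2samp


def compare_columns(df_original: pd.DataFrame, df_synthetic: pd.DataFrame) -> None:
    """Print a rough per-column similarity summary for two DataFrames."""
    for col in df_original.columns:
        if pd.api.types.is_numeric_dtype(df_original[col]):
            # Two-sample Kolmogorov-Smirnov statistic: 0.0 means identical
            # empirical distributions, 1.0 means completely disjoint ones.
            stat, _ = ks_2samp(df_original[col].dropna(), df_synthetic[col].dropna())
            print(f"{col}: KS statistic = {stat:.3f}")
        else:
            # Total variation distance between category frequencies (0.0 = identical).
            p = df_original[col].value_counts(normalize=True)
            q = df_synthetic[col].value_counts(normalize=True)
            tvd = p.subtract(q, fill_value=0).abs().sum() / 2
            print(f"{col}: total variation distance = {tvd:.3f}")
```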

Dataset size*                 2 vCPUs/4 GB RAM    4 vCPUs/8 GB RAM    8 vCPUs/16 GB RAM    16 vCPUs/64 GB RAM
Small (< 10^5 rows)           12 min              10.5 min            8 min                5 min
Medium (10^5 to 10^6 rows)    (RAM limited)       12 min              11 min               6.5 min
Large (>= 10^7 rows)          (RAM limited)       (RAM limited)       (RAM limited)        20 min

*Example datasets are 30 columns each, half of which are categorical and half continuous. Configuration options are also available that limit generation time at the expense of data quality.
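
To get rough timings on your own hardware, a dataset with the same shape as the example datasets can be generated directly. The sketch below builds a 30-column frame (half categorical, half continuous) of a chosen size and times a transformation step; run_pipeline is a purely illustrative placeholder for whatever Synthesized configuration is being benchmarked.

```python
import time

import numpy as np
import pandas as pd


def make_example_dataset(n_rows: int, n_cols: int = 30) -> pd.DataFrame:
    """Build a frame matching the footnote: half categorical, half continuous columns."""
    rng = np.random.default_rng(seed=0)
    data = {}
    for i in range(n_cols // 2):
        data[f"cat_{i}"] = rng.choice(list("ABCDE"), size=n_rows)  # categorical
    for i in range(n_cols - n_cols // 2):
        data[f"num_{i}"] = rng.normal(size=n_rows)                 # continuous
    return pd.DataFrame(data)


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the transformation pipeline being benchmarked."""
    return df.copy()


df = make_example_dataset(n_rows=100_000)  # 100k records, as in the sizing examples above
start = time.perf_counter()
run_pipeline(df)
print(f"Elapsed: {time.perf_counter() - start:.1f} s")
```

Swapping run_pipeline for the real pipeline and varying n_rows reproduces the Small/Medium/Large buckets used in the table above.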