Performance Benchmarks

We provide execution timings for a collection of transformations, performed with the Synthesized product, that together define a data pipeline for a Dev & Test environment.

The collection of transformations can be used for testing software applications and services, but also for machine learning and analytics use cases.

Concrete benchmarks depend on three key variables:

  • Source data (number of columns, number of rows, tables, databases)

  • Type of transformation applied to the data for a given application (software testing, testing of machine learning models)

  • The software used to execute the transformations

There are also minor dependencies on the type of data source (database vs. flat file vs. data warehouse) and on data types (float, JSON, etc.), but we consider these effects insignificant for the purpose of this section.

Typical hardware requirements to execute transformations

The choice of hardware largely depends on:

  1. The typical type of data used with the software

  2. The most frequent application of the software

Workload                                                          Recommended hardware
Single table, independent attributes, 30 columns, 100k records    2 vCPU/4 GB RAM
Single table, dependent attributes, 30 columns, 100k records      2 vCPU/4 GB RAM
Single table, 40 columns, 10 million records                      4 vCPU/8 GB RAM
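
As a quick sanity check before running a pipeline, the host's resources can be compared against these recommendations. The sketch below is illustrative only: it assumes the third-party psutil package is installed, and the thresholds correspond to the 10-million-record workload in the table above.

```python
import os

import psutil  # third-party; assumed to be installed

# Recommended minimums for the "40 columns, 10 million records" workload above.
MIN_VCPUS = 4
MIN_RAM_GB = 8

vcpus = os.cpu_count() or 0
ram_gb = psutil.virtual_memory().total / 1024 ** 3

print(f"Detected {vcpus} vCPUs and {ram_gb:.1f} GB RAM")
if vcpus < MIN_VCPUS or ram_gb < MIN_RAM_GB:
    print("Warning: below the recommended hardware for this workload; "
          "expect longer run times or memory pressure.")
```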

Performance benchmarks

This benchmark covers generating high-quality fake data for 1-2 (possibly linked) tables. It is useful for filling in missing values, backfilling historical data, and generating privacy-compliant data for analytics or machine learning tasks. The synthetic data is statistically similar to the original, so many tasks that rely on these statistical signals can be performed on the synthetic data instead.
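
As an illustration of what "statistically similar" means in practice, the following sketch compares per-column distributions of an original and a synthetic DataFrame. The DataFrame names and the choice of measures (a two-sample Kolmogorov-Smirnov statistic for continuous columns, total variation distance for categorical ones) are our own assumptions for this example and are not part of the product.

```python
import pandas as pd
from scipy.stats import ks_2samp


def compare_columns(df_original: pd.DataFrame, df_synthetic: pd.DataFrame) -> None:
    """Print a rough per-column similarity summary for two DataFrames."""
    for col in df_original.columns:
        if pd.api.types.is_numeric_dtype(df_original[col]):
            # Two-sample Kolmogorov-Smirnov statistic: 0.0 means identical
            # empirical distributions, 1.0 means completely disjoint ones.
            stat, _ = ks_2samp(df_original[col].dropna(), df_synthetic[col].dropna())
            print(f"{col}: KS statistic = {stat:.3f}")
        else:
            # Total variation distance between category frequencies (0.0 = identical).
            p = df_original[col].value_counts(normalize=True)
            q = df_synthetic[col].value_counts(normalize=True)
            tvd = p.subtract(q, fill_value=0).abs().sum() / 2
            print(f"{col}: total variation distance = {tvd:.3f}")
```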

Dataset size*                 2 vCPUs/4 GB RAM    4 vCPUs/8 GB RAM    8 vCPUs/16 GB RAM    16 vCPUs/64 GB RAM
Small (< 10^5 rows)           12 min              10.5 min            8 min                5 min
Medium (10^5 to 10^6 rows)    (RAM limited)       12 min              11 min               6.5 min
Large (>= 10^7 rows)          (RAM limited)       (RAM limited)       (RAM limited)        20 min

*Example datasets are 30 columns each, half of which are categorical and half continuous. Configuration options are also available that limit generation time at the expense of data quality.
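
To get rough timings on your own hardware, a dataset with the same shape as the example datasets can be generated directly. The sketch below builds a 30-column frame (half categorical, half continuous) of a chosen size and times a transformation step; run_pipeline is a purely illustrative placeholder for whatever Synthesized configuration is being benchmarked.

```python
import time

import numpy as np
import pandas as pd


def make_example_dataset(n_rows: int, n_cols: int = 30) -> pd.DataFrame:
    """Build a frame matching the footnote: half categorical, half continuous columns."""
    rng = np.random.default_rng(seed=0)
    data = {}
    for i in range(n_cols // 2):
        data[f"cat_{i}"] = rng.choice(list("ABCDE"), size=n_rows)  # categorical
    for i in range(n_cols - n_cols // 2):
        data[f"num_{i}"] = rng.normal(size=n_rows)                 # continuous
    return pd.DataFrame(data)


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the transformation pipeline being benchmarked."""
    return df.copy()


df = make_example_dataset(n_rows=100_000)  # 100k records, as in the sizing examples above
start = time.perf_counter()
run_pipeline(df)
print(f"Elapsed: {time.perf_counter() - start:.1f} s")
```

Swapping run_pipeline for the real pipeline and varying n_rows reproduces the Small/Medium/Large buckets used in the table above.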