Performance Benchmarks
This section reports execution timings for a collection of transformations run with the Synthesized product. Together, these transformations define a data pipeline for a Dev & Test environment.
The same transformations can be used for testing software applications and services, as well as for machine learning and analytics use cases.
Concrete benchmarks depend on three key variables:
- Source data (number of columns, number of rows, tables, databases)
- Type of transformation applied to data for a given application (software testing, testing of machine learning models)
- Software to execute transformations
There are also minor dependencies on the type of data source (database, flat file, or data warehouse) and on data types (float, JSON, etc.), but we consider those effects insignificant for the purposes of this section.
Typical hardware requirements to execute transformations
The choice of hardware largely depends on:
- The typical type of data used with the software
- The most frequent application of the software
Dataset | Hardware |
---|---|
Single table, independent attributes, 30 columns, 100k records | 2vCPU/4GB RAM |
Single table, dependent attributes, 30 columns, 100k records | 2vCPU/4GB RAM |
Single table, 40 columns, 10 million records | 4vCPU/8GB RAM |
Performance benchmarks
These benchmarks cover generating high-quality synthetic data for one or two (possibly linked) tables. This is useful for filling in missing values, backfilling historical data, and generating privacy-compliant data for analytics or machine learning tasks. The synthetic data is statistically similar to the original, so many tasks that rely on these statistical signals can be performed on the synthetic data instead.
Dataset Size* | 2vCPUs/4GB RAM | 4vCPUs/8GB RAM | 8vCPUs/16GB RAM | 16vCPUs/64GB RAM |
---|---|---|---|---|
Small (< 10^5 rows) | 12 min | 10.5 min | 8 min | 5 min |
Medium (> 10^5 and < 10^6 rows) | (RAM limited) | 12 min | 11 min | 6.5 min |
Large (>= 10^7 rows) | (RAM limited) | (RAM limited) | (RAM limited) | 20 min |
*Example datasets have 30 columns each, half categorical and half continuous. Configuration options are also available that limit generation time at the expense of data quality.
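As a rough aid for reading the benchmark table above, the following Python sketch looks up the smallest benchmarked hardware configuration that is not RAM limited for a given row count. It is not part of the Synthesized product: the names `HARDWARE`, `TIMINGS`, `size_class` and `smallest_hardware` are hypothetical, and the figures are simply copied from the table, so they only apply to datasets resembling the 30-column examples. Row counts that fall outside the published ranges return `None` rather than a guess.

```python
from typing import Optional, Tuple

# Hardware columns of the benchmark table, smallest first.
HARDWARE = ["2vCPUs/4GB RAM", "4vCPUs/8GB RAM", "8vCPUs/16GB RAM", "16vCPUs/64GB RAM"]

# Timings in minutes copied from the table; None stands for "(RAM limited)".
TIMINGS = {
    "small": [12, 10.5, 8, 5],
    "medium": [None, 12, 11, 6.5],
    "large": [None, None, None, 20],
}


def size_class(rows: int) -> Optional[str]:
    """Map a row count to the table's size class, or None if it is not covered."""
    if rows < 10**5:
        return "small"
    if 10**5 < rows < 10**6:
        return "medium"
    if rows >= 10**7:
        return "large"
    # Exactly 10^5 rows, or 10^6 up to 10^7 rows: no published benchmark.
    return None


def smallest_hardware(rows: int) -> Optional[Tuple[str, float]]:
    """Return (hardware, minutes) for the smallest configuration with a timing."""
    cls = size_class(rows)
    if cls is None:
        return None
    for hardware, minutes in zip(HARDWARE, TIMINGS[cls]):
        if minutes is not None:
            return hardware, minutes
    return None


if __name__ == "__main__":
    for rows in (50_000, 500_000, 5_000_000, 20_000_000):
        print(rows, smallest_hardware(rows))
```

A lookup like this only reproduces the published numbers; it does not interpolate between size classes or account for differences in column count or data types.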