Choosing the right transformation

Synthesized provides a comprehensive toolset for creating test data that can be used efficiently for development and testing purposes. Test data is the result of a transformation applied to input data, combined with data generation. The product suits a variety of test data purposes; you can find information on features such as data rebalancing, data imputation, and hashing in this documentation. The output test data is typically used in software development cycles, QA, testing of ETL engines, exploratory testing, exploratory analysis in data science / machine learning tasks, testing of hypotheses, building synthetic data centers of excellence, and more. Please reach out to the product manager assigned to your license for additional information.

Here we provide an illustrative example of how to choose the right transformation for a given input dataset and end goal. There are two key steps:

  • Step 1. Understand data requirements for a given task

  • Step 2. Understand internal data categorization for compliance and metadata

Step 1. Understand data requirements for a given task

The software provides a comprehensive toolkit for creating data. Software engineers, product managers, machine learning engineers, and data scientists often have concrete requirements for that data, be it test data or a training dataset. Understanding these requirements is essential for picking the right set of transformations.

User requirements are often tied to a specific use case, be it an exploratory BI task, exploratory testing, or testing of a given machine learning model.

Here we provide a list of typical user requirements on a dataset level:

  • Dataset name and location

  • Application / use case description and intended purpose, e.g. churn modeling

  • List of tables required to be delivered

  • List of columns within those tables that are (a) needed by the user and (b) NOT needed to meet the business requirements. More complex requirements typically require longer processing time and alignment with data compliance and governance requirements (see below)

  • List of columns where statistical properties are desired for user needs

  • List of columns with functional requirements and rules, such as explicit formulas between columns or nested column structures (e.g. Country - City)

  • Overall dataset size / volume requirements

  • Specific data group size / volume requirements

These requirements are normally addressed by the software through three key transformations:

  • Data reshaping / subsetting

      • Data subsetting

      • Data rebalancing

      • Bootstrapping concrete data groups

  • Data generation

      • Generation preserving statistical properties

      • Generation following given distributions and rules

  • Data masking / obfuscation

      • Masking according to regular expressions

      • Hashing, noising, reduction

      • Sampling from a dictionary

Depending on the intended purpose, one transformation is often preferred over another.
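
To make the first family concrete, below is a minimal sketch of data subsetting and rebalancing in plain pandas (not the product API); the table and column names are purely illustrative:

```python
import pandas as pd

# Toy input: a customer table with a rare "churned" segment (~2% positives).
df = pd.DataFrame({
    "customer_id": range(1000),
    "age": [25 + i % 40 for i in range(1000)],
    "churned": [i % 50 == 0 for i in range(1000)],
})

# Data subsetting: keep only the rows and columns the use case needs.
subset = df.loc[df["age"] < 45, ["age", "churned"]]

# Data rebalancing / bootstrapping a concrete data group: upsample the
# rare "churned" group (sampling with replacement) until both groups
# are the same size.
majority = subset[~subset["churned"]]
minority = subset[subset["churned"]]
boosted = minority.sample(n=len(majority), replace=True, random_state=0)
rebalanced = pd.concat([majority, boosted]).sample(frac=1, random_state=0)

print(rebalanced["churned"].value_counts())
```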

Step 2. Understand internal data categorization for compliance and metadata

Once we have access to the input data and understand the data requirements in detail, the remaining step is to understand the data categorization and metadata (sometimes required for compliance reasons). This information can often be retrieved from an internal data catalogue or an internal data governance system. Organizations typically tie different data classification levels to different types of data.

There are typically three common types of data classification:

  • 1. Client Identifying Data or Personally Identifiable Information (CID or PII), or data handled as CID/PII

      • Cat:A Direct CID/PII

      • Cat:B Indirect sensitive identifier for CID/PII

      • Cat:C CID/PII resulting from a combination of multiple attributes (e.g. Bank Sort Code + Bank Account Number)

  • 2. Cat:D Non-sensitive identifier for CID/PII

  • 3. Cat:E Non-CID

[Figure: compliance categories]

We can annotate a given dataset following internal data classifications and metadata.

Example categorization:
  • Account number: Cat:A CID/PII

  • Depot code: Cat:A CID/PII

  • Income: Cat:D Non-sensitive identifier for CID/PII

  • Age: Cat:D Non-sensitive identifier for CID/PII

Once the dataset is annotated according to the internal data classification, we can pick the right transformations to apply to each category.

  • Cat:A CID: masked / fake generated / masked & synthesized preserving properties

  • Cat:B CID: masked / fake generated / masked & synthesized preserving properties

  • Cat:C Resulting CID: masked / fake generated / masked & synthesized preserving properties

  • Cat:D Non-sensitive identifier for CID: synthesized preserving properties / masked

  • Cat:E Non-CID: synthesized preserving properties / masked / can be kept as-is
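
As an illustration of how such a mapping can drive per-column processing, the sketch below dispatches on the annotated category. It uses plain Python and pandas with simplified stand-ins (a salted hash for masking, empirical resampling for synthesis) rather than the product API, and the column names echo the example above:

```python
import hashlib
import pandas as pd

# Hypothetical annotation following the internal data classification.
CATEGORIES = {
    "account_number": "A",  # direct CID/PII
    "depot_code": "A",      # direct CID/PII
    "income": "D",          # non-sensitive identifier for CID/PII
    "age": "D",
}

def mask(series: pd.Series) -> pd.Series:
    # Irreversible masking via a salted hash (truncated for readability).
    return series.astype(str).map(
        lambda v: hashlib.sha256(("salt:" + v).encode()).hexdigest()[:12]
    )

def synthesize(series: pd.Series) -> pd.Series:
    # Stand-in for a generative model: resampling the empirical
    # distribution preserves the marginal statistics of the column
    # (a real generator would also preserve cross-column correlations).
    return series.sample(frac=1.0, replace=True, ignore_index=True)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = {}
    for col in df.columns:
        cat = CATEGORIES.get(col, "E")
        if cat in ("A", "B", "C"):   # CID/PII: always mask or regenerate
            out[col] = mask(df[col])
        elif cat == "D":             # identifier for CID/PII: synthesize
            out[col] = synthesize(df[col])
        else:                        # Cat:E non-CID: may be kept as-is
            out[col] = df[col]
    return pd.DataFrame(out)
```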

Additional examples of data requests and suggested actions:

Column X is not part of the data requirements and is Cat:A CID: we can apply masking / fake generation, e.g. for a customer ID that is not needed for a BI or data science task.

Column Y is part of the data requirements (data bootstrapping & rebalancing) and is Cat:A CID: we can apply masking & synthesis preserving statistical properties to reach the required overall data volume and the required volume of a specific data group.

Column Z is part of the data requirements (testing of a data science model) and is Cat:D: we can use synthesized data that preserves statistical properties to ensure the required information is not lost.

Column V is part of the data requirements (testing of a software application) and is Cat:D: we can use fake data generation / sampling from a dictionary, or keep the column as-is, to preserve the look and feel of production data.

Sanity checks and recommendations

It is important to understand the requirements for the test data, as they define which transformation needs to be applied to a given table or to a column within that table.

Common sanity checks:

BI / Data Science often require the statistical properties of the data to be preserved

Typical transformations required for test data for BI / Data Science include:

  • Data subsetting

  • Data rebalancing

  • Bootstrapping concrete data groups

  • Data generation

      • Generation preserving statistical properties
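
As a toy illustration of “generation preserving statistical properties”, the sketch below matches only per-column marginal distributions; a production-grade generator also models correlations between columns. Plain numpy/pandas, with hypothetical column handling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def generate_like(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Generate n synthetic rows that match each column's marginal
    distribution: numeric columns are fit with a normal distribution,
    everything else is resampled by observed frequency."""
    out = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            out[col] = rng.normal(s.mean(), s.std(ddof=0), size=n)
        else:
            freqs = s.value_counts(normalize=True)
            out[col] = rng.choice(freqs.index, size=n, p=freqs.values)
    return pd.DataFrame(out)
```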

BI / Data Science often require the same data types to be preserved

Data Science models normally need to work on production-like data types so that, once deployed to production, they do not break and have already been tested against realistic inputs.

That means that many masking techniques are not recommended when the intended purpose is to use test data in BI/Data Science.

Exploratory testing tasks don’t require statistical properties to be preserved, but they often require rule dependencies between specific tables to be preserved.

For software engineering and testing tasks, it’s essential to reproduce the overall look and feel of a dataset so that component testing, usability testing, performance testing, or load testing can be run for a given piece of functionality or feature; high-order statistical properties are not required.

That means that data generation techniques / sampling from a dictionary can often be sufficient, as opposed to preserving all correlations in the data. At the same time, masking is often not desired, as it “destroys” the look of the test data, i.e. it may not produce the data types a given software application requires as input.
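
For example, a few lines of dictionary sampling are often enough to produce production-looking rows for functional tests, including a simple Country -> City rule dependency; the dictionaries and fields below are purely illustrative:

```python
import random

random.seed(0)

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dan"]
CITIES = {"UK": ["London", "Leeds"], "DE": ["Berlin", "Munich"]}

def fake_row() -> dict:
    country = random.choice(list(CITIES))
    return {
        "name": random.choice(FIRST_NAMES),
        "country": country,
        # The city is sampled consistently with the country, preserving
        # the Country -> City rule dependency mentioned above.
        "city": random.choice(CITIES[country]),
        # Same type and shape as a real account number, but fake.
        "account_number": f"{random.randint(0, 99_999_999):08d}",
    }

rows = [fake_row() for _ in range(5)]
```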

Categorical data is sometimes CID

As a sanity check: some categorical data is CID and hence needs to be sampled from a dictionary or anonymized & synthesized. At the same time, a lot of categorical data is not CID itself but can be an identifier for CID. In that case, synthesizing the data while preserving its statistical properties is usually the best fit for creating test data.

Choosing between anonymization vs sampling

When choosing between anonymization and sampling to meet compliance requirements for CID types (Cat:A, Cat:B & Cat:C), it’s important to understand that anonymization by itself can often be broken; see the references and case studies below.

A more secure and robust option from a compliance point of view is to combine anonymization with subsequent data generation from the anonymized data. That way, we guarantee there is no 1-to-1 mapping to the original data, and the generated categories and data types are semantically new as well, i.e. they have no connection to the original data.
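
A minimal sketch of this two-step pipeline, assuming a salted hash for the anonymization step and frequency-matched token generation for the second step (plain Python, not the product API):

```python
import hashlib
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def anonymize_then_generate(series: pd.Series) -> pd.Series:
    # Step 1: anonymize. Replace raw values with salted hashes so the
    # generation step never sees the original data.
    hashed = series.astype(str).map(
        lambda v: hashlib.sha256(("salt:" + v).encode()).hexdigest()[:8]
    )
    # Step 2: generate. Mint completely new tokens whose frequency
    # profile matches the anonymized column (same number of distinct
    # values, same counts). Aggregate statistics survive, but the
    # emitted tokens carry no computable link back to the originals.
    counts = hashed.value_counts()
    fresh = [f"ID-{rng.integers(10**7, 10**8)}" for _ in range(len(counts))]
    pool = np.repeat(fresh, counts.values)
    return pd.Series(rng.permutation(pool))
```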