Data Compliance Automation

Compliance and security requirements can make gaining access to production data for ML experiments or testing a lengthy and costly process, disrupting the workflow of data engineers and increasing project lead times. Furthermore, once access is granted, the dataset may prove inappropriate for an engineer's needs, meaning the whole access process must be restarted.

Concerns about sensitive information in data pipelines can be alleviated through the use of synthetic data. Synthetic data (such as the data generated by the HighDimSynthesizer) prevents certain privacy attacks because there is no one-to-one mapping between the original and the synthetic data. The SDK provides a range of tools that can be combined in an extensible manner to deal with various types of sensitive information:

| Privacy Tool Guide | Description |
|--------------------|-------------|
| Entity Annotation | Annotations provide the ability to link multiple related columns and synthesize realistic data as a whole. |
| Differential Privacy | The package provides methods for Differential Privacy which enforce mathematical guarantees on the privacy of the generated data. |
| Privacy Masks | The privacy masking tools allow more conventional methods of privacy preservation to be combined with synthetic data. |
| Strict Synthesis | The Sanitizer allows for a strict synthesis method where no synthetic data point will be close to an original data point. |
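The idea behind strict synthesis can be illustrated with a small distance filter. This is only a conceptual sketch in plain Python, not the Sanitizer's actual implementation or API: the function name `drop_near_originals`, the Euclidean metric, and the `min_distance` threshold are all assumptions made for illustration.

```python
import math


def drop_near_originals(synthetic, original, min_distance):
    """Keep only synthetic points at least `min_distance` away
    (Euclidean) from every original point. Hypothetical helper."""
    return [
        s for s in synthetic
        if all(math.dist(s, o) >= min_distance for o in original)
    ]


# Toy numeric data: two original points and four synthetic candidates.
original = [(1.0, 1.0), (5.0, 5.0)]
synthetic = [(1.1, 1.0), (3.0, 3.0), (5.0, 4.8), (8.0, 2.0)]

kept = drop_near_originals(synthetic, original, min_distance=0.5)
print(kept)  # the two candidates sitting next to an original point are dropped
```

In practice the right distance measure and threshold depend on the column types and scales involved, so the values above should be read as placeholders.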

Anonymization and Synthetic Data

Synthesized provides a comprehensive toolset to generate synthetic data that can be efficiently used for development and testing purposes. When deciding what tools to use when generating synthetic data, it is important to understand the types of sensitive data present in the original dataset.

There are typically three common types of data classification:

  1. Client Identifying Data or Personally Identifiable Information (CID or PII data), or data handled as CID/PII:

    • Cat A: Direct CID/PII

    • Cat B: Indirect sensitive ID for CID/PII

    • Cat C: CID/PII resulting from a combination of multiple attributes (e.g. Bank Sort Code + Bank Account Number)

  2. Cat D: Non-sensitive identifier for CID/PII

  3. Cat E: Non-CID

Figure: Compliance categories.

There will be instances where the most appropriate course of action is simply to redact the sensitive data, either removing the value completely or replacing it with "XXXX". An example of Cat A CID is an individual’s account number or sort code. In other instances, however, it may be sufficient to substitute data fields containing CID with non-CID. For example, an individual’s income is non-sensitive data but could potentially be used as an identifier for PII, making it a Cat D attribute. Therefore, to preserve the utility of the dataset while ensuring a degree of anonymization, it may be appropriate to generalise the numeric value to a broader range. The masking and obfuscation techniques available are detailed in the Privacy Masks documentation.
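As a rough illustration of the two approaches described above, the following plain-Python sketch redacts Cat A fields and generalises an income value into a band. It does not use the Synthesized masking API; the helper names `redact` and `generalise_income`, the sample records, and the band width of 10,000 are assumptions chosen for the example.

```python
def redact(value: str, mask: str = "XXXX") -> str:
    """Discard the sensitive value entirely and return a fixed mask."""
    return mask


def generalise_income(income: float, width: int = 10_000) -> str:
    """Map an exact income to the band of `width` that contains it."""
    lower = int(income // width) * width
    return f"{lower}-{lower + width - 1}"


# Made-up records for illustration only.
records = [
    {"account_number": "12345678", "sort_code": "12-34-56", "income": 42_500},
    {"account_number": "87654321", "sort_code": "65-43-21", "income": 58_999},
]

masked = [
    {
        "account_number": redact(r["account_number"]),  # Cat A: redact
        "sort_code": redact(r["sort_code"]),            # Cat A: redact
        "income": generalise_income(r["income"]),       # Cat D: generalise
    }
    for r in records
]

for row in masked:
    print(row)
```

Wider bands give stronger anonymization at the cost of utility, so the width is a trade-off to be chosen per dataset.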

Where identifying or sensitive information is spread across multiple linked columns, the dataset can be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than generating individual attributes independently. In addition, Synthesized supports attribute-level pseudo-anonymization, where information relating to a data subject (e.g. the data subject’s name) is substituted with a pseudonym or identifier (e.g. token, mnemonic, etc.) in order to make it difficult for an unauthorized receiving party to attribute such information to an identifiable data subject. As an example, a dataset may contain first names, last names and email addresses made up of a combination of the two. In this case, the sensitive values in the name fields can be substituted and a consistent email address generated from the substitutes. For a full list of possible entities that can be generated, see the Entity Annotation guide.
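The name-and-email example can be sketched in plain Python as follows. This is not how Synthesized implements entity annotation or pseudo-anonymization; it is a minimal keyed-hash illustration in which `SECRET_KEY`, the replacement name lists, the `example.com` domain, and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in a real system this would be managed securely.
SECRET_KEY = b"replace-with-a-secret-key"

# Small illustrative pools of replacement names (assumed, not from the SDK).
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]
LAST_NAMES = ["Reed", "Hayes", "Brooks", "Lane", "Ford"]


def _pick(value: str, options: list, salt: str) -> str:
    """Deterministically map `value` to one of `options` using a keyed hash."""
    digest = hmac.new(SECRET_KEY, f"{salt}:{value}".encode(), hashlib.sha256).digest()
    return options[int.from_bytes(digest[:4], "big") % len(options)]


def pseudonymise(record: dict) -> dict:
    """Substitute the name fields and rebuild the email from the substitutes,
    so the three linked columns stay consistent with one another."""
    first = _pick(record["first_name"], FIRST_NAMES, "first")
    last = _pick(record["last_name"], LAST_NAMES, "last")
    return {
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),
    }


original = {"first_name": "Jane", "last_name": "Smith", "email": "jane.smith@bank.com"}
print(pseudonymise(original))
```

Because the mapping is keyed and deterministic, the same input always yields the same pseudonym. With pools this small, different people will often share a pseudonym, and a substitute may occasionally coincide with the original value, so a realistic setup would need a much larger name pool.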

However, as explained in our blog posts How Weak Anonymization Became a Privacy Illusion and Three Common Misconceptions about Synthetic and Anonymized Data, anonymization techniques used in isolation can still be broken and may not fully satisfy all compliance requirements. A more secure and robust option from a compliance point of view is to anonymize the data first and then generate synthetic data from the anonymized data. That way, it can be guaranteed that there is no 1-to-1 mapping between entries in the original and synthetic data. In addition, the categories and data types are semantically new as well, i.e. there is no connection to the original data type.

If no annotations or masks are used in the configuration, the synthesizer will still generate data that can be linked to the original data. A simple example that illustrates this would be when synthesizing the following table with two columns.

| Row | Object |
|-----|--------|
| 1 | Apple |
| 2 | Orange |
| 3 | Orange |
| 4 | Apple |
| 5 | Orange |

Here, the synthesizer can generate a table with a different number of rows, and no single synthetic row corresponds to a single original row, but the actual categories are still the same. In this sense, the synthetic data can be linked to the original data.
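This effect can be reproduced with a toy stand-in for a synthesizer that simply samples values according to their observed frequencies. The sampler below is an assumption made purely for illustration and is far simpler than the models Synthesized uses, but it shows why the category values themselves carry over.

```python
import random
from collections import Counter

# The "Object" column of the example table.
original = ["Apple", "Orange", "Orange", "Apple", "Orange"]
counts = Counter(original)

# Toy "synthesizer": draw new rows in proportion to the observed frequencies.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = rng.choices(list(counts), weights=list(counts.values()), k=8)

print(len(original), len(synthetic))     # the row counts differ
print(set(synthetic) <= set(original))   # but every synthetic value is an original category
```

No synthetic row maps back to a particular original row, yet every value that appears in the output was taken verbatim from the original column.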

So when a column contains PII such as customer names instead of "Apple" and "Orange", the synthesizer should be configured with annotations and/or masks so as to preserve privacy (as described on this page). An annotation such as PersonAnnotation could be used to generate new names that cannot be linked to the original names.

Taking the above into account, Synthesized offers a significantly higher level of privacy than anonymization techniques alone, making it impossible to determine original records from the synthetic data with any certainty.