Privacy

Synthetic data, such as the data generated by the HighDimSynthesizer, automatically prevents certain privacy attacks because there is no one-to-one mapping between the original records and the synthetic ones. Additional privacy metrics are available and the framework is extensible; these metrics are detailed in this section of the documentation.

The package provides three further tools. Differential privacy methods enforce mathematical guarantees on the privacy of the generated data. The privacy masking tools allow more conventional methods of privacy preservation to be combined with synthetic data. Finally, the Sanitizer provides a strict synthesis method in which no synthetic data point lies close to an original data point.
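The Sanitizer's actual algorithm is not described here. As a rough illustration of the idea only, the sketch below drops any synthetic row that lies within a chosen Euclidean distance of an original row. The function name `drop_near_originals`, the distance measure, and the threshold are assumptions made for this example and are not part of the Synthesized API.

```python
import math


def drop_near_originals(synthetic, original, min_distance):
    """Keep only synthetic rows at least `min_distance` away from every original row.

    Rows are equal-length sequences of numbers. This is a brute-force check
    (every synthetic row against every original row), meant to show the idea
    rather than to scale.
    """
    return [
        row
        for row in synthetic
        if all(math.dist(row, orig) >= min_distance for orig in original)
    ]


original_rows = [(0.0, 0.0), (10.0, 10.0)]
synthetic_rows = [(0.1, 0.0), (5.0, 5.0), (10.0, 10.2)]

# Only the row that is far from both originals survives.
print(drop_near_originals(synthetic_rows, original_rows, min_distance=1.0))
```

In practice the columns would need to be scaled to comparable ranges before a single distance threshold is meaningful.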

Examples of anonymization:

  • Deleting data fields containing CID (redaction)

  • Substituting data fields containing CID with non-CID. The deletion or substitution process must not allow a restoration or recovery of the original data.

There will be instances where the most appropriate course of action is simply to redact the sensitive data, either by removing the value completely or by replacing it with a placeholder such as 'XXXX'.

Synthesized supports attribute-level data anonymization, where information relating to a data subject (e.g. a client's name) is removed, thereby eliminating the possibility of identifying the data subject. The process is irreversible and is achieved through data obfuscation. If needed, the user can obfuscate any data with any of the following techniques:

  1. Partial masking. Values can be partially (or totally) substituted by a placeholder character, "x" by default. For example, the value "4905 9328 9320 4630" would be replaced by "xxxx xxxx xxxx 4630".

  2. Nulling. The contents of a column can be completely removed, and the output dataset would contain an empty column.

  3. Swapping. The output column contains the same unique values as the input column, but they are randomly shuffled so that correlations with other columns are completely lost.

  4. Random strings. Random strings are generated with a format similar to that of the input values. For example, "490GH830L" could be transformed into "L3N8O3H2M".

  5. Generalization. Individual attribute values are replaced with a broader category. For example, the value '19' of the attribute 'Age' may be replaced by 'Age ≤ 20', the value '23' by '20 < Age ≤ 30', etc.
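The five techniques above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Synthesized implementation: all function names are hypothetical, the masking keeps the last four alphanumeric characters visible, the random-string helper preserves the character class at each position (a stricter reading of "similar format" than the example above), and the age buckets are ten years wide with a first bucket of 'Age ≤ 20'.

```python
import random
import string


def partial_mask(value, visible=4, placeholder="x"):
    """Mask all alphanumeric characters except the last `visible` ones."""
    to_mask = max(sum(ch.isalnum() for ch in value) - visible, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(placeholder)
            to_mask -= 1
        else:
            out.append(ch)  # separators and the visible tail are kept
    return "".join(out)


def null_column(values):
    """Nulling: return a column of the same length with every value removed."""
    return [None] * len(values)


def swap_column(values, rng):
    """Swapping: same values, randomly reordered, so row-level links are lost."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def random_string_like(value, rng):
    """Random string: digits stay digits, letters stay letters (same case)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            letters = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(letters))
        else:
            out.append(ch)
    return "".join(out)


def generalize_age(age, width=10):
    """Generalization: replace an integer age with a bucket label."""
    if age <= 20:
        return "Age ≤ 20"
    upper = (age + width - 1) // width * width  # round up to a multiple of width
    return f"{upper - width} < Age ≤ {upper}"


print(partial_mask("4905 9328 9320 4630"))  # xxxx xxxx xxxx 4630
print(generalize_age(23))                   # 20 < Age ≤ 30
```

Passing an explicit `random.Random` instance to the randomized helpers keeps a run reproducible when a seed is supplied.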

Synthesized supports attribute-level pseudo-anonymization, where information relating to a data subject (e.g. the data subject’s name) is substituted with a pseudonym or identifier (e.g. a token or mnemonic) to make it difficult for an unauthorized receiving party to attribute that information to an identifiable data subject. The process can be configured to be reversible, i.e. the original data element is stored together with the token in a secure system of choice, subject to appropriate access control. The substitution of production values with realistic "fake" data can be kept consistent across attributes using the Synthesized annotation feature (e.g. a real name is replaced by a "fake" name, and the same "fake" name can then be used to create a "fake" email address).
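Reversible token substitution can be pictured with the following sketch. It is only an illustration: `TokenVault` is a hypothetical in-memory stand-in for the "secure system of choice", and the token format is an assumption. A real deployment would keep the mapping in a protected store behind access control.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token store (illustration only, not secure storage)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def pseudonymize(self, value):
        """Return the token for `value`, creating one on first use.

        The same input always maps to the same token, so joins and
        group-bys on the pseudonymized column still work.
        """
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)  # random, carries no information about value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def reidentify(self, token):
        """Reverse the substitution; only callers with vault access can do this."""
        return self._value_by_token[token]


vault = TokenVault()
names = ["Jane Doe", "John Roe", "Jane Doe"]
tokens = [vault.pseudonymize(name) for name in names]
print(tokens[0] == tokens[2])        # True: repeated values share a token
print(vault.reidentify(tokens[1]))   # John Roe
```

Dropping the reverse mapping (or never storing it) turns the same scheme into an irreversible substitution.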

Generated "fake" data is coherent across columns. For example, suppose the combination of ("title", "first_name", "last_name", "gender", "email") describes a unique person in a dataset. There is a strict relationship between these attributes: when "title" is "Mrs" or "Ms", "gender" will be "Female" and "first_name" will contain a name given to females.

When it is important to maintain this correct description of an entity in the generated synthetic data, the dataset can be easily annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
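The difference between generating an entity as a whole and generating attributes independently can be shown with a small sketch. It does not reproduce how Synthesized learns entities; the catalog contents, the `generate_person` function, and the email format are all made up for illustration.

```python
import random

# Hypothetical miniature catalog; real catalogs would be far larger.
CATALOG = {
    "Female": {"titles": ["Mrs", "Ms"], "first_names": ["Alice", "Maria"]},
    "Male": {"titles": ["Mr"], "first_names": ["James", "Omar"]},
}
LAST_NAMES = ["Smith", "Khan", "Garcia"]


def generate_person(rng):
    """Generate all linked person attributes together so they stay consistent."""
    gender = rng.choice(sorted(CATALOG))
    entry = CATALOG[gender]
    first_name = rng.choice(entry["first_names"])
    last_name = rng.choice(LAST_NAMES)
    return {
        "title": rng.choice(entry["titles"]),  # drawn from the titles valid for this gender
        "first_name": first_name,
        "last_name": last_name,
        "gender": gender,
        # The email is derived from the generated name, not sampled separately.
        "email": f"{first_name}.{last_name}@example.com".lower(),
    }


rng = random.Random(42)
people = [generate_person(rng) for _ in range(3)]
for person in people:
    print(person)
```

Sampling each column on its own would, by contrast, allow combinations such as title "Mrs" with gender "Male", which is what the annotation is meant to prevent.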

The following annotation entities are enabled by default and others can be configured as well:

  1. Person. Labels such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number can be annotated and generated.

  2. Address. Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated.

  3. Bank Account. Labels such as bic, sort_code, account can be annotated and generated.

  4. Formatted String. For more flexible string generation, a column can be annotated as a formatted string and given a pattern (in the form of a regular expression); the software then generates random strings that match it. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows that pattern.
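To make the formatted-string idea concrete, here is a minimal generator for a deliberately restricted pattern subset: character classes such as `[A-Z0-9]`, an optional fixed repeat count `{n}` after a class, and literal characters. The `generate` function is a hypothetical helper written for this example; it is not the Synthesized API and does not handle general regular expressions.

```python
import random
import re

# One token is either a character class with an optional {n}, or a single literal.
TOKEN = re.compile(r"\[([^\]]+)\](?:\{(\d+)\})?|(.)")


def expand_class(body):
    """Expand a class body such as 'A-Z0-9' into the list of allowed characters."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range like A-Z: add every character from start to end inclusive.
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def generate(pattern, rng):
    """Generate one random string matching the restricted pattern."""
    out = []
    for match in TOKEN.finditer(pattern):
        body, count, literal = match.groups()
        if body is not None:
            alphabet = expand_class(body)
            out.extend(rng.choice(alphabet) for _ in range(int(count or 1)))
        else:
            out.append(literal)  # literal characters are copied through
    return "".join(out)


rng = random.Random(0)
pattern = "[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}"
value = generate(pattern, rng)
print(value, bool(re.fullmatch(pattern, value)))
```

Checking each generated value with `re.fullmatch` against the original pattern is a cheap way to confirm that the generator and the pattern agree.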

The entity annotation guide explains how you can annotate your attributes to apply fake data instead of synthetic data.

The product has its own catalogs of “fake” values. The following entities are available by default and others can be added on demand:

  1. Personal information such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number (configured by default for English names but available in different locales)

  2. Address fields (such as postcode, county, city, district, street, house, flat, house_name, full_address) can be annotated and generated. On demand, Synthesized can also provide a dictionary of real addresses, so that the generated addresses are real (although still independent of the original dataset).

  3. Bank account information (bic, sort_code, account, IBAN, credit cards)

  4. Any other entity can be generated by the user just by providing a pattern (in the form of a regular expression). For example, the user can generate European car plates by providing the pattern “[0-9]{4}-[A-Z]{3}”, which may generate values like “3816-NXY”, “0981-ABX” or “2607-BCM”.