Entity Annotation

Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently recognize fields that contain PII automatically, so by default the original values in such fields are reproduced in the generated data.

Tabular datasets often contain fields that, when combined, describe a specific entity, such as a unique person or postal address. For example, consider the dataset below, which contains customer PII:

title  first_name  last_name  gender  email                    amount
Mr     John        Doe        Male    john.doe@gmail.com       101.2
Mrs    Jane        Smith      Female  jane.smith@gmail.com     28.2
Dr     Albert      Taylor     Male    albert.taylor@aol.com    98.1
Ms     Alice       Smart      Female  alice.smart@hotmail.com  150.3

The combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person in this data, and there are strict relationships between these attributes. For example, when "title" is "Mrs" or "Ms", "first_name" will most likely contain a name given to females.

When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
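The effect of linking fields can be illustrated without the Synthesized API. The following minimal sketch, in plain Python, contrasts drawing each column independently with drawing linked values together. The `ALLOWED` compatibility rules and the helper names are assumptions made for this illustration, and copying whole records is only a stand-in for entity generation: Synthesized generates new entities rather than reusing source rows.

```python
import random

# (title, gender) pairs taken from the example table above.
records = [("Mr", "Male"), ("Mrs", "Female"), ("Dr", "Male"), ("Ms", "Female")]

# Assumed compatibility rules, for illustration only.
ALLOWED = {
    "Mr": {"Male"},
    "Mrs": {"Female"},
    "Ms": {"Female"},
    "Dr": {"Male", "Female"},
}


def is_consistent(row):
    """Check that the gender is plausible for the title."""
    title, gender = row
    return gender in ALLOWED[title]


def sample_independently(rng, n):
    """Draw title and gender from their own columns, ignoring each other."""
    titles, genders = zip(*records)
    return [(rng.choice(titles), rng.choice(genders)) for _ in range(n)]


def sample_linked(rng, n):
    """Draw title and gender together, so the relationship is preserved."""
    return [rng.choice(records) for _ in range(n)]


rng = random.Random(0)
independent = sample_independently(rng, 200)
linked = sample_linked(rng, 200)

broken = sum(not is_consistent(row) for row in independent)
print(f"independent sampling: {broken} of 200 rows are inconsistent")
print(f"linked sampling: {sum(not is_consistent(r) for r in linked)} of 200 rows are inconsistent")
```

Independent sampling produces rows such as ("Mrs", "Male"), which is the kind of broken entity an annotation is meant to prevent.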

Currently, Synthesized can handle the following entities:

  1. Person. Labels such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number can be annotated and generated.

  2. Address. Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated.

  3. Bank Account. Labels such as bic, sort_code, account can be annotated and generated.

  4. Company. Labels such as full_name, name, country, suffix, locales can be annotated and generated.

  5. Formatted String. For more flexible string generation, a column can be annotated as a formatted string with a pattern given as a regular expression. Synthesized will then generate random strings that match that pattern. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows the pattern.
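To make the formatted string example concrete, the sketch below checks the sample value against the pattern with Python's standard `re` module and builds a few more matching strings by hand. The `random_formatted_string` helper is hypothetical and hard-codes this one pattern; it does not show how Synthesized interprets arbitrary regular expressions.

```python
import random
import re
import string

PATTERN = r"[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}"


def random_formatted_string(rng):
    """Build one string that follows PATTERN, part by part."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(4))
    digits = "".join(rng.choice(string.digits) for _ in range(3))
    tail = "".join(
        rng.choice(string.ascii_uppercase + string.digits) for _ in range(6)
    )
    return f"{letters}-{digits}-{tail}"


rng = random.Random(42)
samples = [random_formatted_string(rng) for _ in range(5)]

# The example from the text and every generated sample match the pattern.
assert re.fullmatch(PATTERN, "KWNF-971-K20X8B")
assert all(re.fullmatch(PATTERN, s) for s in samples)
```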

Person

Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation.

from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels

The columns of a dataset that relate to the attributes of a person are specified using PersonLabels. These labels define the Person values that Synthesized can then generate.

person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[person])

It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names, and then passed to the list of annotations, e.g.

MetaExtractor.extract(df=…​, annotations=[person_1, person_2])

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(...)
df_synthesized = synthesizer.synthesize(num_rows=...)

PersonModel

PersonModel encapsulates the attributes of a person. When paired with a Person meta, it learns the attributes that define a person and then generates data from what it has learned. PersonModel captures gender internally using a Gender model and can be used to create the following attributes:

  • gender (orig.)

  • title (orig.)

  • first_name

  • last_name

  • email

  • username

  • password

  • home_number

  • work_number

  • mobile_number

Attributes marked with 'orig.' have values that correspond to the original dataset. The rest are intelligently generated based on the hidden model for the hidden attribute, _gender = {"F", "M", "NB", "A"}.

There are 3 special configuration cases for this model that should be considered:

  1. The attribute gender is present: In this case, the hidden model for _gender is based directly on the gender attribute. All values in the gender attribute should correspond to "F", "M", "U" or <NA>. In other words, there should be no ambiguous values in the collection "A".

  2. No gender present but title is present: The hidden model for _gender can be based on the available titles. As this is not a direct correspondence, not all values will correspond to a single collection. In other words, there may be some ambiguous values in the collection "A".

  3. Neither gender nor title is present: The hidden model for _gender cannot be fitted to the data, so the _gender attribute is assumed to be evenly distributed amongst the genders specified in the config.
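The three cases above can be summarised as a small decision rule. The sketch below is not PersonModel's implementation: the function names, the returned strings and the `TITLE_TO_GENDER` lookup are all hypothetical, chosen only to show how the source of the hidden _gender attribute depends on which columns are annotated.

```python
# Hypothetical title lookup, for illustration only; the titles that
# Synthesized recognises and how it maps them are not specified here.
TITLE_TO_GENDER = {"mr": "M", "mrs": "F", "ms": "F", "miss": "F"}


def hidden_gender_source(columns):
    """Return which of the three cases applies for a set of annotated labels."""
    if "gender" in columns:
        return "gender"   # case 1: based directly on the gender attribute
    if "title" in columns:
        return "title"    # case 2: inferred from titles, may be ambiguous
    return "uniform"      # case 3: nothing to fit, genders sampled evenly


def gender_from_title(title):
    """Map a title to a hidden gender, or "A" when the title is ambiguous."""
    key = title.strip().lower().rstrip(".")
    return TITLE_TO_GENDER.get(key, "A")


print(hidden_gender_source({"gender", "title", "first_name"}))
print(hidden_gender_source({"title", "first_name"}))
print(hidden_gender_source({"first_name", "last_name"}))
print(gender_from_title("Mr."), gender_from_title("Dr"))
```

A title such as "Dr" carries no gender information, which is why case 2 can leave some values in the ambiguous collection "A".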

A PersonModelConfig can be passed to PersonModel during initialization. Its 'person_locale' attribute specifies the locale of the generated people.

E.g. person_locale = 'ru_RU' refers to people from Russia.

This is useful when the synthesized people should belong to a particular locale.

import pandas as pd
import numpy as np
from synthesized.metadata.factory import MetaExtractor
from synthesized.config import PersonModelConfig, PersonLabels
from synthesized.metadata.value import Person
from synthesized.model.models import PersonModel

meta = Person('person', labels=PersonLabels(title='title', gender='gender', fullname='name',
                                firstname='firstname', lastname='lastname'))
# Configure the model to generate people for the Chinese locale.
person_model_config = PersonModelConfig()
person_model_config.person_locale='zh_CN'
model = PersonModel(meta=meta, config=person_model_config)
# Example data containing only the gender and title columns.
df = pd.DataFrame({
    'gender': np.random.choice(['m', 'f', 'u'], size=100),
    'title': np.random.choice(['mr', 'mr.', 'mx', 'miss', 'Mrs'], size=100)
})
# Fill the remaining labelled columns with a placeholder value.
df[[c for c in model.params.values() if c not in df.columns]] = 'test'

model.meta.revert_df_from_children(df)
# Fit the model and sample three synthetic people.
model.fit(df)
model.sample(3)

Address

Similarly, an Address annotation allows Synthesized to generate fake address details. Currently, only UK addresses can be generated.

from synthesized.metadata.value import Address
from synthesized.config import AddressLabels

The columns of a dataset that relate to the attributes of an address are specified using AddressLabels.

address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        county='county',
        city='city',
        district='district',
        street='street_name',
        house_number='house_number'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[address])

AddressModel

AddressModel models addresses. It uses Address meta, which represents columns with different address labels, such as city, house_number, postcode, full_address, etc., to capture all the information needed to recreate similar synthetic data.

AddressModelConfig can also be provided as a part of the initialization. AddressModelConfig contains information, such as whether or not an address file is provided, or if the postcodes need to be learned for address synthesis.

AddressModel uses PostcodeModel to learn and synthesize the addresses. If an address file is provided, the addresses corresponding to the learned postcodes are sampled from the file. If an address file is not provided, Faker is used to generate addresses.

The AddressModel class has a member variable 'postcode_level', which provides the flexibility to use a partial or full postcode for fitting and sampling.

E.g. for postcode "EC2A 2DP":

postcode_level=0 will signify "EC"

postcode_level=1 will signify "EC2A"

postcode_level=2 will signify "EC2A 2DP"
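The three levels can be reproduced with a few lines of string handling. The `truncate_postcode` helper below is hypothetical and is not part of AddressModel; it assumes a UK postcode whose outward and inward parts are separated by whitespace, as in the example above.

```python
import re

# Leading letters are the postcode area, e.g. "EC" in "EC2A 2DP".
_AREA = re.compile(r"^[A-Z]{1,2}")


def truncate_postcode(postcode, postcode_level):
    """Return the part of a UK postcode selected by postcode_level (0, 1 or 2)."""
    # Upper-case the input and collapse any run of whitespace to one space.
    normalised = " ".join(postcode.upper().split())
    if postcode_level == 0:
        match = _AREA.match(normalised)
        if match is None:
            raise ValueError(f"not a UK postcode: {postcode!r}")
        return match.group(0)
    if postcode_level == 1:
        # Everything before the space, e.g. "EC2A".
        return normalised.split(" ")[0]
    if postcode_level == 2:
        return normalised
    raise ValueError("postcode_level must be 0, 1 or 2")


for level in (0, 1, 2):
    print(level, truncate_postcode("EC2A 2DP", level))
```

A lower level groups many addresses under one key, so it needs less data per postcode at the cost of less geographic detail.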

Without address file

import pandas as pd
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel

config = AddressModelConfig(addresses_file=None, learn_postcodes=False)
df = pd.DataFrame({
    'postcode': ["" for _ in range(10)],
    'street': ["" for _ in range(10)],
    'full_address': ["" for _ in range(10)],
    'city': ["" for _ in range(10)]
})

annotations = [Address(
    name='Address',
    nan_freq=0.3,
    labels=AddressLabels(
        postcode='postcode', city='city',
        street='street', full_address='full_address'
    )
)]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)

With address file

import pandas as pd
from faker import Faker
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel

address_file_path = 'data/addresses.jsonl.gz'
config = AddressModelConfig(addresses_file=address_file_path, learn_postcodes=True)
fkr = Faker('en_GB')
df = pd.DataFrame({
    'postcode': [fkr.postcode() for _ in range(10)],
    'street': [fkr.street_name() for _ in range(10)],
    'full_address': [fkr.address() for _ in range(10)],
    'city': [fkr.city() for _ in range(10)]
})

annotations = [Address(name='Address', nan_freq=0.3,
                labels=AddressLabels(postcode='postcode', city='city',
                                    street='street', full_address='full_address'))]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)

Bank

Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.
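The two formats mentioned above are simple fixed-length digit strings. The following sketch uses only the Python standard library to produce and validate values of that shape. The helper names are hypothetical, the values are purely random digits that do not correspond to real banks, and any separators Synthesized may emit are not shown here.

```python
import random
import re


def fake_sort_code(rng):
    """Six random digits, the shape described for sort codes."""
    return "".join(str(rng.randint(0, 9)) for _ in range(6))


def fake_account_number(rng):
    """Eight random digits, the shape described for account numbers."""
    return "".join(str(rng.randint(0, 9)) for _ in range(8))


rng = random.Random(7)
sort_code = fake_sort_code(rng)
account = fake_account_number(rng)

# Both values have the expected length and contain digits only.
assert re.fullmatch(r"\d{6}", sort_code)
assert re.fullmatch(r"\d{8}", account)
print(sort_code, account)
```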

from synthesized.metadata.value import Bank
from synthesized.config import BankLabels

The columns of a dataset that relate to the bank account attributes are specified using BankLabels.

bank = Bank(
    name='bank',
    labels=BankLabels(
        sort_code='sort_code',
        account='account_number'
    )
)

Company

Defining a Company annotation allows Synthesized to generate fake company entities.

Below are some examples of how to use the Company annotation.

We start with a simple data frame (the company names can be anything, as long as they are strings):

import pandas as pd

df = pd.DataFrame({
    'company_name': ['Apple Inc.', 'Microsoft Corporation', 'Google LLC'],
    'employee_count': [10000, 50000, 100000],
})
df

company_name           employee_count
Apple Inc.             10000
Microsoft Corporation  50000
Google LLC             100000

First we specify a locale to generate the company names for. The default locale is en_GB.

from synthesized import HighDimSynthesizer
from synthesized.metadata.value import Company
from synthesized.config import CompanyLabels

# Set the locale of the company annotation to `locales`.
locales = ["en_GB"]

ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name'),
    locales=locales
)

# Create the synthesizer and synthesize.
synthesizer = HighDimSynthesizer.from_df(df, annotations=[ann])
synthesizer.fit(df)
synthesizer.sample(3)

company_name  employee_count
Thomson LLC   450
Harris Group  800000
Malder Ltd.   100000

It is also possible to specify multiple locales for the company annotation. This can be done by passing a list of locales to the locales parameter of the Company annotation.

locales = ["en_GB", "de_DE", "fr_FR"]

To generate company names along with their countries, use the country label when creating the annotation meta. Note that the original dataset must also contain the column mapped to the country label ("countries" in the example below), even if its values are all "".

ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name', country='countries'),
    locales=locales
)

company_name  employee_count  countries
Thomson LLC   450             United Kingdom
Sager GmbH    800000          Germany
Poirier SA    100000          France

FormattedString

A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers or customer account numbers that have a specific format.

from synthesized.metadata.value.categorical import FormattedString

The FormattedString is defined by passing the respective column name, and a regex pattern:

regex = "^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$"
social_security = FormattedString(name="social_security_number", pattern=regex)

df_meta = MetaExtractor.extract(df=data, annotations=[social_security])
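Before using a pattern in an annotation, it can help to check which values it accepts. The sketch below tests the social security pattern above against a few hand-picked candidates using Python's standard `re` module. The candidate values are made up for the illustration, and whether Synthesized supports every regular expression feature used here (such as negative lookaheads) in exactly the same way is an assumption.

```python
import re

SSN_PATTERN = re.compile(r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$")

# Candidate value -> whether the pattern should accept it.
candidates = {
    "123-45-6789": True,
    "666-45-6789": False,  # first group may not be 666
    "000-45-6789": False,  # first group may not be 000
    "912-45-6789": False,  # first group may not start with 9
    "123-00-6789": False,  # second group may not be 00
    "123-45-0000": False,  # third group may not be 0000
}

for value, expected in candidates.items():
    matched = SSN_PATTERN.match(value) is not None
    assert matched is expected, value
    print(f"{value}: {'accepted' if matched else 'rejected'}")
```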