Entity Annotation

Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently automatically recognize fields that contain PII, and therefore the default behaviour will be to generate the original data from such fields.

Tabular datasets often contain fields that, when combined, describe a specific entity, such as a unique person or postal address. For example, consider the dataset below that contains customer PII:

title	first_name	last_name	gender	email	amount
Mr	John	Doe	Male	john.doe@gmail.com	101.2
Mrs	Jane	Smith	Female	jane.smith@gmail.com	28.2
Dr	Albert	Taylor	Male	albert.taylor@aol.com	98.1
Ms	Alice	Smart	Female	alice.smart@hotmail.com	150.3

The combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person in this data, and there are strict relationships between these attributes. For example, when "title" is "Mrs" or "Ms", "first_name" will most likely contain a name given to females.

When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
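The difference between generating attributes independently and generating an entity as a whole can be illustrated with a small standard-library sketch. It is a toy model only, not how Synthesized works internally: the rows are the title and gender values from the table above, and the helper names (sample_independently, sample_as_entity, is_consistent) are hypothetical.

```python
import random

# Toy rows standing in for the customer table above (title and gender only).
ROWS = [
    ("Mr", "Male"),
    ("Mrs", "Female"),
    ("Dr", "Male"),
    ("Ms", "Female"),
]


def sample_independently(rng, n):
    """Draw each attribute from its own column, ignoring the other column."""
    titles = [title for title, _ in ROWS]
    genders = [gender for _, gender in ROWS]
    return [(rng.choice(titles), rng.choice(genders)) for _ in range(n)]


def sample_as_entity(rng, n):
    """Draw whole (title, gender) pairs so the attributes stay linked."""
    return [rng.choice(ROWS) for _ in range(n)]


def is_consistent(title, gender):
    """In this toy model a pair is consistent if it occurs in the original rows."""
    return (title, gender) in ROWS


rng = random.Random(0)
independent = sample_independently(rng, 1000)
linked = sample_as_entity(rng, 1000)

mismatches = sum(not is_consistent(t, g) for t, g in independent)
linked_mismatches = sum(not is_consistent(t, g) for t, g in linked)
print(f"independent sampling: {mismatches} inconsistent rows out of 1000")
print(f"entity sampling: {linked_mismatches} inconsistent rows out of 1000")
```

Independent sampling produces pairs such as ("Mr", "Female") that never occur in the original data, which is the problem an annotation is meant to avoid.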

Currently, Synthesized can handle the following entities:

Person. Labels such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number can be annotated and generated.
Address. Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated.
Bank Account. Labels such as bic, sort_code, account can be annotated and generated.
Company. Labels such as full_name, name, country, suffix, locales can be annotated and generated.
Formatted String. For a more flexible string generation, one can annotate a column as a formatted string, give it a pattern (in form of regular expression), and the software will generate random strings based on the given regular expression. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows that pattern.
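To make the formatted string example concrete, the sketch below builds strings for that one pattern by hand and checks them with Python's re module. It does not use Synthesized and is not a general regex-driven generator; random_code is a hypothetical helper written only for this pattern.

```python
import random
import re
import string

PATTERN = "[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}"


def random_code(rng):
    """Build one string for PATTERN: 4 letters, 3 digits, 6 alphanumerics."""
    letters = "".join(rng.choices(string.ascii_uppercase, k=4))
    digits = "".join(rng.choices(string.digits, k=3))
    tail = "".join(rng.choices(string.ascii_uppercase + string.digits, k=6))
    return f"{letters}-{digits}-{tail}"


rng = random.Random(42)
samples = [random_code(rng) for _ in range(3)]

# Each generated string, and the example from the text, matches the pattern in full.
all_match = all(re.fullmatch(PATTERN, s) for s in samples)
example_matches = re.fullmatch(PATTERN, "KWNF-971-K20X8B") is not None
print(samples, all_match, example_matches)
```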

Person

Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation. The Person model will intelligently handle the generation of the fields provided, ensuring consistency across gender, language, and country for a given row in the synthetic dataset.

The columns of a dataset that relate to the attributes of a person are specified using PersonLabels. These labels define the Person attributes that Synthesized will generate. The PersonLabels can contain the following attributes:

title: Name of column containing titles (e.g. "Mr", "Mrs").
gender: Name of column containing genders (e.g. Male, Female, Non-binary).
fullname: Name of column containing full names.
firstname: Name of column containing first names.
lastname: Name of column containing last names.
email: Name of column containing email addresses.
username: Name of column containing usernames.
password: Name of column containing passwords.
mobile_number: Name of column containing mobile telephone numbers.
home_number: Name of column containing home telephone numbers.
work_number: Name of column containing work telephone numbers.
country: Name of column containing country names or country codes (e.g. Spain or ES).

from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels

person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)

df_meta = MetaExtractor.extract(df=data, annotations=[person])

It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names, and then passed to the list of annotations, e.g.

MetaExtractor.extract(df=…, annotations=[person_1, person_2])

Locale/Language and Country

The Person annotation can be configured to generate names and emails that correspond to a specific language or country. This can be done in two ways:

By setting the locales argument in the Person annotation. This can be a single locale (e.g. "en_GB") or a list of locales to sample from (e.g. ["en_GB", "de_DE", "fr_FR"]).
Using a country column in the dataset containing country names or country codes (e.g. "Spain" or "ES"). The country label in the PersonLabels class must be set to label this column.

When a locale is set, the generated names and emails will correspond to that locale. If a country label is set, the generated names and emails will correspond to the country specified in the column.

The country label and the locales argument are mutually exclusive and cannot be used together. Trying to use both will result in an error.
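A pre-flight check along these lines can catch the conflict before an annotation is built. This is a sketch only: check_person_config is a hypothetical helper, not part of the Synthesized API, and the choice of ValueError is an assumption, since the text above does not say which error type is raised.

```python
from typing import List, Optional, Union


def check_person_config(
    locales: Optional[Union[str, List[str]]] = None,
    country_label: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: normalise `locales` and reject mixing it with a country label."""
    if locales is not None and country_label is not None:
        # Assumed error type; the real library may raise something else.
        raise ValueError("Set either `locales` or the `country` label, not both.")
    if locales is None:
        return []
    # A single locale string is treated as a one-element list.
    return [locales] if isinstance(locales, str) else list(locales)


print(check_person_config(locales="en_GB"))
print(check_person_config(locales=["en_GB", "de_DE", "fr_FR"]))
print(check_person_config(country_label="country"))
```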

For the Person annotation, the following locales, country names and country codes are supported:

Supported Country Names, codes and locales

Country Code	Country Names (capitalization ignored)	Associated Locales
AR	argentina & argentine republic	es_AR
AM	republic of armenia & armenia	hy_AM
AT	republic of austria & austria	de_AT
AZ	republic of azerbaijan & azerbaijan	az_AZ
BE	kingdom of belgium & belgium	nl_BE, fr_BE
BD	people’s republic of bangladesh & bangladesh	bn_BD
BG	bulgaria & republic of bulgaria	bg_BG
BR	brazil & federative republic of brazil	pt_BR
CA	canada	fr_CA
CH	switzerland & swiss confederation	de_CH, fr_CH
CL	republic of chile & chile	es_CL
CN	people’s republic of china & china	zh_CN
CO	republic of colombia & colombia	es_CO
CZ	czech republic & czechia	cs_CZ
DE	federal republic of germany & germany	de_DE
DK	denmark & kingdom of denmark	da_DK
ES	spain & kingdom of spain	es_ES
EE	republic of estonia & estonia	et_EE
FI	finland & republic of finland	fi_FI
FR	france & french republic	fr_FR
GB	united kingdom of great britain and northern ireland & uk & united kingdom	en_GB
GE	georgia	ka_GE
GR	greece & hellenic republic	el_GR
HR	croatia & republic of croatia	hr_HR
HU	hungary	hu_HU
ID	republic of indonesia & indonesia	id_ID
IN	republic of india & india	hi_IN, en_IN
IE	ireland	en_IE, ga_IE
IR	islamic republic of iran & iran & iran, islamic republic of	fa_IR
IL	israel & state of israel	he_IL
IT	italian republic & italy	it_IT
JP	japan	ja_JP
KR	south korea & korea, republic of	ko_KR
LT	republic of lithuania & lithuania	lt_LT
LV	republic of latvia & latvia	lv_LV
MX	united mexican states & mexico	es_MX
NL	kingdom of the netherlands & netherlands	nl_NL
NO	kingdom of norway & norway	no_NO
NP	nepal & federal democratic republic of nepal	ne_NP
NZ	new zealand	en_NZ
PL	poland & republic of poland	pl_PL
PT	portuguese republic & portugal	pt_PT
PS	the state of palestine & palestine, state of	ar_PS
RO	romania	ro_RO
RU	russian federation & russia	ru_RU
SA	kingdom of saudi arabia & saudi arabia	ar_SA
SI	republic of slovenia & slovenia	sl_SI
SE	sweden & kingdom of sweden	sv_SE
TH	kingdom of thailand & thailand	th_TH
TR	turkey & republic of türkiye & türkiye	tr_TR
UA	ukraine	uk_UA
US	united states & united states of america	en_US

Titles, genders and phone number attributes are currently not affected by the locale or country settings.
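The mapping from a country value to its locales can be pictured as a case-insensitive lookup. The sketch below covers only a four-country excerpt of the supported list and is not the library's implementation: resolve_locales and both dictionaries are hypothetical, and upper-casing country codes before matching is an assumption of this sketch.

```python
from typing import Dict, List

# Excerpt of the supported countries, keyed by country code.
LOCALES_BY_CODE: Dict[str, List[str]] = {
    "BE": ["nl_BE", "fr_BE"],
    "ES": ["es_ES"],
    "GB": ["en_GB"],
    "US": ["en_US"],
}

# Country names are stored in lower case because capitalization is ignored.
CODE_BY_NAME: Dict[str, str] = {
    "belgium": "BE",
    "kingdom of belgium": "BE",
    "spain": "ES",
    "kingdom of spain": "ES",
    "uk": "GB",
    "united kingdom": "GB",
    "united states": "US",
    "united states of america": "US",
}


def resolve_locales(country: str) -> List[str]:
    """Return the locales for a country code or country name (hypothetical helper)."""
    value = country.strip()
    if value.upper() in LOCALES_BY_CODE:
        code = value.upper()
    else:
        code = CODE_BY_NAME.get(value.lower())
    if code is None:
        raise KeyError(f"Unsupported country: {country!r}")
    return LOCALES_BY_CODE[code]


print(resolve_locales("Spain"))
print(resolve_locales("ES"))
print(resolve_locales("Kingdom of Belgium"))
```

A country with several locales, such as Belgium, resolves to a list from which a locale would then be sampled.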

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(...)
df_synthesized = synthesizer.synthesize(num_rows=...)

There are 4 special configuration cases for this model that should be considered:

The gender label is present: the distribution of genders will be learnt from this column.
No gender label present but title label is present: The distribution for genders will be inferred from the titles column.
Both gender and title labels are present: The distribution will be learnt from the gender column and values in the title column will align with this.
Neither gender nor title are present: The gender distribution will be assumed to be equal male and female.
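The four cases above amount to a small decision table. The function below restates them in code for quick reference; gender_distribution_source is a hypothetical name and the returned strings are descriptions, not values used by Synthesized.

```python
def gender_distribution_source(has_gender: bool, has_title: bool) -> str:
    """Describe where the gender distribution comes from for a Person annotation."""
    if has_gender and has_title:
        return "learnt from the gender column; titles are aligned with it"
    if has_gender:
        return "learnt from the gender column"
    if has_title:
        return "inferred from the title column"
    return "assumed to be equal male and female"


for has_gender in (True, False):
    for has_title in (True, False):
        source = gender_distribution_source(has_gender, has_title)
        print(f"gender={has_gender}, title={has_title}: {source}")
```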

Example 1: Using the `country` label

In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using a country label. The dataset contains columns for title, gender, firstname, lastname, email, and country. We will use the country label to generate names and emails that correspond to the country specified in the dataset.

import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer

orig_df = pd.DataFrame({
    "title": ["Mr", "Ms", "Mrs"],
    "gender": ["M", "F", "F"],
    "firstname": ["John", "Jane", "Alice"],
    "lastname": ["Smith", "Doe", "Smith"],
    "email": [
        "j.smith@gmail.com",
        "jane.doe@blah.com",
        "A.S.Smith@synth.io",
    ],
    "country": ["US", "GB", "FR"],
})

person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
        country="country",
    )
)

df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized

  title gender firstname   lastname                           email country
0    Mr      M     Peter     Harvey       peter.harvey14@davies.net      GB
1    Ms      F    Audrey  Couturier  audrey_couturier37@bonneau.com      FR
2    Mr      M    Joshua    Marquez   joshua.marquez77@mckinney.com      US
3    Mr      M    Ashley       Hall          ashley_hall@ingram.com      GB
4   Mrs      F   Melanie     Bryant       melaniebryant14@bruce.biz      GB

[5 rows x 6 columns]

In the generated data, the names and emails correspond to the countries specified in the dataset. The distribution of countries in the original dataset will be preserved in the synthetic data.

Example 2: Using the `locales` argument

In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using the locales argument. The dataset contains columns for title, gender, firstname, lastname, and email. We will use the locales argument to generate names and emails that correspond to the specified locales.

import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer

orig_df = pd.DataFrame({
    "title": ["Mr", "Ms", "Mrs"],
    "gender": ["M", "F", "F"],
    "firstname": ["John", "Jane", "Alice"],
    "lastname": ["Smith", "Doe", "Smith"],
    "email": [
        "j.smith@gmail.com",
        "jane.doe@blah.com",
        "A.S.Smith@synth.io",
    ],
})

person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
    ),
    locales=["ru_RU", "ja_JP", "it_IT"]
)

df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized

  title gender  firstname lastname                           email
0    Mr      M    Леонтий  Рыбаков  леонтий.рыбаков1@yakusheva.biz
1    Ms      F   Serafina    Balbi      serafina_balbi30@sagese.it
2    Mr      M         稔      山口                稔.山口76@kato.jp
3    Mr      M  Francesco      Foa        francesco_foa@collodi.eu
4    Mr      M  Pierluigi    Roero    pierluigi.roero12@bertoni.it

[5 rows x 5 columns]

In the generated data, the names and emails correspond to the Russian, Japanese, and Italian locales given in the locales argument. The proportion of each locale will be approximately equal.

Address

Similarly, an Address annotation allows Synthesized to generate fake address details. Currently, only UK addresses can be generated.

from synthesized.metadata.value import Address
from synthesized.config import AddressLabels

The columns of a dataset that relate to the attributes of an address are specified using AddressLabels.

address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        county='county',
        city='city',
        district='district',
        street='street_name',
        house_number='house_number'
    )
)

df_meta = MetaExtractor.extract(df=data, annotations=[address])

AddressModel

AddressModel models addresses. It uses Address meta, which represents columns with different address labels, such as city, house_number, postcode, full_address, etc., to capture all the information needed to recreate similar synthetic data.

AddressModelConfig can also be provided as a part of the initialization. AddressModelConfig contains information, such as whether or not an address file is provided, or if the postcodes need to be learned for address synthesis.

AddressModel uses PostcodeModel to learn and synthesize the addresses. If an address file is provided, the addresses corresponding to the learned postcodes are sampled from the file. If an address file is not provided, Faker is used to generate addresses.

AddressModel class has a member variable 'postcode_level' which provides the flexibility to use a partial or full postcode for fitting and sampling.

E.g. for postcode "EC2A 2DP":

postcode_level=0 will signify "EC"

postcode_level=1 will signify "EC2A"

postcode_level=2 will signify "EC2A 2DP"
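The three levels can be reproduced with a few lines of string handling. This is an illustrative sketch, not the AddressModel implementation: postcode_prefix is a hypothetical helper, and it assumes a UK postcode written with a single space between the outward and inward parts.

```python
import re


def postcode_prefix(postcode: str, level: int) -> str:
    """Return the part of a UK postcode selected by `level` (0, 1 or 2)."""
    outward, _, inward = postcode.strip().upper().partition(" ")
    if level == 0:
        # Area: the leading letters of the outward code, e.g. "EC".
        match = re.match(r"[A-Z]+", outward)
        if match is None:
            raise ValueError(f"Postcode does not start with letters: {postcode!r}")
        return match.group(0)
    if level == 1:
        # Outward code, e.g. "EC2A".
        return outward
    if level == 2:
        # Full postcode, e.g. "EC2A 2DP".
        return f"{outward} {inward}".strip()
    raise ValueError(f"Unsupported level: {level}")


for level in (0, 1, 2):
    print(level, postcode_prefix("EC2A 2DP", level))
```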

Without address file

import pandas as pd
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel

config = AddressModelConfig(addresses_file=None, learn_postcodes=False)
df = pd.DataFrame({
    'postcode': ["" for _ in range(10)],
    'street': ["" for _ in range(10)],
    'full_address': ["" for _ in range(10)],
    'city': ["" for _ in range(10)]
})

annotations = [Address(
    name='Address',
    nan_freq=0.3,
    labels=AddressLabels(
        postcode='postcode', city='city',
        street='street', full_address='full_address'
    )
)]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)

With address file

import pandas as pd
from faker import Faker
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel

address_file_path = 'data/addresses.jsonl.gz'
config = AddressModelConfig(addresses_file=address_file_path, learn_postcodes=True)
fkr = Faker('en_GB')
df = pd.DataFrame({
    'postcode': [fkr.postcode() for _ in range(10)],
    'street': [fkr.street_name() for _ in range(10)],
    'full_address': [fkr.address() for _ in range(10)],
    'city': [fkr.city() for _ in range(10)]
})

annotations = [Address(name='Address', nan_freq=0.3,
                labels=AddressLabels(postcode='postcode', city='city',
                                    street='street', full_address='full_address'))]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)

Bank

Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.

from synthesized.metadata.value import Bank
from synthesized.config import BankLabels

The columns of a dataset that relate to the bank account attributes are specified using BankLabels.

bank = Bank(
    name='bank',
    labels=BankLabels(
        sort_code='sort_code',
        account='account_number'
    )
)
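To show the shape of the values involved, the sketch below produces zero-padded 6-digit sort codes and 8-digit account numbers with the standard library. It is not the generator Synthesized uses, which may apply further formatting or validity rules; fake_bank_details is a hypothetical helper, and the keys reuse the column names from the annotation above.

```python
import random


def fake_bank_details(rng):
    """Return one random 6-digit sort code and 8-digit account number as strings."""
    return {
        # Leading zeros are kept, so the values are strings rather than integers.
        "sort_code": f"{rng.randrange(10**6):06d}",
        "account_number": f"{rng.randrange(10**8):08d}",
    }


rng = random.Random(7)
rows = [fake_bank_details(rng) for _ in range(3)]
for row in rows:
    print(row)
```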

Company

Defining a Company annotation allows Synthesized to generate fake company entities.

Below are some examples of how to use the Company annotation.

We start with a simple data frame (the company names can be anything, as long as they are strings):

import pandas as pd

df = pd.DataFrame({
    'company_name': ['Apple Inc.', 'Microsoft Corporation', 'Google LLC'],
    'employee_count': [10000, 50000, 100000],
})
df

company_name	employee_count
Apple Inc.	10000
Microsoft Corporation	50000
Google LLC	100000

First we specify a locale to generate the company names for. The default locale is en_GB.

from synthesized import HighDimSynthesizer
from synthesized.metadata.value import Company
from synthesized.config import CompanyLabels

# Set the locale of the company annotation to `locales`.
locales = ["en_GB"]

ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name'),
    locales=locales
)

# Create the synthesizer and synthesize.
synthesizer = HighDimSynthesizer.from_df(df, annotations=[ann])
synthesizer.fit(df)
synthesizer.sample(3)

company_name	employee_count
Thomson LLC	450
Harris Group	800000
Malder Ltd.	100000

It is also possible to specify multiple locales for the company annotation. This can be done by passing a list of locales to the locales parameter of the Company annotation.

locales = ["en_GB", "de_DE", "fr_FR"]

In order to generate company names along with their countries, we can use the country label when creating the annotation. Note that the original dataset must also contain the column named by the country label ("countries" in the example below), even if the values are all "".

ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name', country='countries'),
    locales=locales
)

company_name	employee_count	countries
Thomson LLC	450	United Kingdom
Sager GmbH	800000	Germany
Poirier SA	100000	France

FormattedString

A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers or customer account numbers that have a specific format.

from synthesized.metadata.value.categorical import FormattedString

The FormattedString is defined by passing the respective column name, and a regex pattern:

regex = r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$"
social_security = FormattedString(name="social_security_number", pattern=regex)

df_meta = MetaExtractor.extract(df=data, annotations=[social_security])
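Before using a pattern in an annotation, it can help to confirm what it accepts and rejects. The sketch below tests the social security pattern above with Python's re module only, without Synthesized; is_valid_ssn and the candidate strings are illustrative.

```python
import re

SSN_PATTERN = re.compile(r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$")


def is_valid_ssn(value: str) -> bool:
    """Return True if `value` matches the social security number pattern."""
    return SSN_PATTERN.match(value) is not None


candidates = [
    "123-45-6789",  # accepted
    "666-45-6789",  # rejected: first group is 666
    "000-45-6789",  # rejected: first group is 000
    "900-45-6789",  # rejected: first group starts with 9
    "123-00-6789",  # rejected: middle group is 00
    "123-45-0000",  # rejected: last group is 0000
]
for value in candidates:
    print(value, is_valid_ssn(value))
```

The negative lookaheads exclude specific groups, so strings generated from the pattern should never contain them.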