Entity Annotation
Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently recognize fields that contain PII automatically, so by default it will generate the original data from such fields.
Tabular datasets often contain fields that, when combined, describe a specific entity, such as a unique person or postal address. For example, consider the dataset below, which contains customer PII:
title | first_name | last_name | gender | amount |
---|---|---|---|---|
Mr | John | Doe | Male | 101.2 |
Mrs | Jane | Smith | Female | 28.2 |
Dr | Albert | Taylor | Male | 98.1 |
Ms | Alice | Smart | Female | 150.3 |
The combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person in this data, and there are strict relationships between these attributes. For example, when "title" is "Mrs" or "Ms", "first_name" will most likely contain a name given to females.
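The relationship between title and gender can be made concrete with a small check. The sketch below is illustrative only and does not use Synthesized: the `TITLE_TO_GENDER` mapping and the `inconsistent_rows` helper are hypothetical names introduced here, and the rows are copied from the table above.

```python
# Hypothetical mapping from gendered titles to the gender they imply.
# "Dr" is left out on purpose because it carries no gender information.
TITLE_TO_GENDER = {"Mr": "Male", "Mrs": "Female", "Ms": "Female"}

# Rows copied from the example table.
rows = [
    {"title": "Mr", "first_name": "John", "last_name": "Doe", "gender": "Male"},
    {"title": "Mrs", "first_name": "Jane", "last_name": "Smith", "gender": "Female"},
    {"title": "Dr", "first_name": "Albert", "last_name": "Taylor", "gender": "Male"},
    {"title": "Ms", "first_name": "Alice", "last_name": "Smart", "gender": "Female"},
]


def inconsistent_rows(records):
    """Return records whose title implies a different gender than the gender field."""
    bad = []
    for record in records:
        implied = TITLE_TO_GENDER.get(record["title"])
        if implied is not None and implied != record["gender"]:
            bad.append(record)
    return bad


# The original rows are all internally consistent.
print(inconsistent_rows(rows))  # prints []

# Reassigning titles independently of the other columns, as naive
# per-column generation might, breaks the relationship for some rows.
mixed = [dict(record, title=title) for record, title in zip(rows, ["Mrs", "Mr", "Dr", "Ms"])]
print(len(inconsistent_rows(mixed)))  # prints 2
```

Generating each column independently can produce rows like the two flagged here, which is the problem an entity annotation is meant to avoid.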
When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
Currently, Synthesized can handle the following entities:
- Person. Labels such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number can be annotated and generated.
- Address. Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated.
- Bank Account. Labels such as bic, sort_code, account can be annotated and generated.
- Company. Labels such as full_name, name, country, suffix, locales can be annotated and generated.
- Formatted String. For more flexible string generation, a column can be annotated as a formatted string and given a pattern (in the form of a regular expression); the software will then generate random strings based on that regular expression. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows that pattern.
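To see what "follows that pattern" means, the example string can be checked against the regular expression with Python's standard `re` module. This is only a validation sketch; `follows_pattern` is a hypothetical helper and not part of Synthesized.

```python
import re

# The pattern quoted in the Formatted String description above.
PATTERN = re.compile(r"[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}")


def follows_pattern(value: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return PATTERN.fullmatch(value) is not None


print(follows_pattern("KWNF-971-K20X8B"))  # True
print(follows_pattern("kwnf-971-K20X8B"))  # False: lowercase letters are not allowed
print(follows_pattern("KWNF-97-K20X8B"))   # False: the middle group needs three digits
```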
Person
Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation. The Person model will intelligently handle the generation of the fields provided, ensuring consistency across gender, language, and country for a given row in the synthetic dataset.
The columns of a dataset that relate to the attributes of a person are specified using PersonLabels. This is used to define the Person attributes that synthesized will generate. The PersonLabels can contain the following attributes:
- title: Name of column containing titles (e.g. "Mr", "Mrs").
- gender: Name of column containing genders (e.g. Male, Female, Non-binary).
- fullname: Name of column containing full names.
- firstname: Name of column containing first names.
- lastname: Name of column containing last names.
- email: Name of column containing email addresses.
- username: Name of column containing usernames.
- password: Name of column containing passwords.
- mobile_number: Name of column containing mobile telephone numbers.
- home_number: Name of column containing house telephone numbers.
- work_number: Name of column containing work telephone numbers.
- country: Name of column containing country names or country codes (e.g. Spain or ES).
from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[person])
It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names, and then passed together in the list of annotations.
Locale/Language and Country
The Person annotation can be configured to generate names and emails that correspond to a specific language or country. This can be done in two ways:
- By setting the locales argument in the Person annotation. This can be a single locale (e.g. "en_GB") or a list of locales to sample from (e.g. ["en_GB", "de_DE", "fr_FR"]).
- By using a country column in the dataset containing country names or country codes (e.g. "Spain" or "ES"). The country label in the PersonLabels class must be set to the name of this column.
When a locale is set, the generated names and emails will correspond to that locale. If a country label is set, the generated names and emails will correspond to the country specified in the column.
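As a rough illustration of how a country value relates to locales, the sketch below resolves a country name or code using a three-country excerpt of the minimal table given later in this section. It is not Synthesized's implementation: `CODE_TO_LOCALES`, `NAME_TO_CODE` and `locales_for_country` are hypothetical names, and treating codes as case-sensitive is an assumption made here for simplicity.

```python
# Excerpt of the minimal supported-country table (ES, BE and GB only).
CODE_TO_LOCALES = {
    "ES": ["es_ES"],
    "BE": ["nl_BE", "fr_BE"],
    "GB": ["en_GB"],
}

# Country names are stored in lower case because capitalization is ignored.
NAME_TO_CODE = {
    "spain": "ES",
    "kingdom of spain": "ES",
    "belgium": "BE",
    "kingdom of belgium": "BE",
    "uk": "GB",
    "united kingdom": "GB",
}


def locales_for_country(value: str) -> list:
    """Return the locales for a country code or name, or [] if it is unknown."""
    text = value.strip()
    if text in CODE_TO_LOCALES:  # assumption: codes are matched exactly
        return CODE_TO_LOCALES[text]
    code = NAME_TO_CODE.get(text.lower())
    return CODE_TO_LOCALES.get(code, [])


print(locales_for_country("ES"))       # ['es_ES']
print(locales_for_country("Spain"))    # ['es_ES']
print(locales_for_country("Belgium"))  # ['nl_BE', 'fr_BE']
print(locales_for_country("Narnia"))   # []
```

A country such as Belgium maps to more than one locale, so a row labelled "BE" could plausibly receive either Dutch-style or French-style names.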
For the Person annotation, the available locales will depend on the Python version and associated installed packages. To list the supported country names, country codes and locales, the following utility functions can be used:
from synthesized.model.models.person import all_supported_country_codes
all_supported_country_codes()
# Dictionary mapping country codes to locales
{'AR': ['es_AR'],
'AM': ['hy_AM'],
'AT': ['de_AT'],
'AZ': ['az_AZ'],
'BE': ['nl_BE', 'fr_BE'],...
}
from synthesized.model.models.person import all_supported_country_names
all_supported_country_names()
# Dictionary mapping country names to associated locales
{'argentina': ['es_AR'],
'armenia': ['hy_AM'],
'austria': ['de_AT'],
'azerbaijan': ['az_AZ'],
'belgium': ['nl_BE', 'fr_BE'],...
}
from synthesized.model.models.person import all_supported_locales
all_supported_locales()
# List of all supported locales
['ar_AA',
'ar_PS',
'ar_SA',
'az_AZ',
'bg_BG',...
]
A minimal set of country names, codes, and locales that are available in all versions is given below.
Minimal Supported Country Names, Codes and Locales
Country Code | Country Names (capitalization ignored) | Associated Locales |
---|---|---|
AR | argentina & argentine republic | es_AR |
AM | republic of armenia & armenia | hy_AM |
AT | republic of austria & austria | de_AT |
AZ | republic of azerbaijan & azerbaijan | az_AZ |
BE | kingdom of belgium & belgium | nl_BE, fr_BE |
BD | people’s republic of bangladesh & bangladesh | bn_BD |
BG | bulgaria & republic of bulgaria | bg_BG |
BR | brazil & federative republic of brazil | pt_BR |
CA | canada | fr_CA |
CH | switzerland & swiss confederation | de_CH, fr_CH |
CL | republic of chile & chile | es_CL |
CN | people’s republic of china & china | zh_CN |
CO | republic of colombia & colombia | es_CO |
CZ | czech republic & czechia | cs_CZ |
DE | federal republic of germany & germany | de_DE |
DK | denmark & kingdom of denmark | da_DK |
ES | spain & kingdom of spain | es_ES |
EE | republic of estonia & estonia | et_EE |
FI | finland & republic of finland | fi_FI |
FR | france & french republic | fr_FR |
GB | united kingdom of great britain and northern ireland & uk & united kingdom | en_GB |
GE | georgia | ka_GE |
GR | greece & hellenic republic | el_GR |
HR | croatia & republic of croatia | hr_HR |
HU | hungary | hu_HU |
ID | republic of indonesia & indonesia | id_ID |
IN | republic of india & india | hi_IN, en_IN |
IE | ireland | en_IE, ga_IE |
IR | islamic republic of iran & iran & iran, islamic republic of | fa_IR |
IL | israel & state of israel | he_IL |
IT | italian republic & italy | it_IT |
JP | japan | ja_JP |
KR | south korea & korea, republic of | ko_KR |
LT | republic of lithuania & lithuania | lt_LT |
LV | republic of latvia & latvia | lv_LV |
MX | united mexican states & mexico | es_MX |
NL | kingdom of the netherlands & netherlands | nl_NL |
NO | kingdom of norway & norway | no_NO |
NP | nepal & federal democratic republic of nepal | ne_NP |
NZ | new zealand | en_NZ |
PL | poland & republic of poland | pl_PL |
PT | portuguese republic & portugal | pt_PT |
PS | the state of palestine & palestine, state of | ar_PS |
RO | romania | ro_RO |
RU | russian federation & russia | ru_RU |
SA | kingdom of saudi arabia & saudi arabia | ar_SA |
SI | republic of slovenia & slovenia | sl_SI |
SE | sweden & kingdom of sweden | sv_SE |
TH | kingdom of thailand & thailand | th_TH |
TR | turkey & republic of türkiye & türkiye | tr_TR |
UA | ukraine | uk_UA |
US | united states & united states of america | en_US |
Titles, genders and phone number attributes are currently not affected by the locale or country settings.
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(...)
df_synthesized = synthesizer.synthesize(num_rows=...)
There are 4 special configuration cases for this model that should be considered:
- The gender label is present: the distribution of genders will be learnt from this column.
- No gender label is present but a title label is present: the distribution of genders will be inferred from the title column.
- Both gender and title labels are present: the distribution will be learnt from the gender column and values in the title column will align with this.
- Neither gender nor title labels are present: the gender distribution will be assumed to be equal between male and female.
Example 1: Using the country label
In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using a country label. The dataset contains columns for title, gender, firstname, lastname, email, and country. We will use the country label to generate names and emails that correspond to the country specified in the dataset.
import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer
orig_df = pd.DataFrame({
    "title": ["Mr", "Ms", "Mrs"],
    "gender": ["M", "F", "F"],
    "firstname": ["John", "Jane", "Alice"],
    "lastname": ["Smith", "Doe", "Smith"],
    "email": [
        "j.smith@gmail.com",
        "jane.doe@blah.com",
        "A.S.Smith@synth.io",
    ],
    "country": ["US", "GB", "FR"],
})
# Link the PII columns together, including the country column.
person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
        country="country",
    )
)
df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
# Five rows, matching the output shown below.
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized
  title gender firstname   lastname                           email country
0    Mr      M     Peter     Harvey       peter.harvey14@davies.net      GB
1    Ms      F    Audrey  Couturier  audrey_couturier37@bonneau.com      FR
2    Mr      M    Joshua    Marquez   joshua.marquez77@mckinney.com      US
3    Mr      M    Ashley       Hall          ashley_hall@ingram.com      GB
4   Mrs      F   Melanie     Bryant       melaniebryant14@bruce.biz      GB
[5 rows x 6 columns]
In the generated data, the names and emails correspond to the countries specified in the dataset. The distribution of countries in the original dataset will be preserved in the synthetic data.
Example 2: Using the locales argument
In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using the locales argument. The dataset contains columns for title, gender, firstname, lastname, and email. We will use the locales argument to generate names and emails that correspond to the specified locales.
import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer
orig_df = pd.DataFrame({
    "title": ["Mr", "Ms", "Mrs"],
    "gender": ["M", "F", "F"],
    "firstname": ["John", "Jane", "Alice"],
    "lastname": ["Smith", "Doe", "Smith"],
    "email": [
        "j.smith@gmail.com",
        "jane.doe@blah.com",
        "A.S.Smith@synth.io",
    ],
})
person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
    ),
    locales=["ru_RU", "ja_JP", "it_IT"]
)
df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized
  title gender  firstname lastname                           email
0    Mr      M    Леонтий  Рыбаков  леонтий.рыбаков1@yakusheva.biz
1    Ms      F   Serafina    Balbi      serafina_balbi30@sagese.it
2    Mr      M         稔       山口                稔.山口76@kato.jp
3    Mr      M  Francesco      Foa        francesco_foa@collodi.eu
4    Mr      M  Pierluigi    Roero    pierluigi.roero12@bertoni.it
[5 rows x 5 columns]
In the generated data, the names and emails correspond to the specified locales. The proportion of each locale will be approximately equal. As can be seen in the generated data, we have names and emails that correspond to the Russian, Japanese, and Italian locales specified in the locales argument.
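To illustrate what "approximately equal" proportions look like, the sketch below draws a locale uniformly at random for each row and reports the resulting shares. This is only a stand-in using Python's `random` module; how Synthesized actually samples locales is not described here, and the row count and seed are arbitrary choices.

```python
import random
from collections import Counter

locales = ["ru_RU", "ja_JP", "it_IT"]
num_rows = 3000

# Seeded generator so the sketch is repeatable.
rng = random.Random(0)

# Assumption for illustration: each row picks one locale uniformly at random.
counts = Counter(rng.choice(locales) for _ in range(num_rows))
shares = {locale: counts[locale] / num_rows for locale in locales}

for locale in locales:
    print(locale, round(shares[locale], 3))
```

With only a handful of rows, as in the five-row output above, the observed shares can deviate noticeably from one third each.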
Address
Similarly, an Address annotation allows Synthesized to generate fake address details. Currently, only UK addresses can be generated.
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels
The columns of a dataset that relate to the attributes of an address are specified using AddressLabels.
address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        county='county',
        city='city',
        district='district',
        street='street_name',
        house_number='house_number'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[address])
AddressModel
AddressModel models addresses. It uses Address meta, which represents columns with different address labels, such as city, house_number, postcode, full_address, etc., to capture all the information needed to recreate similar synthetic data.
AddressModelConfig can also be provided as part of the initialization. AddressModelConfig contains information such as whether or not an address file is provided, or whether the postcodes need to be learned for address synthesis.
E.g. for postcode "EC2A 2DP":
import pandas as pd
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel
# No address file is provided and postcodes are not learned from the data.
config = AddressModelConfig(addresses_file=None, learn_postcodes=False)
# Placeholder frame with empty address columns.
df = pd.DataFrame({
    'postcode': ["" for _ in range(10)],
    'street': ["" for _ in range(10)],
    'full_address': ["" for _ in range(10)],
    'city': ["" for _ in range(10)]
})
annotations = [Address(
    name='Address',
    nan_freq=0.3,
    labels=AddressLabels(
        postcode='postcode', city='city',
        street='street', full_address='full_address'
    )
)]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)
import pandas as pd
from faker import Faker
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel
# An address file is provided and postcodes are learned from the data.
address_file_path = 'data/addresses.jsonl.gz'
config = AddressModelConfig(addresses_file=address_file_path, learn_postcodes=True)
# Build an example frame of UK-style addresses with Faker.
fkr = Faker('en_GB')
df = pd.DataFrame({
    'postcode': [fkr.postcode() for _ in range(10)],
    'street': [fkr.street_name() for _ in range(10)],
    'full_address': [fkr.address() for _ in range(10)],
    'city': [fkr.city() for _ in range(10)]
})
annotations = [Address(
    name='Address',
    nan_freq=0.3,
    labels=AddressLabels(
        postcode='postcode', city='city',
        street='street', full_address='full_address'
    )
)]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)
Bank
Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.
from synthesized.metadata.value import Bank
from synthesized.config import BankLabels
The columns of a dataset that relate to the bank account attributes are specified using BankLabels.
bank = Bank(
    name='bank',
    labels=BankLabels(
        sort_code='sort_code',
        account='account_number'
    )
)
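For reference, the two formats mentioned above (8-digit account numbers and 6-digit sort codes) can be sketched with the standard library alone. This is not how Synthesized generates bank details; `fake_bank_details` is a hypothetical helper that only shows the shape of the values to expect.

```python
import random
import re

DIGITS = "0123456789"


def fake_bank_details(rng: random.Random) -> dict:
    """Return a random 6-digit sort code and 8-digit account number as strings."""
    sort_code = "".join(rng.choice(DIGITS) for _ in range(6))
    account_number = "".join(rng.choice(DIGITS) for _ in range(8))
    return {"sort_code": sort_code, "account_number": account_number}


rng = random.Random(42)
details = fake_bank_details(rng)
print(details)

# Both values are kept as strings so that leading zeros are preserved.
print(bool(re.fullmatch(r"\d{6}", details["sort_code"])))       # True
print(bool(re.fullmatch(r"\d{8}", details["account_number"])))  # True
```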
Company
Defining a Company annotation allows Synthesized to generate fake company entities.
Below are some examples of how to use the Company annotation.
We start with a simple data frame (the company names can be anything, as long as they are strings):
import pandas as pd
df = pd.DataFrame({
    'company_name': ['Apple Inc.', 'Microsoft Corporation', 'Google LLC'],
    'employee_count': [10000, 50000, 100000],
})
df
company_name | employee_count |
---|---|
Apple Inc. | 10000 |
Microsoft Corporation | 50000 |
Google LLC | 100000 |
First we specify a locale to generate the company names for. The default locale is en_GB.
from synthesized import HighDimSynthesizer
from synthesized.metadata.value import Company
from synthesized.config import CompanyLabels
# Set the locale of the company annotation to `locales`.
locales = ["en_GB"]
ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name'),
    locales=locales
)
# Create the synthesizer and synthesize.
synthesizer = HighDimSynthesizer.from_df(df, annotations=[ann])
synthesizer.fit(df)
synthesizer.sample(3)
company_name | employee_count |
---|---|
Thomson LLC | 450 |
Harris Group | 800000 |
Malder Ltd. | 100000 |
It is also possible to specify multiple locales for the company annotation. This can be done by passing a list of locales to the locales argument.
To generate company names along with their countries, use the country label when creating the annotation. Note that the original dataset must also contain the column named in the country label ("countries" in the example below), even if its values are all empty strings ("").
ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name', country='countries'),
    locales=locales
)
company_name | employee_count | countries |
---|---|---|
Thomson LLC | 450 | United Kingdom |
Sager GmbH | 800000 | Germany |
Poirier SA | 100000 | France |
FormattedString
A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers, or customer account numbers that have a specific format.
from synthesized.metadata.value.categorical import FormattedString
The FormattedString is defined by passing the respective column name and a regex pattern:
regex = "^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$"
social_security = FormattedString(name="social_security_number", pattern=regex)
df_meta = MetaExtractor.extract(df=data, annotations=[social_security])
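It can help to check what a pattern accepts before using it in an annotation. The sketch below tests the social security regex above against a few hand-picked strings with Python's `re` module; the candidate values are made up for illustration and the check is independent of Synthesized.

```python
import re

# Same pattern as in the FormattedString example above.
regex = "^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$"
ssn = re.compile(regex)

candidates = [
    "123-45-6789",  # accepted
    "666-45-6789",  # rejected: first group may not be 666
    "900-45-6789",  # rejected: first group may not start with 9
    "123-00-6789",  # rejected: second group may not be 00
    "123-45-0000",  # rejected: last group may not be 0000
]

results = {value: ssn.match(value) is not None for value in candidates}
for value, accepted in results.items():
    print(value, accepted)
```

The negative lookaheads exclude specific digit groups, so any generated string that conforms to the pattern will avoid those values as well.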