Entity Annotation
Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently recognize fields that contain PII automatically, so by default the values generated for such fields are drawn from the original data.
Tabular datasets often contain fields that, when combined, describe a specific entity, such as a unique person or postal address. For example, consider the dataset below, which contains customer PII:
title | first_name | last_name | gender | amount |
---|---|---|---|---|
Mr | John | Doe | Male | 101.2 |
Mrs | Jane | Smith | Female | 28.2 |
Dr | Albert | Taylor | Male | 98.1 |
Ms | Alice | Smart | Female | 150.3 |
The combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person in this data, and there are strict relationships between these attributes. For example, when "title" is "Mrs" or "Ms", "first_name" will most likely contain a name given to females.
When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
Currently, Synthesized can handle the following entities:
- Person. Labels such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number can be annotated and generated.
- Address. Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated.
- Bank Account. Labels such as bic, sort_code, account can be annotated and generated.
- Company. Labels such as full_name, name, country, suffix, locales can be annotated and generated.
- Formatted String. For more flexible string generation, a column can be annotated as a formatted string and given a pattern (in the form of a regular expression); the software will then generate random strings based on that regular expression. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows that pattern.
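To make the Formatted String idea concrete, the following is a minimal standard-library sketch of generating a random string for the example pattern and checking it against that pattern. It does not use Synthesized; the names `SEGMENTS` and `random_formatted_string` are hypothetical, and the segment-by-segment approach is an assumption made for illustration, not the library's implementation.

```python
import random
import re
import string

# The example pattern from the text above.
PATTERN = r"[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}"

# Hand-written description of the same pattern: (alphabet, length) per segment.
# This only covers this one pattern; it is not a general regex-to-string generator.
SEGMENTS = [
    (string.ascii_uppercase, 4),
    (string.digits, 3),
    (string.ascii_uppercase + string.digits, 6),
]


def random_formatted_string(rng: random.Random) -> str:
    """Draw each segment independently and join the segments with hyphens."""
    parts = [
        "".join(rng.choice(alphabet) for _ in range(length))
        for alphabet, length in SEGMENTS
    ]
    return "-".join(parts)


sample = random_formatted_string(random.Random(0))
# Every generated string, like the documented example, matches the pattern.
assert re.fullmatch(PATTERN, sample) is not None
assert re.fullmatch(PATTERN, "KWNF-971-K20X8B") is not None
```

A seeded `random.Random` instance is used so that the sketch is reproducible.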
Person
Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation. The Person model will intelligently handle the generation of the fields provided, ensuring consistency across gender, language, and country for a given row in the synthetic dataset.
The columns of a dataset that relate to the attributes of a person are specified using PersonLabels. This is used to define the Person attributes that Synthesized will generate. PersonLabels can contain the following attributes:
- title: Name of column containing titles (e.g. "Mr", "Mrs").
- gender: Name of column containing genders (e.g. Male, Female, Non-binary).
- fullname: Name of column containing full names.
- firstname: Name of column containing first names.
- lastname: Name of column containing last names.
- email: Name of column containing email addresses.
- username: Name of column containing usernames.
- password: Name of column containing passwords.
- mobile_number: Name of column containing mobile telephone numbers.
- home_number: Name of column containing home telephone numbers.
- work_number: Name of column containing work telephone numbers.
- country: Name of column containing country names or country codes (e.g. Spain or ES).
from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[person])
It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names and then passed together in the list of annotations.
Locale/Language and Country
The Person annotation can be configured to generate names and emails that correspond to a specific language or country. This can be done in two ways:
- By setting the locales argument in the Person annotation. This can be a single locale (e.g. "en_GB") or a list of locales to sample from (e.g. ["en_GB", "de_DE", "fr_FR"]).
- By using a country column in the dataset containing country names or country codes (e.g. "Spain" or "ES"). The country label in the PersonLabels class must be set to label this column.
When a locale is set, the generated names and emails will correspond to that locale. If a country label is set, the generated names and emails will correspond to the country specified in the column.
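The two inputs described above can be pictured with a small standard-library sketch. It is not Synthesized code: the helpers `locales_from_country` and `normalise_locales` are hypothetical, the lookup tables contain only a three-country excerpt of the mappings documented in this section, and raising `ValueError` for an unknown country is an assumption.

```python
from typing import Dict, List, Sequence, Union

# Excerpt of the documented country-code and country-name mappings.
CODE_TO_LOCALES: Dict[str, List[str]] = {
    "ES": ["es_ES"],
    "GB": ["en_GB"],
    "BE": ["nl_BE", "fr_BE"],
}
NAME_TO_LOCALES: Dict[str, List[str]] = {
    "spain": ["es_ES"],
    "united kingdom": ["en_GB"],
    "belgium": ["nl_BE", "fr_BE"],
}


def locales_from_country(value: str) -> List[str]:
    """Map a country code or a country name (capitalization ignored) to locales."""
    key = value.strip()
    if key.upper() in CODE_TO_LOCALES:
        return CODE_TO_LOCALES[key.upper()]
    if key.lower() in NAME_TO_LOCALES:
        return NAME_TO_LOCALES[key.lower()]
    raise ValueError(f"Unsupported country: {value!r}")


def normalise_locales(locales: Union[str, Sequence[str]]) -> List[str]:
    """Accept a single locale or a list of locales and always return a list."""
    return [locales] if isinstance(locales, str) else list(locales)


assert locales_from_country("Spain") == ["es_ES"]
assert locales_from_country("ES") == ["es_ES"]
assert normalise_locales("en_GB") == ["en_GB"]
```

A country with several associated locales, such as Belgium, yields more than one candidate locale for the row.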
For the Person annotation, the available locales depend on the Python version and the associated installed packages. To list the supported country names, country codes and locales, the following utility functions can be used:
from synthesized.model.models.person import all_supported_country_codes
all_supported_country_codes()
# Dictionary mapping country codes to locales
{'AR': ['es_AR'],
'AM': ['hy_AM'],
'AT': ['de_AT'],
'AZ': ['az_AZ'],
'BE': ['nl_BE', 'fr_BE'],...
}
from synthesized.model.models.person import all_supported_country_names
all_supported_country_names()
# Dictionary mapping country names to associated locales
{'argentina': ['es_AR'],
'armenia': ['hy_AM'],
'austria': ['de_AT'],
'azerbaijan': ['az_AZ'],
'belgium': ['nl_BE', 'fr_BE'],...
}
from synthesized.model.models.person import all_supported_locales
all_supported_locales()
# List of all supported locales
['ar_AA',
'ar_PS',
'ar_SA',
'az_AZ',
'bg_BG',...
]
A minimal set of country names, codes, and locales that is available in the Person annotation in all versions is given below.
Minimal Supported Country Names, Codes and Locales
Country Code | Country Names (capitalization ignored) | Associated Locales |
---|---|---|
AR | argentina, argentine republic | es_AR |
AM | armenia, republic of armenia | hy_AM |
AT | austria, republic of austria | de_AT |
AZ | azerbaijan, republic of azerbaijan | az_AZ |
BE | belgium, kingdom of belgium | nl_BE, fr_BE |
BD | bangladesh, people’s republic of bangladesh | bn_BD |
BG | bulgaria, republic of bulgaria | bg_BG |
BR | brazil, federative republic of brazil | pt_BR |
CA | canada | fr_CA |
CH | switzerland, swiss confederation | de_CH, fr_CH |
CL | chile, republic of chile | es_CL |
CN | china, people’s republic of china | zh_CN |
CO | colombia, republic of colombia | es_CO |
CZ | czechia, czech republic | cs_CZ |
DE | germany, federal republic of germany | de_DE |
DK | denmark, kingdom of denmark | da_DK |
ES | spain, kingdom of spain | es_ES |
EE | estonia, republic of estonia | et_EE |
FI | finland, republic of finland | fi_FI |
FR | france, french republic | fr_FR |
GB | united kingdom, uk, united kingdom of great britain and northern ireland | en_GB |
GE | georgia | ka_GE |
GR | greece, hellenic republic | el_GR |
HR | croatia, republic of croatia | hr_HR |
HU | hungary | hu_HU |
ID | indonesia, republic of indonesia | id_ID |
IN | india, republic of india | hi_IN, en_IN |
IE | ireland | en_IE, ga_IE |
IR | iran, islamic republic of, iran, islamic republic of iran | fa_IR |
IL | israel, state of israel | he_IL |
IT | italy, italian republic | it_IT |
JP | japan | ja_JP |
KR | korea, republic of, south korea | ko_KR |
LT | lithuania, republic of lithuania | lt_LT |
LV | latvia, republic of latvia | lv_LV |
MX | mexico, united mexican states | es_MX |
NL | netherlands, kingdom of the netherlands | nl_NL |
NO | norway, kingdom of norway | no_NO |
NP | nepal, federal democratic republic of nepal | ne_NP |
NZ | new zealand | en_NZ |
PL | poland, republic of poland | pl_PL |
PT | portugal, portuguese republic | pt_PT |
PS | palestine, state of, the state of palestine | ar_PS |
RO | romania | ro_RO |
RU | russian federation, russia | ru_RU |
SA | saudi arabia, kingdom of saudi arabia | ar_SA |
SI | slovenia, republic of slovenia | sl_SI |
SE | sweden, kingdom of sweden | sv_SE |
TH | thailand, kingdom of thailand | th_TH |
TR | türkiye, turkey, republic of türkiye | tr_TR |
UA | ukraine | uk_UA |
US | united states, united states of america | en_US |
Titles, genders and phone number attributes are currently not affected by the locale or country settings.
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(...)
df_synthesized = synthesizer.synthesize(num_rows=...)
There are 4 special configuration cases for this model that should be considered:
- The gender label is present: the distribution of genders will be learnt from this column.
- No gender label is present but the title label is present: the distribution of genders will be inferred from the titles column.
- Both gender and title labels are present: the distribution will be learnt from the gender column and values in the title column will align with this.
- Neither gender nor title is present: the gender distribution will be assumed to be equal between male and female.
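The four cases can be summarised as a small decision function. The sketch below is purely illustrative and does not reflect how the Person model is implemented: `gender_distribution` and the `TITLE_TO_GENDER` mapping are hypothetical, and skipping titles that carry no gender information (such as "Dr") is an assumption.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

# Assumed mapping from title to gender, for illustration only.
TITLE_TO_GENDER = {"Mr": "Male", "Mrs": "Female", "Ms": "Female", "Miss": "Female"}


def _normalise(counts: Counter) -> Dict[str, float]:
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def gender_distribution(
    genders: Optional[Sequence[str]] = None,
    titles: Optional[Sequence[str]] = None,
) -> Dict[str, float]:
    """Pick the source of the gender distribution following the four cases."""
    if genders:
        # Gender label present (with or without a title label): learn from it.
        return _normalise(Counter(genders))
    if titles:
        # Only the title label is present: infer genders from the titles.
        inferred = [TITLE_TO_GENDER[t] for t in titles if t in TITLE_TO_GENDER]
        if inferred:
            return _normalise(Counter(inferred))
    # Neither label is present (or nothing could be inferred): equal split.
    return {"Male": 0.5, "Female": 0.5}


assert gender_distribution(genders=["M", "F", "F", "F"]) == {"M": 0.25, "F": 0.75}
assert gender_distribution() == {"Male": 0.5, "Female": 0.5}
```

When both labels are present, the gender column takes priority, which matches the third case above.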
Example 1: Using the country label
In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using a country label. The dataset contains columns for title, gender, firstname, lastname, email, and country. We will use the country label to generate names and emails that correspond to the country specified in the dataset.
import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer
orig_df = pd.DataFrame({
    "title": ["Mr", "Ms", "Mrs"],
    "gender": ["M", "F", "F"],
    "firstname": ["John", "Jane", "Alice"],
    "lastname": ["Smith", "Doe", "Smith"],
    "email": [
        "j.smith@gmail.com",
        "jane.doe@blah.com",
        "A.S.Smith@synth.io",
    ],
    "country": ["US", "GB", "FR"],
})
person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
        country="country",
    )
)
df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized
index | title | gender | firstname | lastname | email | country |
---|---|---|---|---|---|---|
0 | Mr | M | Peter | Harvey | peter.harvey14@davies.net | GB |
1 | Ms | F | Audrey | Couturier | audrey_couturier37@bonneau.com | FR |
2 | Mr | M | Joshua | Marquez | joshua.marquez77@mckinney.com | US |
3 | Mr | M | Ashley | Hall | ashley_hall@ingram.com | GB |
4 | Mrs | F | Melanie | Bryant | melaniebryant14@bruce.biz | GB |
[5 rows x 6 columns]
In the generated data, the names and emails correspond to the countries specified in the dataset. The distribution of countries in the original dataset will be preserved in the synthetic data.
Example 2: Using the locales argument
In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using the locales argument. The dataset contains columns for title, gender, firstname, lastname, and email. We will use the locales argument to generate names and emails that correspond to the specified locales.
import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer
orig_df = pd.DataFrame({
    "title": ["Mr", "Ms", "Mrs"],
    "gender": ["M", "F", "F"],
    "firstname": ["John", "Jane", "Alice"],
    "lastname": ["Smith", "Doe", "Smith"],
    "email": [
        "j.smith@gmail.com",
        "jane.doe@blah.com",
        "A.S.Smith@synth.io",
    ],
})
person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
    ),
    locales=["ru_RU", "ja_JP", "it_IT"]
)
df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized
index | title | gender | firstname | lastname | email |
---|---|---|---|---|---|
0 | Mr | M | Леонтий | Рыбаков | леонтий.рыбаков1@yakusheva.biz |
1 | Ms | F | Serafina | Balbi | serafina_balbi30@sagese.it |
2 | Mr | M | 稔 | 山口 | 稔.山口76@kato.jp |
3 | Mr | M | Francesco | Foa | francesco_foa@collodi.eu |
4 | Mr | M | Pierluigi | Roero | pierluigi.roero12@bertoni.it |
[5 rows x 5 columns]
In the generated data, the names and emails correspond to the specified locales. The proportion of each locale will be approximately equal. As can be seen in the generated data, we have names and emails that correspond to the Russian, Japanese, and Italian locales specified in the locales argument.
Address
Similarly, an Address annotation allows Synthesized to generate fake address details. Using the annotation, the model will intelligently handle the generation of the address, ensuring consistency across the address fields for a given row in the synthetic dataset.
The columns of a dataset that relate to the attributes of an address are specified using the AddressLabels class. This is used to define the address attributes that Synthesized will generate. AddressLabels can contain the following attributes:
- postcode: Name of column containing postcodes.
- country: Name of column containing country names or country codes.
- city: Name of column containing city names.
- state: Name of column containing state or county names.
- street: Name of column containing street names.
- house_number: Name of column containing house numbers.
- flat: Name of column containing flat numbers.
- full_address: Name of column containing full addresses, each a combination of all the address fields.
The Address annotation can be configured to generate addresses that correspond to specific locales. This can be done by setting the locales argument in the Address annotation.
- locales: (default None) The locales to use for generating addresses. The locales argument can be a single locale (e.g. "en_GB") or a list of locales to sample from (e.g. ["en_GB", "de_DE", "fr_FR"]). Note that when the AddressLabels contains a country label, the locales argument will be ignored and the country specified in the dataset will be used to generate addresses.
The list of currently supported locales can be found by calling the provided utility method:
from synthesized.model.models.address import all_supported_locales
all_supported_locales()
['az_AZ', 'cs_CZ', 'da_DK', 'de_AT', 'de_CH', 'de_DE', ... ]
A minimal set of country names, codes, and locales that is available in the Address annotation in all versions is given below.
Minimal Supported Country Names, Codes and Locales
Country Code | Country Names (capitalization ignored) | Associated Locales |
---|---|---|
AR | argentina, argentine republic | es_AR |
AM | armenia, republic of armenia | hy_AM |
AU | australia | en_AU |
AT | austria, republic of austria | de_AT |
AZ | azerbaijan, republic of azerbaijan | az_AZ |
BE | belgium, kingdom of belgium | nl_BE |
BD | bangladesh, people’s republic of bangladesh | bn_BD |
BR | brazil, federative republic of brazil | pt_BR |
CA | canada | en_CA, fr_CA |
CH | switzerland, swiss confederation | de_CH, fr_CH |
CL | chile, republic of chile | es_CL |
CN | china, people’s republic of china | zh_CN |
CO | colombia, republic of colombia | es_CO |
CZ | czechia, czech republic | cs_CZ |
DE | germany, federal republic of germany | de_DE |
DK | denmark, kingdom of denmark | da_DK |
ES | spain, kingdom of spain | es_ES |
FI | finland, republic of finland | fi_FI |
FR | france, french republic | fr_FR |
GB | united kingdom, uk, united kingdom of great britain and northern ireland | en_GB |
GE | georgia | ka_GE |
GR | greece, hellenic republic | el_GR |
HR | croatia, republic of croatia | hr_HR |
HU | hungary | hu_HU |
ID | indonesia, republic of indonesia | id_ID |
IN | india, republic of india | hi_IN, en_IN |
IE | ireland | en_IE |
IR | iran, islamic republic of, iran, islamic republic of iran | fa_IR |
IL | israel, state of israel | he_IL |
IT | italy, italian republic | it_IT |
JP | japan | ja_JP |
KR | korea, republic of, south korea | ko_KR |
MX | mexico, united mexican states | es_MX |
NL | netherlands, kingdom of the netherlands | nl_NL |
NO | norway, kingdom of norway | no_NO |
NP | nepal, federal democratic republic of nepal | ne_NP |
NZ | new zealand | en_NZ |
PH | philippines, republic of the philippines | en_PH, fil_PH |
PL | poland, republic of poland | pl_PL |
PT | portugal, portuguese republic | pt_PT |
RO | romania | ro_RO |
RU | russian federation, russia | ru_RU |
SK | slovakia, slovak republic | sk_SK |
SI | slovenia, republic of slovenia | sl_SI |
SE | sweden, kingdom of sweden | sv_SE |
TH | thailand, kingdom of thailand | th_TH |
UA | ukraine | uk_UA |
US | united states, united states of america | en_US |
The following demonstrates how to create an Address annotation with the AddressLabels and locales argument. The Address annotation is then passed to the MetaExtractor as an annotation.
from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels
address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        city='city',
        state='state',
        street='street',
        house_number='house_number',
        flat='flat'
    ),
    locales=["en_GB"]
)
df_meta = MetaExtractor.extract(df=data, annotations=[address])
It is possible to define multiple Address annotations if a dataset contains PII columns for more than one address. These must be created as separate Address objects with unique names and then passed together in the list of annotations.
Configuration
The internal operation of the Address model can be configured through the HighDimConfig class. The following arguments can be set:
- sample_addresses: (default False) If set to False, the Address model will generate new synthetic addresses. If set to True, the Address model will sample addresses from the original data.
- learn_postcodes: (default False) If set to False, the Address model will generate addresses without learning the distribution of postcodes. If set to True, the Address model will learn the distribution of postcodes from the dataset and use this information to generate realistic addresses that are consistent with the postcode geolocation in the original data.
- postcode_level: (default 0) The level of postcode to use for modelling and sampling addresses (explained in more detail in the Postcode Level section). The postcode level can be set to 0, 1, or 2. A postcode level of 0 will use the first part of the postcode to model address geolocation, a postcode level of 1 will use the first two parts of the postcode, and a postcode level of 2 will use the full postcode.
The following example demonstrates how to use the HighDimConfig class to configure the Address model with a set of parameters.
from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    sample_addresses=False,
    learn_postcodes=True,
    postcode_level=1
)
synth = HighDimSynthesizer(df_meta=df_meta, config=config)
Depending on the configuration settings, the Address model will operate in different modes. The following sections describe the different modes of operation in more detail.
Sampling Addresses
The sample_addresses configuration setting determines whether the Address model samples addresses from the original data or generates new synthetic addresses. The behaviour of the Address model is as follows:
- If sample_addresses is set to False (the default behaviour), the Address model will generate new synthetic addresses. The country or countries to sample the addresses from can be set using the locales argument of the Address annotation, or by specifying a country label in the AddressLabels. If neither the locales argument nor the country label is set, the default locale of en_GB will be used to generate addresses.
- If sample_addresses is set to True, the addresses will be sampled from the original data.
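The rules above amount to a short precedence order, sketched below in plain Python. This is not Synthesized code: `address_generation_plan` and the returned mode names are hypothetical, chosen only to illustrate which setting wins (sampling first, then a country label, then the locale setting, then the en_GB default).

```python
from typing import List, Optional, Sequence, Tuple, Union

DEFAULT_LOCALE = "en_GB"  # default locale stated in the text above


def address_generation_plan(
    sample_addresses: bool = False,
    has_country_label: bool = False,
    locales: Optional[Union[str, Sequence[str]]] = None,
) -> Tuple[str, List[str]]:
    """Return (mode, locales) describing where generated addresses come from."""
    if sample_addresses:
        # Addresses are sampled from the original data; locales play no role.
        return ("sample_original", [])
    if has_country_label:
        # A country label takes priority over any locale setting.
        return ("country_column", [])
    if locales is None:
        return ("locales", [DEFAULT_LOCALE])
    if isinstance(locales, str):
        return ("locales", [locales])
    return ("locales", list(locales))


assert address_generation_plan() == ("locales", ["en_GB"])
assert address_generation_plan(locales=["en_GB", "es_MX"]) == ("locales", ["en_GB", "es_MX"])
assert address_generation_plan(has_country_label=True, locales="de_DE") == ("country_column", [])
```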
Learning postcodes
When sample_addresses is set to False, the Address model will generate synthetic addresses. Under this setting the model can learn the distribution of postcodes from the dataset using the learn_postcodes configuration setting. The behaviour is as follows:
- If learn_postcodes is set to False (the default behaviour), the Address model will generate synthetic addresses without learning the distribution of postcodes. This means that the generated addresses will not be consistent with the postcode geolocation in the original data.
- If learn_postcodes is set to True, the Address model will learn the distribution of postcodes present in the original dataset and generate realistic addresses that follow the distribution. The level at which the postcode is used for modelling and sampling addresses can be set using the postcode_level configuration setting, which is explained in more detail in the next section.
It’s worth noting that when learning postcodes only a subset of the locales/countries are supported. The list of supported locales for postcode learning can be found by calling the provided utility method:
from synthesized.model.models.address import all_supported_postcode_locales
all_supported_postcode_locales()
['az_AZ', 'cs_CZ', 'da_DK', 'de_AT', 'de_CH', 'de_DE', ... ]
Postcode Level
The postcode_level configuration setting allows the Address model to use different levels of the postcode for modelling and sampling addresses. The postcode level can be set to 0, 1, or 2. The behaviour is as follows:
- If postcode_level is set to 0, the Address model will use the first part of the postcode to model address geolocation. This means that the generated addresses will be geographically consistent with the original data at the level of the first part of the postcode, which typically represents a large area such as a state or region. Because of this, the address label state will be kept consistent with the postcode if it is present.
- If postcode_level is set to 1, the Address model will use the first two parts of the postcode to model address geolocation. This means that the generated addresses will be geographically consistent with the original data at the level of the first two parts of the postcode, which typically represents a smaller area such as a city. Because of this, the address labels state and city will be kept consistent with the postcode if they are present.
- If postcode_level is set to 2, the Address model will use the full postcode to model address geolocation. This means that the generated addresses will be geographically consistent with the original data at the level of the full postcode, which represents a specific location such as a street. Because of this, the address labels state, city and street will be kept consistent with the postcode if they are present.
The following example demonstrates how the postcode level translates to a UK postcode. For the postcode "EC2A 2DP":
- postcode_level=0 will model to the level "EC"
- postcode_level=1 will model to the level "EC2A"
- postcode_level=2 will model to the level "EC2A 2DP"
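The mapping from level to postcode prefix in the UK example can be reproduced with a few lines of standard-library Python. The helper `uk_postcode_prefix` is hypothetical and only mirrors the example above; it assumes a UK postcode written with a space between its two halves and says nothing about how Synthesized parses postcodes for other countries.

```python
import re

# The leading letters of the first half of a UK postcode (e.g. "EC" in "EC2A").
_AREA = re.compile(r"^[A-Z]{1,2}")


def uk_postcode_prefix(postcode: str, level: int) -> str:
    """Return the part of a UK postcode that a given postcode level keeps."""
    parts = postcode.strip().upper().split()
    if not parts:
        raise ValueError("empty postcode")
    first_half = parts[0]
    if level == 0:
        match = _AREA.match(first_half)
        if match is None:
            raise ValueError(f"not a UK postcode: {postcode!r}")
        return match.group(0)
    if level == 1:
        return first_half
    if level == 2:
        return " ".join(parts)
    raise ValueError("level must be 0, 1 or 2")


assert uk_postcode_prefix("EC2A 2DP", 0) == "EC"
assert uk_postcode_prefix("EC2A 2DP", 1) == "EC2A"
assert uk_postcode_prefix("EC2A 2DP", 2) == "EC2A 2DP"
```

Grouping original rows by such a prefix is one way to picture what "geographically consistent at a given level" means: synthetic rows share the prefix, while the remainder of the postcode is newly generated.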
Through the use of these different levels, you can generate addresses that are geographically consistent with the original data at different levels of granularity. The choice of postcode level may be influenced by the level of detail you want to preserve in the synthetic data and the level of privacy you want to maintain. Higher values of postcode_level will result in more geographically accurate addresses but may also increase the risk of re-identification.
Example 1: Using the locales argument and no postcode learning
In this example, we will generate synthetic addresses that correspond to the specified locales using the locales argument.
The input dataset is as follows:
index | postcode | city | street | county |
---|---|---|---|---|
0 | B38 4EY | Lyndaland | Kate coves | Glamorgan |
1 | 89638-7516 | San Nancy los altos | Prolongación Montenegro | Sinaloa |
2 | 42024-5681 | San Irma los altos | Privada Durango | Morelos |
3 | B1S 6DN | Barbarahaven | Jones causeway | Merseyside |
4 | 45590 | Roseview | Parks Corners | New Mexico |
5 | 40272 | Woodsborough | Carrie Ranch | Utah |
6 | W88 2RY | South Darrenfort | Hill mall | Hampshire |
7 | 75988 | Lake Steven | James Fords | Michigan |
8 | 53212-7399 | Nueva Libia | Circunvalación República Unida de Tanzanía | Querétaro |
9 | E6T 4GL | New Richardbury | Foster groves | Warwickshire |
[10 rows x 4 columns]
The dataset contains columns for postcode, city, street and county. We will use the Address annotation and AddressLabels to associate the address fields together. The dataset also contains addresses from the UK, the US and Mexico. We will use the locales argument to generate addresses that correspond to these countries.
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels
address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        city='city',
        street='street',
        state='county'
    ),
    locales=["en_GB", "es_MX", "en_US"]
)
We will then use the MetaExtractor class to extract the metadata from the original data and pass the Address annotation to the annotations parameter.
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(df=data, annotations=[address])
Finally, we will use the HighDimSynthesizer class to generate synthetic data. We will use the configuration option sample_addresses=False to generate new synthetic addresses.
from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    sample_addresses=False,
)
synthesizer = HighDimSynthesizer(
    df_meta=df_meta,
    config=config
)
synthesizer.learn(data)
The trained model can then be used to generate the new synthetic addresses.
df_synthesized = synthesizer.synthesize(num_rows=10)
df_synthesized
index | postcode | city | street | county |
---|---|---|---|---|
0 | B38 4EY | Lyndaland | Kate coves | Glamorgan |
1 | 89638-7516 | San Nancy los altos | Prolongación Montenegro | Sinaloa |
2 | 42024-5681 | San Irma los altos | Privada Durango | Morelos |
3 | B1S 6DN | Barbarahaven | Jones causeway | Merseyside |
4 | 45590 | Roseview | Parks Corners | New Mexico |
5 | 40272 | Woodsborough | Carrie Ranch | Utah |
6 | W88 2RY | South Darrenfort | Hill mall | Hampshire |
7 | 75988 | Lake Steven | James Fords | Michigan |
8 | 53212-7399 | Nueva Libia | Circunvalación República Unida de Tanzanía | Querétaro |
9 | E6T 4GL | New Richardbury | Foster groves | Warwickshire |
[10 rows x 4 columns]
As can be seen in the generated data, the address components are consistent with the specified locales.
Example 2: Using the country label and no postcode learning
In this example, we will generate synthetic addresses that are related to the original data at the country level using the country label.
We will use the following dataset as the original data:
index | country | postcode | city | street | county |
---|---|---|---|---|---|
0 | UK | EN1W 8QZ | South Charlotte | Graham union | Buckinghamshire |
1 | USA | 31258 | Greenburgh | Peter Spur | New Hampshire |
2 | UK | RG08 9QP | South Dawn | Sharp drives | Cardiganshire |
3 | Mexico | 66471 | San Julia los bajos | Continuación Sur Henríquez | Distrito Federal |
4 | Mexico | 64913-3915 | Nueva Singapur | Calle Zacatecas | Morelos |
5 | USA | 40102 | Johnport | Stephen Plains | Missouri |
6 | USA | 96353 | Villegashaven | Bennett Stream | Montana |
7 | USA | 72485 | Lake Jennifermouth | Thompson Stravenue | Arizona |
8 | UK | N6E 3SJ | Campbellberg | Victoria coves | Cardiganshire |
9 | Mexico | 64079-8506 | San María Luisa de la Montaña | Callejón Sur Baeza | Michoacán de Ocampo |
[10 rows x 5 columns]
The dataset contains the columns country, postcode, city, street and county. We will use the Address annotation and AddressLabels to associate the address fields together.
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels
address = Address(
    name='address',
    labels=AddressLabels(
        country='country',
        postcode='postcode',
        city='city',
        street='street',
        state='county'
    )
)
We will then use the MetaExtractor class to extract the metadata from the original data and pass the Address annotation to the annotations parameter.
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(df=data, annotations=[address])
Finally, we will use the HighDimSynthesizer class to generate synthetic data. We will use the configuration options sample_addresses=False and learn_postcodes=False to generate new synthetic addresses that are related to the original data at the country level.
from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    sample_addresses=False,
    learn_postcodes=False,
)
synthesizer = HighDimSynthesizer(
    df_meta=df_meta,
    config=config
)
synthesizer.learn(data)
The trained model can then be used to generate the new synthetic addresses.
df_synthesized = synthesizer.synthesize(num_rows=10)
df_synthesized
index | country | postcode | city | street | county |
---|---|---|---|---|---|
0 | Mexico | 62154 | San Ruby los bajos | Retorno República Popular Democrática de Corea | Baja California Sur |
1 | UK | M7 9PT | Wilsonburgh | Brown trafficway | Surrey |
2 | Mexico | 05695 | Vieja República de Macedonia del Norte | Diagonal Bahrein | Zacatecas |
3 | Mexico | 63580 | San Conchita de la Montaña | Pasaje Veracruz de Ignacio de la Llave | Durango |
4 | Mexico | 63339 | San Claudio de la Montaña | Peatonal Veracruz de Ignacio de la Llave | Sinaloa |
5 | Mexico | 86445 | Vieja Gabón | Corredor Sur Raya | Sinaloa |
6 | Mexico | 61841 | Nueva Lituania | Calzada Camboya | Tlaxcala |
7 | UK | DD75 5BX | Port Vanessachester | Bishop pass | Orkney Islands |
8 | UK | E92 5TX | Port Seanborough | Walker drives | Inverclyde |
9 | UK | SG4V 3JB | Woodland | Dean turnpike | Gloucestershire |
[10 rows x 5 columns]
As can be seen in the generated data, the addresses are related to the original data at the country level. For the remaining components of the address, new consistent synthetic values have been generated. The distribution of countries in the original data has been preserved in the synthetic data.
Example 3: Using the country label and learning postcodes
In this example we will use the same dataset as in the previous example, but this time we will generate synthetic addresses that are related to the original data at the postcode level. We will use the learn_postcodes=True and postcode_level=1 configuration settings to learn the distribution of postcodes from the dataset and generate addresses that are geographically consistent with the original data at the level of the first two parts of the postcode.
The input dataset:
index | country | postcode | city | street | county |
---|---|---|---|---|---|
0 | UK | EN1W 8QZ | South Charlotte | Graham union | Buckinghamshire |
1 | USA | 31258 | Greenburgh | Peter Spur | New Hampshire |
2 | UK | RG08 9QP | South Dawn | Sharp drives | Cardiganshire |
3 | Mexico | 66471 | San Julia los bajos | Continuación Sur Henríquez | Distrito Federal |
4 | Mexico | 64913-3915 | Nueva Singapur | Calle Zacatecas | Morelos |
5 | USA | 40102 | Johnport | Stephen Plains | Missouri |
6 | USA | 96353 | Villegashaven | Bennett Stream | Montana |
7 | USA | 72485 | Lake Jennifermouth | Thompson Stravenue | Arizona |
8 | UK | N6E 3SJ | Campbellberg | Victoria coves | Cardiganshire |
9 | Mexico | 64079-8506 | San María Luisa de la Montaña | Callejón Sur Baeza | Michoacán de Ocampo |
[10 rows x 5 columns]
Again we create the Address annotation and AddressLabels to associate the address fields together.
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels
address = Address(
    name='address',
    labels=AddressLabels(
        country='country',
        postcode='postcode',
        city='city',
        street='street',
        state='county'
    )
)
We then extract the metadata from the original data and pass the Address annotation to the annotations parameter.
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(df=data, annotations=[address])
Finally, we use the HighDimSynthesizer class to generate synthetic data. We will use the configuration options sample_addresses=False, learn_postcodes=True and postcode_level=1 to generate new synthetic addresses that are related to the original data at the specified postcode level.
from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    sample_addresses=False,
    learn_postcodes=True,
    postcode_level=1
)
synthesizer = HighDimSynthesizer(
    df_meta=df_meta,
    config=config
)
synthesizer.learn(data)
The trained model can then be used to generate the new synthetic addresses.
df_synthesized = synthesizer.synthesize(num_rows=10)
df_synthesized
index | country | postcode | city | street | county |
---|---|---|---|---|---|
0 | UK | N6E 1JE | Campbellberg | Conor estate | Cardiganshire |
1 | UK | N6E 0HF | Campbellberg | Lee walk | Cardiganshire |
2 | Mexico | 64907 | Nueva Singapur | Calle Sur Vigil | Morelos |
3 | UK | N6E 5BT | Campbellberg | Natalie lane | Cardiganshire |
4 | UK | EN1W 7RZ | South Charlotte | Mills squares | Buckinghamshire |
5 | USA | 72482 | Lake Jennifermouth | Donna Ford | Arizona |
6 | Mexico | 64075 | San María Luisa de la Montaña | Calle Olmos | Michoacán de Ocampo |
7 | USA | 72407 | Lake Jennifermouth | Johnson Via | Arizona |
8 | USA | 40181 | Johnport | Scott Bypass | Missouri |
9 | UK | EN1W 4RL | South Charlotte | Holmes mill | Buckinghamshire |
[10 rows x 5 columns]
As can be seen in the generated data, the addresses are related to the original data at the postcode level. The first two parts of each postcode are preserved in the synthetic data, but the third part is generated. The fields county and city are also preserved and associated with the postcode. However, new values for the street field have been generated.
Bank
Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.
from synthesized.metadata.value import Bank
from synthesized.config import BankLabels
The columns of a dataset that relate to the bank account attributes are specified using BankLabels.
bank = Bank(
    name='bank',
    labels=BankLabels(
        sort_code='sort_code',
        account='account_number'
    )
)
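As a rough picture of the value shapes mentioned above (8-digit account numbers and 6-digit sort codes), the following standard-library sketch draws random digit strings of those lengths. It is not how Synthesized generates bank details; `fake_bank_details` is a hypothetical helper, and representing both values as plain digit strings without separators is an assumption.

```python
import random
from typing import Dict


def random_digits(rng: random.Random, length: int) -> str:
    """Return a string of `length` random decimal digits (leading zeros allowed)."""
    return "".join(rng.choice("0123456789") for _ in range(length))


def fake_bank_details(rng: random.Random) -> Dict[str, str]:
    """Build one fake record with a 6-digit sort code and an 8-digit account number."""
    return {
        "sort_code": random_digits(rng, 6),
        "account_number": random_digits(rng, 8),
    }


record = fake_bank_details(random.Random(42))
assert len(record["sort_code"]) == 6 and record["sort_code"].isdigit()
assert len(record["account_number"]) == 8 and record["account_number"].isdigit()
```

Values are kept as strings so that leading zeros are not lost, which would happen if they were stored as integers.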
Company
Defining a Company annotation allows Synthesized to generate fake company entities.
Below are some examples of how to use the Company annotation.
We start with a simple data frame (the company names can be anything, as long as they are strings):
import pandas as pd
df = pd.DataFrame({
    'company_name': ['Apple Inc.', 'Microsoft Corporation', 'Google LLC'],
    'employee_count': [10000, 50000, 100000],
})
df
company_name | employee_count |
---|---|
Apple Inc. | 10000 |
Microsoft Corporation | 50000 |
Google LLC | 100000 |
First we specify a locale to generate the company names for. The default locale is en_GB.
from synthesized import HighDimSynthesizer
from synthesized.metadata.value import Company
from synthesized.config import CompanyLabels
# Set the locale of the company annotation to `locales`.
locales = ["en_GB"]
ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name'),
    locales=locales
)
# Create the synthesizer and synthesize.
synthesizer = HighDimSynthesizer.from_df(df, annotations=[ann])
synthesizer.fit(df)
synthesizer.sample(3)
company_name | employee_count |
---|---|
Thomson LLC | 450 |
Harris Group | 800000 |
Malder Ltd. | 100000 |
It is also possible to specify multiple locales for the company annotation. This can be done by passing a list of locales to the locales argument.
In order to generate company names along with their countries, we can use the country label when creating the annotation. Note that the original dataset must also contain the column named in the country label ("countries" in this example), even if the values are all "".
ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name', country='countries'),
    locales=locales
)
company_name | employee_count | countries |
---|---|---|
Thomson LLC | 450 | United Kingdom |
Sager GmbH | 800000 | Germany |
Poirier SA | 100000 | France |
FormattedString
A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers or customer account numbers that have a specific format.
from synthesized import MetaExtractor
from synthesized.metadata.value.categorical import FormattedString
The FormattedString is defined by passing the respective column name and a regex pattern:
regex = "^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$"
social_security = FormattedString(name="social_security_number", pattern=regex)
df_meta = MetaExtractor.extract(df=data, annotations=[social_security])
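Before passing a pattern to an annotation, it can help to check what the pattern accepts and rejects. The sketch below does this for the social security pattern above using only Python's `re` module; `is_valid_ssn` is a hypothetical helper and the sample values are made up for illustration.

```python
import re

# Same pattern as above, written as a raw string so the backslashes are not doubled.
SSN_PATTERN = re.compile(r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$")


def is_valid_ssn(value: str) -> bool:
    """Return True if the value conforms to the social security pattern."""
    return SSN_PATTERN.match(value) is not None


# A well-formed value is accepted.
assert is_valid_ssn("123-45-6789")
# The negative lookaheads reject reserved first groups (666, 000, 9xx) ...
assert not is_valid_ssn("666-45-6789")
assert not is_valid_ssn("000-45-6789")
assert not is_valid_ssn("912-45-6789")
# ... an all-zero middle group, and an all-zero final group.
assert not is_valid_ssn("123-00-6789")
assert not is_valid_ssn("123-45-0000")
```

The same kind of check can be run over a generated column to confirm that every synthetic value follows the intended format.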