Entity Annotation

Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently automatically recognize fields that contain PII, and therefore the default behaviour will be to generate the original data from such fields.

Tabular datasets often contain fields that, when combined, describe a specific entity, such as a unique person or postal address. For example, consider the dataset below, which contains customer PII:

title	first_name	last_name	gender	email	amount
Mr	John	Doe	Male	john.doe@gmail.com	101.2
Mrs	Jane	Smith	Female	jane.smith@gmail.com	28.2
Dr	Albert	Taylor	Male	albert.taylor@aol.com	98.1
Ms	Alice	Smart	Female	alice.smart@hotmail.com	150.3

The combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person in this data, and there are strict relationships between these attributes. For example, when "title" is "Mrs" or "Ms", "first_name" will most likely contain a name given to females.

When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
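To make the idea of linked attributes concrete, the following is a minimal sketch of a consistency check that could be run on original or synthetic rows. It is not part of Synthesized; the TITLE_GENDER mapping and the inconsistent_rows helper are hypothetical names introduced only for illustration, and the mapping would need extending for real data.

```python
# Hypothetical mapping from gendered titles to the gender they imply.
# Neutral titles such as "Dr" are deliberately left out.
TITLE_GENDER = {"Mr": "Male", "Mrs": "Female", "Ms": "Female"}


def inconsistent_rows(rows):
    """Return the rows whose title implies a different gender than the gender field."""
    bad = []
    for row in rows:
        expected = TITLE_GENDER.get(row["title"])  # None for neutral titles
        if expected is not None and expected != row["gender"]:
            bad.append(row)
    return bad


rows = [
    {"title": "Mr", "first_name": "John", "gender": "Male"},
    {"title": "Mrs", "first_name": "Jane", "gender": "Female"},
    {"title": "Dr", "first_name": "Albert", "gender": "Male"},
    {"title": "Ms", "first_name": "Alice", "gender": "Male"},  # deliberately inconsistent
]

for row in inconsistent_rows(rows):
    print("Inconsistent title/gender:", row)
```

Independently generated columns tend to produce rows like the last one; annotating the fields as a single entity is intended to avoid them.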

Currently, Synthesized can handle the following entities:

Person. Labels such as title, gender, full_name, first_name, last_name, email, username, password, mobile_number, home_number, work_number can be annotated and generated.
Address. Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated.
Bank Account. Labels such as bic, sort_code, account can be annotated and generated.
Company. Labels such as full_name, name, country, suffix, locales can be annotated and generated.
Formatted String. For a more flexible string generation, one can annotate a column as a formatted string, give it a pattern (in form of regular expression), and the software will generate random strings based on the given regular expression. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows that pattern.
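As an illustration of what "follows that pattern" means for a Formatted String, the sketch below uses only the Python standard library to check strings against the example pattern and to build one matching string by hand. The random_code helper is hypothetical and specific to this one pattern; it is not how Synthesized generates strings.

```python
import random
import re
import string

PATTERN = r"[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}"


def random_code(rng):
    """Build one string for this specific pattern (not a general regex generator)."""
    letters = string.ascii_uppercase
    digits = string.digits
    part1 = "".join(rng.choice(letters) for _ in range(4))
    part2 = "".join(rng.choice(digits) for _ in range(3))
    part3 = "".join(rng.choice(letters + digits) for _ in range(6))
    return f"{part1}-{part2}-{part3}"


rng = random.Random(0)  # seeded so repeated runs give the same string
sample = random_code(rng)

# re.fullmatch requires the whole string to match, not just a prefix.
print(sample, bool(re.fullmatch(PATTERN, sample)))
print("KWNF-971-K20X8B", bool(re.fullmatch(PATTERN, "KWNF-971-K20X8B")))
print("kwnf-971-k20x8b", bool(re.fullmatch(PATTERN, "kwnf-971-k20x8b")))
```

A check like this can be used to validate a pattern against known good values before using it in an annotation.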

Person

Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation. The Person model will intelligently handle the generation of the fields provided, ensuring consistency across gender, language, and country for a given row in the synthetic dataset.

The columns of a dataset that relate to the attributes of a person are specified using PersonLabels, which defines the Person attributes that Synthesized will generate. PersonLabels can contain the following attributes:

title: Name of column containing title (e.g. "Mr", "Mrs").
gender: Name of column containing genders (e.g. Male, Female, Non-binary).
fullname: Name of column containing full names.
firstname: Name of column containing first names.
lastname: Name of column containing last names.
email: Name of column containing email addresses.
username: Name of column containing usernames.
password: Name of column containing passwords.
mobile_number: Name of column containing mobile telephone numbers.
home_number: Name of column containing house telephone numbers.
work_number: Name of column containing work telephone numbers.
country: Name of column containing country names or country codes (e.g. Spain or ES).

from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels

person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)

df_meta = MetaExtractor.extract(df=data, annotations=[person])

It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names, and then passed to the list of annotations, e.g.

MetaExtractor.extract(df=…, annotations=[person_1, person_2])

Locale/Language and Country

The Person annotation can be configured to generate names and emails that correspond to a specific language or country. This can be done in two ways:

By setting the locales argument in the Person annotation. This can be a single locale (e.g. "en_GB") or a list of locales to sample from (e.g. ["en_GB", "de_DE", "fr_FR"]).
Using a country column in the dataset containing country names or country codes (e.g. "Spain" or "ES"). The country label in the PersonLabels class must be set to label this column.

When a locale is set, the generated names and emails will correspond to that locale. If a country label is set, the generated names and emails will correspond to the country specified in the column.

The country label and the locales argument are mutually exclusive and cannot be used together. Trying to use both will result in an error.

For the Person annotation, the available locales depend on the Python version and the associated installed packages. To list the supported country names, country codes and locales, the following utility functions can be used:

from synthesized.model.models.person import all_supported_country_codes
all_supported_country_codes()

# Dictionary mapping country codes to locales
{'AR': ['es_AR'],
 'AM': ['hy_AM'],
 'AT': ['de_AT'],
 'AZ': ['az_AZ'],
 'BE': ['nl_BE', 'fr_BE'],...
 }

from synthesized.model.models.person import all_supported_country_names
all_supported_country_names()

# Dictionary mapping country names to associated locales
{'argentina': ['es_AR'],
 'armenia': ['hy_AM'],
 'austria': ['de_AT'],
 'azerbaijan': ['az_AZ'],
 'belgium': ['nl_BE', 'fr_BE'],...
}

from synthesized.model.models.person import all_supported_locales
all_supported_locales()

# List of all supported locales
['ar_AA',
 'ar_PS',
 'ar_SA',
 'az_AZ',
 'bg_BG',...
]

A minimal set of country names, codes, and locales that are available in the Person annotation in all versions is given below:

Minimal Supported Country Names, Codes and Locales

Country Code	Country Names (capitalization ignored)	Associated Locales
AR	argentina, argentine republic	es_AR
AM	armenia, republic of armenia	hy_AM
AT	austria, republic of austria	de_AT
AZ	azerbaijan, republic of azerbaijan	az_AZ
BE	belgium, kingdom of belgium	nl_BE, fr_BE
BD	bangladesh, people’s republic of bangladesh	bn_BD
BG	bulgaria, republic of bulgaria	bg_BG
BR	brazil, federative republic of brazil	pt_BR
CA	canada	fr_CA
CH	switzerland, swiss confederation	de_CH, fr_CH
CL	chile, republic of chile	es_CL
CN	china, people’s republic of china	zh_CN
CO	colombia, republic of colombia	es_CO
CZ	czechia, czech republic	cs_CZ
DE	germany, federal republic of germany	de_DE
DK	denmark, kingdom of denmark	da_DK
ES	spain, kingdom of spain	es_ES
EE	estonia, republic of estonia	et_EE
FI	finland, republic of finland	fi_FI
FR	france, french republic	fr_FR
GB	united kingdom, uk, united kingdom of great britain and northern ireland	en_GB
GE	georgia	ka_GE
GR	greece, hellenic republic	el_GR
HR	croatia, republic of croatia	hr_HR
HU	hungary	hu_HU
ID	indonesia, republic of indonesia	id_ID
IN	india, republic of india	hi_IN, en_IN
IE	ireland	en_IE, ga_IE
IR	iran, islamic republic of, iran, islamic republic of iran	fa_IR
IL	israel, state of israel	he_IL
IT	italy, italian republic	it_IT
JP	japan	ja_JP
KR	korea, republic of, south korea	ko_KR
LT	lithuania, republic of lithuania	lt_LT
LV	latvia, republic of latvia	lv_LV
MX	mexico, united mexican states	es_MX
NL	netherlands, kingdom of the netherlands	nl_NL
NO	norway, kingdom of norway	no_NO
NP	nepal, federal democratic republic of nepal	ne_NP
NZ	new zealand	en_NZ
PL	poland, republic of poland	pl_PL
PT	portugal, portuguese republic	pt_PT
PS	palestine, state of, the state of palestine	ar_PS
RO	romania	ro_RO
RU	russian federation, russia	ru_RU
SA	saudi arabia, kingdom of saudi arabia	ar_SA
SI	slovenia, republic of slovenia	sl_SI
SE	sweden, kingdom of sweden	sv_SE
TH	thailand, kingdom of thailand	th_TH
TR	türkiye, turkey, republic of türkiye	tr_TR
UA	ukraine	uk_UA
US	united states, united states of america	en_US
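The listing above notes that capitalization of country names is ignored. The sketch below shows one way such a case-insensitive lookup could work, using a handful of entries copied from the listing. The dictionary and the locales_for helper are hypothetical; in practice the full mapping for an installed version comes from all_supported_country_names(), and stripping surrounding whitespace is an assumption rather than documented behaviour.

```python
# A few entries copied from the listing above, keyed by lower-case country name.
COUNTRY_NAME_TO_LOCALES = {
    "argentina": ["es_AR"],
    "belgium": ["nl_BE", "fr_BE"],
    "united kingdom": ["en_GB"],
    "uk": ["en_GB"],
}


def locales_for(country_name):
    """Look up locales for a country name, ignoring case and surrounding spaces."""
    key = country_name.strip().casefold()
    return COUNTRY_NAME_TO_LOCALES.get(key, [])


print(locales_for("Belgium"))
print(locales_for("UK"))
print(locales_for("Atlantis"))  # unknown names give an empty list in this sketch
```

Running a lookup like this over the distinct values of a country column is a quick way to spot names that will not be recognized before training.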

Titles, genders and phone number attributes are currently not affected by the locale or country settings.

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(...)
df_synthesized = synthesizer.synthesize(num_rows=...)

There are 4 special configuration cases for this model that should be considered:

The gender label is present: the distribution of genders will be learnt from this column.
No gender label present but title label is present: The distribution for genders will be inferred from the titles column.
Both gender and title labels are present: The distribution will be learnt from the gender column and values in the title column will align with this.
Neither gender nor title are present: The gender distribution will be assumed to be split equally between male and female.
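The four cases above can be summarized as a small decision function. This is only a sketch of the documented rules, not the model's implementation; gender_source is a hypothetical helper that takes a plain dictionary of label names rather than a PersonLabels object.

```python
def gender_source(labels):
    """Describe where the gender distribution comes from for a set of Person labels.

    `labels` maps label names (e.g. "gender", "title") to column names;
    a missing key or a None value means the label is not set.
    """
    has_gender = labels.get("gender") is not None
    has_title = labels.get("title") is not None
    if has_gender and has_title:
        return "learnt from gender column; titles aligned to it"
    if has_gender:
        return "learnt from gender column"
    if has_title:
        return "inferred from title column"
    return "assumed equal male/female"


print(gender_source({"gender": "gender", "title": "title"}))
print(gender_source({"title": "title"}))
print(gender_source({"firstname": "first_name"}))
```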

Example 1: Using the `country` label

In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using a country label. The dataset contains columns for title, gender, firstname, lastname, email, and country. We will use the country label to generate names and emails that correspond to the country specified in the dataset.

import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer

orig_df = pd.DataFrame({
        "title": ["Mr", "Ms", "Mrs"],
        "gender": ["M", "F", "F"],
        "firstname": ["John", "Jane", "Alice"],
        "lastname": ["Smith", "Doe", "Smith"],
        "email": [
            "j.smith@gmail.com",
            "jane.doe@blah.com",
            "A.S.Smith@synth.io",
        ],
        "country": ["US", "GB", "FR"],
    })

person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
        country="country",
    )
)

df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized

  title gender firstname   lastname                           email country
0    Mr      M     Peter     Harvey       peter.harvey14@davies.net      GB
1    Ms      F    Audrey  Couturier  audrey_couturier37@bonneau.com      FR
2    Mr      M    Joshua    Marquez   joshua.marquez77@mckinney.com      US
3    Mr      M    Ashley       Hall          ashley_hall@ingram.com      GB
4   Mrs      F   Melanie     Bryant       melaniebryant14@bruce.biz      GB

[5 rows x 6 columns]

In the generated data, the names and emails correspond to the countries specified in the dataset. The distribution of countries in the original dataset will be preserved in the synthetic data.
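One way to check the claim about country proportions is to compare category frequencies in the original and synthetic country columns. The sketch below uses only the standard library and the country values from the rows shown above; proportions is a hypothetical helper, not a Synthesized function.

```python
from collections import Counter


def proportions(values):
    """Return each distinct value's share of the total, as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


original = ["US", "GB", "FR"]               # country column of orig_df
synthetic = ["GB", "FR", "US", "GB", "GB"]  # country column of the rows shown above

orig_p = proportions(original)
synth_p = proportions(synthetic)
for country in sorted(set(orig_p) | set(synth_p)):
    print(country, round(orig_p.get(country, 0.0), 2), round(synth_p.get(country, 0.0), 2))
```

With only a handful of rows the two sets of proportions can differ noticeably; the comparison becomes more meaningful as the number of original and generated rows grows.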

Example 2: Using the `locales` argument

In this example, we will generate consistent synthetic data for a dataset containing PII columns for individuals using the locales argument. The dataset contains columns for title, gender, firstname, lastname, and email. We will use the locales argument to generate names and emails that correspond to the specified locales.

import pandas as pd
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
from synthesized import MetaExtractor, HighDimSynthesizer

orig_df = pd.DataFrame({
        "title": ["Mr", "Ms", "Mrs"],
        "gender": ["M", "F", "F"],
        "firstname": ["John", "Jane", "Alice"],
        "lastname": ["Smith", "Doe", "Smith"],
        "email": [
            "j.smith@gmail.com",
            "jane.doe@blah.com",
            "A.S.Smith@synth.io",
        ],
    })

person = Person(
    name='person',
    labels=PersonLabels(
        title="title",
        gender="gender",
        firstname="firstname",
        lastname="lastname",
        email="email",
    ),
    locales=["ru_RU", "ja_JP", "it_IT"]
)

df_meta = MetaExtractor.extract(df=orig_df, annotations=[person])

synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(orig_df)
df_synthesized = synthesizer.synthesize(num_rows=5)
df_synthesized

  title gender  firstname lastname                           email
0    Mr      M    Леонтий  Рыбаков  леонтий.рыбаков1@yakusheva.biz
1    Ms      F   Serafina    Balbi      serafina_balbi30@sagese.it
2    Mr      M         稔      山口                稔.山口76@kato.jp
3    Mr      M  Francesco      Foa        francesco_foa@collodi.eu
4    Mr      M  Pierluigi    Roero    pierluigi.roero12@bertoni.it

[5 rows x 5 columns]

In the generated data, the names and emails correspond to the specified locales. The proportion of each locale will be approximately equal. As can be seen in the generated data we have names and emails that correspond to Russian, Japanese, and Italian locales as specified in the locales argument.

Address

Similarly, an Address annotation allows Synthesized to generate fake address details. With this annotation, the model will intelligently handle the generation of the address, ensuring consistency across the address fields for a given row in the synthetic dataset.

The columns of a dataset that relate to the attributes of an address are specified using the AddressLabels class, which defines the address attributes that Synthesized will generate. AddressLabels can contain the following attributes:

postcode: Name of column containing postcodes.
country: Name of column containing country names or country codes.
city: Name of column containing city names.
state: Name of column containing state or county names.
street: Name of column containing street names.
house_number: Name of column containing house numbers.
flat: Name of column containing flat numbers.
full_address: Name of column containing full addresses which is a combination of all the address fields.

The Address annotation can be configured to generate addresses that correspond to specific locales. This can be done by setting the locales argument in the Address annotation.

locales: (default None) The locales to use for generating addresses. The locales argument can be a single locale (e.g. "en_GB") or a list of locales to sample from (e.g. ["en_GB", "de_DE", "fr_FR"]). Note that when the AddressLabels contain a country label, the locales argument will be ignored and the country specified in the dataset will be used to generate addresses.

The list of currently supported locales can be found by calling the provided utility method:

from synthesized.model.models.address import all_supported_locales
all_supported_locales()

['az_AZ',
 'cs_CZ',
 'da_DK',
 'de_AT',
 'de_CH',
 'de_DE',
 ...
 ]

A minimal set of country names, codes, and locales that are available in the Address annotation in all versions is given below:

Minimal Supported Country Names, Codes and Locales

Country Code	Country Names (capitalization ignored)	Associated Locales
AR	argentina, argentine republic	es_AR
AM	armenia, republic of armenia	hy_AM
AU	australia	en_AU
AT	austria, republic of austria	de_AT
AZ	azerbaijan, republic of azerbaijan	az_AZ
BE	belgium, kingdom of belgium	nl_BE
BD	bangladesh, people’s republic of bangladesh	bn_BD
BR	brazil, federative republic of brazil	pt_BR
CA	canada	en_CA, fr_CA
CH	switzerland, swiss confederation	de_CH, fr_CH
CL	chile, republic of chile	es_CL
CN	china, people’s republic of china	zh_CN
CO	colombia, republic of colombia	es_CO
CZ	czechia, czech republic	cs_CZ
DE	germany, federal republic of germany	de_DE
DK	denmark, kingdom of denmark	da_DK
ES	spain, kingdom of spain	es_ES
FI	finland, republic of finland	fi_FI
FR	france, french republic	fr_FR
GB	united kingdom, uk, united kingdom of great britain and northern ireland	en_GB
GE	georgia	ka_GE
GR	greece, hellenic republic	el_GR
HR	croatia, republic of croatia	hr_HR
HU	hungary	hu_HU
ID	indonesia, republic of indonesia	id_ID
IN	india, republic of india	hi_IN, en_IN
IE	ireland	en_IE
IR	iran, islamic republic of, iran, islamic republic of iran	fa_IR
IL	israel, state of israel	he_IL
IT	italy, italian republic	it_IT
JP	japan	ja_JP
KR	korea, republic of, south korea	ko_KR
MX	mexico, united mexican states	es_MX
NL	netherlands, kingdom of the netherlands	nl_NL
NO	norway, kingdom of norway	no_NO
NP	nepal, federal democratic republic of nepal	ne_NP
NZ	new zealand	en_NZ
PH	philippines, republic of the philippines	en_PH, fil_PH
PL	poland, republic of poland	pl_PL
PT	portugal, portuguese republic	pt_PT
RO	romania	ro_RO
RU	russian federation, russia	ru_RU
SK	slovakia, slovak republic	sk_SK
SI	slovenia, republic of slovenia	sl_SI
SE	sweden, kingdom of sweden	sv_SE
TH	thailand, kingdom of thailand	th_TH
UA	ukraine	uk_UA
US	united states, united states of america	en_US

The following demonstrates how to create an Address annotation with the AddressLabels and locales argument. The Address annotation is then passed to the MetaExtractor as an annotation.

from synthesized import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels

address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        city='city',
        state='state',
        street='street',
        house_number='house_number',
        flat='flat'
    ),
    locales=["en_GB"]
)

df_meta = MetaExtractor.extract(df=data, annotations=[address])

It is possible to define multiple Address annotations if a dataset contains PII columns for more than one address. These must be created as separate Address objects with unique names, and then passed to the list of annotations, e.g.

MetaExtractor.extract(df=…, annotations=[address_1, address_2])

Configuration

The internal operation of the Address model can be configured through the HighDimConfig class. The following arguments can be set:

sample_addresses: (default False) If set to False, the Address model will generate new synthetic addresses. If set to True, the Address model will sample addresses from the original data.
learn_postcodes: (default False) If set to False, the Address model will generate addresses without learning the distribution of postcodes. If set to True, the Address model will learn the distribution of postcodes from the dataset and use this information to generate realistic addresses that are consistent with the postcode geolocation in the original data.
postcode_level: (default 0) The level of postcode to use for modelling and sampling addresses (explained in more detail in the Postcode Level section below). The postcode level can be set to 0, 1, or 2. A postcode level of 0 will use the first part of the postcode to model address geolocation, a postcode level of 1 will use the first two parts of the postcode, and a postcode level of 2 will use the full postcode.

If sample_addresses is set to True, the Address model will sample addresses from the original data. This means that original address data will be present within the synthesized data, which could be used to re-identify individuals. It is recommended to set sample_addresses to False when generating synthetic data for privacy-sensitive datasets.
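To see why sampled addresses carry a re-identification risk, one can count how many synthetic addresses also occur in the original data. The sketch below is a standard-library check with made-up addresses; leaked_addresses is a hypothetical helper, and the normalization it applies (lower-casing and collapsing whitespace) is an assumption chosen for illustration.

```python
def leaked_addresses(original, synthetic):
    """Return synthetic addresses that also appear in the original data.

    Addresses are compared after lower-casing and collapsing whitespace,
    so trivial formatting differences do not hide a match.
    """
    def normalise(address):
        return " ".join(address.lower().split())

    seen = {normalise(address) for address in original}
    return [address for address in synthetic if normalise(address) in seen]


# Made-up addresses for illustration only.
original = ["1 High Street, London", "22 Mill Lane, Leeds"]
synthetic = ["1 high street,  London", "7 Oak Road, York"]

print(leaked_addresses(original, synthetic))
```

With sample_addresses=True every generated address would be expected to appear in the original data; with sample_addresses=False any overlap should be coincidental.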

The following example demonstrates how to use the HighDimConfig class to configure the Address model with a set of parameters.

from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer

config = HighDimConfig(
    sample_addresses=False,
    learn_postcodes=True,
    postcode_level=1
)

synth = HighDimSynthesizer(df_meta=df_meta, config=config)

Depending on the configuration settings, the Address model will operate in different modes. The following sections describe the different modes of operation in more detail.

Sampling Addresses

The sample_addresses configuration setting allows the Address model to sample addresses from the original data or generate new synthetic addresses. The behaviour of the Address model is as follows:

If sample_addresses is set to False (the default behaviour), the Address model will generate new synthetic addresses. The country or countries to sample the addresses from can be set using the locales argument of the Address annotation, or by specifying a country label in the AddressLabels. If neither the locales argument nor the country label is set, the default locale of en_GB will be used to generate addresses.
If sample_addresses is set to True, the addresses will be sampled from the original data.

If sample_addresses is set to True, all other configuration settings (learn_postcodes and postcode_level) will be ignored. If they are supplied, a warning will be raised.

Learning postcodes

When sample_addresses is set to False, the Address model will generate synthetic addresses. Under this setting the model can learn the distribution of postcodes from the dataset using the learn_postcodes configuration setting. The behaviour is as follows:

If learn_postcodes is set to False (the default behaviour), the Address model will generate synthetic addresses without learning the distribution of postcodes. This means that the generated addresses will not be consistent with the postcode geolocation in the original data.
If learn_postcodes is set to True, the Address model will learn the distribution of postcodes present in the original dataset and generate realistic addresses that follow the distribution. The level at which the postcode is used for modelling and sampling addresses can be set using the postcode_level configuration setting which is explained in more detail in the next section.

It’s worth noting that when learning postcodes only a subset of the locales/countries are supported. The list of supported locales for postcode learning can be found by calling the provided utility method:

from synthesized.model.models.address import all_supported_postcode_locales
all_supported_postcode_locales()

['az_AZ',
 'cs_CZ',
 'da_DK',
 'de_AT',
 'de_CH',
 'de_DE',
 ...
 ]

When using the postcode learning feature (learn_postcodes=True) alongside the locales argument, only one locale can be used. If multiple locales are specified, an error will be raised.
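The rules described so far for combining these settings can be collected into a small pre-flight check. The sketch below mirrors the documented behaviour only; check_address_settings is a hypothetical helper, and the exact warnings and exceptions raised by the library itself may differ.

```python
import warnings


def check_address_settings(sample_addresses=False, learn_postcodes=False,
                           postcode_level=0, locales=None):
    """Check a combination of Address settings against the documented rules."""
    if postcode_level not in (0, 1, 2):
        raise ValueError("postcode_level must be 0, 1 or 2")
    if isinstance(locales, str):
        locales = [locales]  # a single locale may be given as a plain string

    if sample_addresses:
        # Sampling from the original data ignores the postcode settings.
        if learn_postcodes or postcode_level != 0:
            warnings.warn("sample_addresses=True: learn_postcodes and "
                          "postcode_level are ignored")
        return

    if learn_postcodes and locales and len(locales) > 1:
        raise ValueError("learn_postcodes=True supports only one locale")


# A valid combination: one locale with postcode learning enabled.
check_address_settings(learn_postcodes=True, postcode_level=1, locales="en_GB")
print("settings accepted")
```

Running a check like this before training can surface a conflicting combination early, rather than after metadata extraction.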

Postcode Level

The postcode_level configuration setting allows the Address model to use different levels of the postcode for modelling and sampling addresses. The postcode level can be set to 0, 1, or 2. The behaviour is as follows:

If postcode_level is set to 0, the Address model will use the first part of the postcode to model address geolocation. This means that the generated addresses will be geographically consistent with the original data at the level of the first part of the postcode which typically represents a large area such as a state or region. Because of this, the address label state will be kept consistent with the postcode if it is present.
If postcode_level is set to 1, the Address model will use the first two parts of the postcode to model address geolocation. This means that the generated addresses will be geographically consistent with the original data at the level of the first two parts of the postcode which typically represents a smaller area such as a city. Because of this, the address labels state and city will be kept consistent with the postcode if they are present.
If postcode_level is set to 2, the Address model will use the full postcode to model address geolocation. This means that the generated addresses will be geographically consistent with the original data at the level of the full postcode which represents a specific location such as a street. Because of this, the address labels state, city and street will be kept consistent with the postcode if they are present.

The following example demonstrates how the postcode level translates to a UK postcode:

For the postcode "EC2A 2DP":

postcode_level=0 will model to the level "EC"
postcode_level=1 will model to the level "EC2A"
postcode_level=2 will model to the level "EC2A 2DP"
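The mapping above can be reproduced with a few lines of Python. This is a sketch for UK-style postcodes only, assuming the outward and inward parts are separated by a space; uk_postcode_prefix is a hypothetical helper and not the parsing logic used by the Address model.

```python
import re


def uk_postcode_prefix(postcode, level):
    """Return the part of a UK postcode used at the given postcode level."""
    parts = postcode.strip().upper().split()
    if not parts:
        raise ValueError("empty postcode")
    outward = parts[0]  # e.g. "EC2A" in "EC2A 2DP"
    if level == 0:
        match = re.match(r"[A-Z]+", outward)  # leading letters, e.g. "EC"
        if match is None:
            raise ValueError(f"unexpected postcode format: {postcode!r}")
        return match.group(0)
    if level == 1:
        return outward
    if level == 2:
        return " ".join(parts)
    raise ValueError("level must be 0, 1 or 2")


for level in (0, 1, 2):
    print(level, uk_postcode_prefix("EC2A 2DP", level))
```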

If learn_postcodes is set to False, the postcode_level configuration setting will be ignored.

Through the use of these different levels you can generate addresses that are geographically consistent with the original data at different levels of granularity. The postcode level you choose may be influenced by the level of detail you want to preserve in the synthetic data and the level of privacy you want to maintain. Higher values of postcode_level will result in more geographically accurate addresses but may also increase the risk of re-identification.
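The privacy trade-off can be illustrated by counting how many records share the same postcode prefix at each level: the smaller the smallest group, the easier it is to single out an individual. The sketch below uses illustrative UK-style postcodes and hypothetical helpers (prefix, smallest_group); it is a rough indicator, not a formal privacy measure.

```python
import re
from collections import Counter


def prefix(postcode, level):
    """Postcode prefix at a given level, assuming a space-separated UK-style postcode."""
    outward = postcode.split()[0]
    if level == 0:
        return re.match(r"[A-Z]+", outward).group(0)
    return outward if level == 1 else postcode


def smallest_group(postcodes, level):
    """Size of the smallest group of records sharing a prefix at this level."""
    return min(Counter(prefix(p, level) for p in postcodes).values())


# Illustrative values only.
postcodes = ["EC2A 2DP", "EC2A 4NE", "EC1V 9BD", "EC1V 2NX", "EC2A 2DP"]

for level in (0, 1, 2):
    print("level", level, "smallest group:", smallest_group(postcodes, level))
```

In this toy data the smallest group shrinks as the level increases, which is the pattern behind the recommendation to prefer lower levels for privacy-sensitive data.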

In all postcode levels new synthetic values for the labels house_number and flat will be generated.

Example 1: Using the `locales` argument and no postcode learning

In this example, we will generate synthetic addresses that correspond to the specified locales using the locales argument. The input dataset is as follows:

     postcode                 city                                      street        county
0     B38 4EY            Lyndaland                                  Kate coves     Glamorgan
1  89638-7516  San Nancy los altos                     Prolongación Montenegro       Sinaloa
2  42024-5681   San Irma los altos                             Privada Durango       Morelos
3     B1S 6DN         Barbarahaven                              Jones causeway    Merseyside
4       45590             Roseview                               Parks Corners    New Mexico
5       40272         Woodsborough                                Carrie Ranch          Utah
6     W88 2RY     South Darrenfort                                   Hill mall     Hampshire
7       75988          Lake Steven                                 James Fords      Michigan
8  53212-7399          Nueva Libia  Circunvalación República Unida de Tanzanía     Querétaro
9     E6T 4GL      New Richardbury                               Foster groves  Warwickshire

[10 rows x 4 columns]

The dataset contains columns for postcode, city, street, county. We will use the Address annotation and AddressLabels to associate the address fields together. The dataset also contains addresses from the UK, the US and Mexico. We will use the locales argument to generate addresses that correspond to these countries.

from synthesized.metadata.value import Address
from synthesized.config import AddressLabels

address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        city='city',
        street='street',
        state='county'
    ),
    locales=["en_GB", "es_MX", "en_US"]
)

We will then use the MetaExtractor class to extract the metadata from the original data and pass the Address annotation to the annotations parameter.

from synthesized import MetaExtractor

df_meta = MetaExtractor.extract(df=data, annotations=[address])

Finally, we will use the HighDimSynthesizer class to generate synthetic data. We will use the configuration option sample_addresses=False to generate new synthetic addresses.

from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer

config = HighDimConfig(
   sample_addresses=False,
)
synthesizer = HighDimSynthesizer(
   df_meta=df_meta,
   config=config
)

synthesizer.learn(data)

The trained model can then be used to generate the new synthetic addresses.

df_synthesized = synthesizer.synthesize(num_rows=10)
df_synthesized

     postcode                 city                                      street        county
0     B38 4EY            Lyndaland                                  Kate coves     Glamorgan
1  89638-7516  San Nancy los altos                     Prolongación Montenegro       Sinaloa
2  42024-5681   San Irma los altos                             Privada Durango       Morelos
3     B1S 6DN         Barbarahaven                              Jones causeway    Merseyside
4       45590             Roseview                               Parks Corners    New Mexico
5       40272         Woodsborough                                Carrie Ranch          Utah
6     W88 2RY     South Darrenfort                                   Hill mall     Hampshire
7       75988          Lake Steven                                 James Fords      Michigan
8  53212-7399          Nueva Libia  Circunvalación República Unida de Tanzanía     Querétaro
9     E6T 4GL      New Richardbury                               Foster groves  Warwickshire

[10 rows x 4 columns]

As can be seen in the generated data, the address components are consistent with the specified locales.

Example 2: Using the `country` label and no postcode learning

In this example, we will generate synthetic addresses that are related to the original data at the country level using the country label.

We will use the following dataset as the original data:

  country    postcode                           city                      street               county
0      UK    EN1W 8QZ                South Charlotte                Graham union      Buckinghamshire
1     USA       31258                     Greenburgh                  Peter Spur        New Hampshire
2      UK    RG08 9QP                     South Dawn                Sharp drives        Cardiganshire
3  Mexico       66471            San Julia los bajos  Continuación Sur Henríquez     Distrito Federal
4  Mexico  64913-3915                 Nueva Singapur             Calle Zacatecas              Morelos
5     USA       40102                       Johnport              Stephen Plains             Missouri
6     USA       96353                  Villegashaven              Bennett Stream              Montana
7     USA       72485             Lake Jennifermouth          Thompson Stravenue              Arizona
8      UK     N6E 3SJ                   Campbellberg              Victoria coves        Cardiganshire
9  Mexico  64079-8506  San María Luisa de la Montaña          Callejón Sur Baeza  Michoacán de Ocampo

[10 rows x 5 columns]

The dataset contains the columns country, postcode, city, street and county. We will use the Address annotation and AddressLabels to associate the address fields together.

from synthesized.metadata.value import Address
from synthesized.config import AddressLabels

address = Address(
    name='address',
    labels=AddressLabels(
        country='country',
        postcode='postcode',
        city='city',
        street='street',
        state='county'
    )
)

We will then use the MetaExtractor class to extract the metadata from the original data and pass the Address annotation to the annotations parameter.

from synthesized import MetaExtractor

df_meta = MetaExtractor.extract(df=data, annotations=[address])

Finally, we will use the HighDimSynthesizer class to generate synthetic data. We will use the configuration options, sample_addresses=False and learn_postcodes=False to generate new synthetic addresses that are related to the original data at the country level.

from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer

config = HighDimConfig(
   sample_addresses=False,
   learn_postcodes=False,
)
synthesizer = HighDimSynthesizer(
   df_meta=df_meta,
   config=config
)

synthesizer.learn(data)

The trained model can then be used to generate the new synthetic addresses.

df_synthesized = synthesizer.synthesize(num_rows=10)
df_synthesized

  country  postcode                                    city                                          street               county
0  Mexico     62154                      San Ruby los bajos  Retorno República Popular Democrática de Corea  Baja California Sur
1      UK    M7 9PT                             Wilsonburgh                                Brown trafficway               Surrey
2  Mexico     05695  Vieja República de Macedonia del Norte                                Diagonal Bahrein            Zacatecas
3  Mexico     63580              San Conchita de la Montaña          Pasaje Veracruz de Ignacio de la Llave              Durango
4  Mexico     63339               San Claudio de la Montaña        Peatonal Veracruz de Ignacio de la Llave              Sinaloa
5  Mexico     86445                             Vieja Gabón                               Corredor Sur Raya              Sinaloa
6  Mexico     61841                          Nueva Lituania                                 Calzada Camboya             Tlaxcala
7      UK  DD75 5BX                     Port Vanessachester                                     Bishop pass       Orkney Islands
8      UK   E92 5TX                        Port Seanborough                                   Walker drives           Inverclyde
9      UK  SG4V 3JB                                Woodland                                   Dean turnpike      Gloucestershire

[10 rows x 5 columns]

As can be seen in the generated data, the addresses are related to the original data at the country level. For the remaining components of the address, new consistent synthetic values have been generated. The distribution of countries in the original data has been preserved in the synthetic data.

Example 3: Using the `country` label and learning postcodes

In this example we will use the same dataset as in the previous example, but this time we will generate synthetic addresses that are related to the original data at the postcode level. We will use the learn_postcodes=True and postcode_level=1 configuration settings to learn the distribution of postcodes from the dataset and generate addresses that are geographically consistent with the original data at the level of the first two parts of the postcode.

The input dataset:

  country    postcode                           city                      street               county
0      UK    EN1W 8QZ                South Charlotte                Graham union      Buckinghamshire
1     USA       31258                     Greenburgh                  Peter Spur        New Hampshire
2      UK    RG08 9QP                     South Dawn                Sharp drives        Cardiganshire
3  Mexico       66471            San Julia los bajos  Continuación Sur Henríquez     Distrito Federal
4  Mexico  64913-3915                 Nueva Singapur             Calle Zacatecas              Morelos
5     USA       40102                       Johnport              Stephen Plains             Missouri
6     USA       96353                  Villegashaven              Bennett Stream              Montana
7     USA       72485             Lake Jennifermouth          Thompson Stravenue              Arizona
8      UK     N6E 3SJ                   Campbellberg              Victoria coves        Cardiganshire
9  Mexico  64079-8506  San María Luisa de la Montaña          Callejón Sur Baeza  Michoacán de Ocampo

[10 rows x 5 columns]

Again we create the Address annotation and AddressLabels to associate the address fields together.

from synthesized.metadata.value import Address
from synthesized.config import AddressLabels

address = Address(
    name='address',
    labels=AddressLabels(
        country='country',
        postcode='postcode',
        city='city',
        street='street',
        state='county'
    )
)

We then extract the metadata from the original data and pass the Address annotation to the annotations parameter.

from synthesized import MetaExtractor

df_meta = MetaExtractor.extract(df=data, annotations=[address])

Finally, we use the HighDimSynthesizer class to generate synthetic data. We set the configuration options sample_addresses=False, learn_postcodes=True and postcode_level=1 so that the new synthetic addresses are related to the original data at the specified postcode level.

from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer

config = HighDimConfig(
    sample_addresses=False,
    learn_postcodes=True,
    postcode_level=1
)
synthesizer = HighDimSynthesizer(
    df_meta=df_meta,
    config=config
)

synthesizer.learn(data)

The trained model can then be used to generate the new synthetic addresses.

df_synthesized = synthesizer.synthesize(num_rows=10)
df_synthesized

  country  postcode                           city           street               county
0      UK   N6E 1JE                   Campbellberg     Conor estate        Cardiganshire
1      UK   N6E 0HF                   Campbellberg         Lee walk        Cardiganshire
2  Mexico     64907                 Nueva Singapur  Calle Sur Vigil              Morelos
3      UK   N6E 5BT                   Campbellberg     Natalie lane        Cardiganshire
4      UK  EN1W 7RZ                South Charlotte    Mills squares      Buckinghamshire
5     USA     72482             Lake Jennifermouth       Donna Ford              Arizona
6  Mexico     64075  San María Luisa de la Montaña      Calle Olmos  Michoacán de Ocampo
7     USA     72407             Lake Jennifermouth      Johnson Via              Arizona
8     USA     40181                       Johnport     Scott Bypass             Missouri
9      UK  EN1W 4RL                South Charlotte      Holmes mill      Buckinghamshire

[10 rows x 5 columns]

As can be seen in the generated data, the addresses are related to the original data at the postcode level. The first two parts of each postcode are preserved in the synthetic data, but the third part is generated. The fields county and city are also preserved and associated with the postcode. However, new values for the street field have been generated.
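
One way to check this claim is to compare postcode prefixes between the two tables. The sketch below does so for the UK rows only, assuming the preserved portion of a UK postcode is the part before the space. That assumption is drawn from the output shown above, not from the SDK's definition of postcode_level, and `outward_code` is a hypothetical helper.

```python
def outward_code(postcode):
    """Return the part of a UK postcode before the space, e.g. 'N6E' for 'N6E 3SJ'."""
    return postcode.split(' ')[0]


# UK postcodes copied from the input and synthetic tables above.
original_uk = ['EN1W 8QZ', 'RG08 9QP', 'N6E 3SJ']
synthetic_uk = ['N6E 1JE', 'N6E 0HF', 'N6E 5BT', 'EN1W 7RZ', 'EN1W 4RL']

original_prefixes = {outward_code(p) for p in original_uk}
synthetic_prefixes = {outward_code(p) for p in synthetic_uk}

# Prefixes that occur in the synthetic data but not in the original data.
unseen_prefixes = synthetic_prefixes - original_prefixes

# Full postcodes that were copied unchanged from the original data.
copied_postcodes = set(synthetic_uk) & set(original_uk)

print(sorted(unseen_prefixes), sorted(copied_postcodes))
```

For these rows both lists are empty: every synthetic prefix already occurs in the input, while no complete postcode is reused.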

Bank

Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.

from synthesized.metadata.value import Bank
from synthesized.config import BankLabels

The columns of a dataset that relate to the bank account attributes are specified using BankLabels.

bank = Bank(
    name='bank',
    labels=BankLabels(
        sort_code='sort_code',
        account='account_number'
    )
)
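
The generated values can be sanity-checked against the two supported formats. The sketch below uses only Python's `re` module. It assumes sort codes may be written with or without hyphens or spaces between digit pairs, which is an assumption about presentation, and the helper names are hypothetical rather than part of the SDK.

```python
import re

ACCOUNT_RE = re.compile(r'\d{8}')
SORT_CODE_RE = re.compile(r'\d{6}')


def is_valid_account(value):
    """True if `value` consists of exactly eight digits."""
    return ACCOUNT_RE.fullmatch(str(value)) is not None


def is_valid_sort_code(value):
    """True if `value` has exactly six digits once '-' and spaces are removed."""
    digits = str(value).replace('-', '').replace(' ', '')
    return SORT_CODE_RE.fullmatch(digits) is not None


# Made-up values for illustration only.
print(is_valid_account('12345678'), is_valid_sort_code('12-34-56'))
```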

Company

Defining a Company annotation allows Synthesized to generate fake company entities.

Below are some examples of how to use the Company annotation.

We start with a simple data frame (the company names can be anything, as long as they are strings):

import pandas as pd

df = pd.DataFrame({
    'company_name': ['Apple Inc.', 'Microsoft Corporation', 'Google LLC'],
    'employee_count': [10000, 50000, 100000],
})
df

company_name	employee_count
Apple Inc.	10000
Microsoft Corporation	50000
Google LLC	100000

First we specify a locale to generate the company names for. The default locale is en_GB.

from synthesized import HighDimSynthesizer
from synthesized.metadata.value import Company
from synthesized.config import CompanyLabels

# Set the locale of the company annotation to `locales`.
locales = ["en_GB"]

ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name'),
    locales=locales
)

# Create the synthesizer and synthesize.
synthesizer = HighDimSynthesizer.from_df(df, annotations=[ann])
synthesizer.fit(df)
synthesizer.sample(3)

company_name	employee_count
Thomson LLC	450
Harris Group	800000
Malder Ltd.	100000

It is also possible to specify multiple locales for the company annotation. This can be done by passing a list of locales to the locales parameter of the Company annotation.

locales = ["en_GB", "de_DE", "fr_FR"]

In order to generate company names along with their countries, we can use the country label when creating the annotation. Note that the original dataset must also contain the column assigned to the country label ("countries" in the example below), even if all of its values are "".

ann = Company(
    'company_annotation',
    labels=CompanyLabels(full_name='company_name', country='countries'),
    locales=locales
)

company_name	employee_count	countries
Thomson LLC	450	United Kingdom
Sager GmbH	800000	Germany
Poirier SA	100000	France

FormattedString

A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers or customer account numbers that have a specific format.

from synthesized.metadata.value.categorical import FormattedString

The FormattedString is defined by passing the respective column name, and a regex pattern:

from synthesized import MetaExtractor

regex = "^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$"
social_security = FormattedString(name="social_security_number", pattern=regex)

df_meta = MetaExtractor.extract(df=data, annotations=[social_security])
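
To see which strings the pattern above accepts, it can be tried directly with Python's `re` module. This is a minimal sketch that does not involve Synthesized; `matches_format` is a hypothetical helper and the sample values are made up.

```python
import re

# Same pattern as in the annotation above, written as a raw string.
SSN_PATTERN = re.compile(r'^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$')


def matches_format(value):
    """True if the whole of `value` conforms to the pattern."""
    return SSN_PATTERN.fullmatch(value) is not None


# Made-up values: the first is well formed, the others each break one rule.
samples = ['123-45-6789', '666-45-6789', '912-45-6789', '123-00-6789', '123-45-0000']
for value in samples:
    print(value, matches_format(value))
```

The negative lookaheads reject a first group of 000, 666 or 900 to 999, a middle group of 00 and a final group of 0000, so only the first sample passes.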
