Annotations

The full source code for this example is available for download here. The dataset used in this example is available for download here.

Prerequisites

This tutorial assumes that you have already installed the Synthesized package and understand how to use the tabular synthesizer. If you are new to Synthesized, we recommend starting with the quickstart guide and/or the single table synthesis tutorial before working through this one.

Introduction

In this tutorial we will demonstrate how the SDK can be used to annotate linked columns in a dataset. Annotation is required to generate production-like data containing fake PII that is not linked to any entity in the original dataset.

For more information on the techniques used in this tutorial, please refer to the documentation.

PII Dataset

In this tutorial we will utilise a dataset containing personally identifiable information (PII) in the form of names and addresses.

import pandas as pd

df = pd.read_csv("pii_dataset.csv")
df
             gender     title    first_name    last_name	                             email        name_partner	 gender_partner	    postcode	            city	        street	                        full_address
0	          Male	       Mr	     Imanol	      Kirlin	        imanol_kirlin@faulkner.com	     Mila Weissnat	         Female	    AB10 1AB	        Aberdeen	  Broad Street	     Broad Street, AB10 1AB Aberdeen
1	        Female	       Ms	    Claudie    Rodriguez           claudierodriguez91@haas.com	    Jorja Schuster	         Female	     IM1 1AG	     Isle of Man     Circular Road    Circular Road, IM1 1AG Isle of Man
2	          Male	       Mr	     Ismael	      Zemlak ismael-zemlak45@jackson-campbell.info	      Jalon Glover	           Male	    TN34 2EZ	        Hastings	 Baldslow Road	    Baldslow Road, TN34 2EZ Hastings
3	    Non-Binary	       Mx	      Jesus	  Rutherford	      jesus-rutherford61@nunez.com	       Martin Kihn	           Male	    LA22 9HA	       Ambleside	     Kirkfield	       Kirkfield, LA22 9HA Ambleside
4	        Female	      Mrs	     Leslee	       Brown	         leslee_brown42@mendez.org	   Derrell Keebler	           Male	      W9 2BT	          London	 Shirland Road	        Shirland Road, W9 2BT London
...	        ...	          ...	        ...	        ...	                                ...	              ...	                ...	         ...	             ...	           ...	                                ...
6068        Female	       Ms	    Louetta     O'Conner	    louetta_o'conner@gallagher.com	        Obed Terry	           Male	     HG4 2QN	           Ripon	Bishopton Lane	       Bishopton Lane, HG4 2QN Ripon
6069    Non-Binary	       Mx	      Fleet	    Thompson	       fleet_thompson@thompson.com	Leeann Stoltenberg	     Non-Binary	    EH10 4AN	       Edinburgh	 Falcon Avenue	   Falcon Avenue, EH10 4AN Edinburgh
6070          Male	       Mr	   Pleasant	    Kshlerin	    pleasant.kshlerin69@leonard.org	   Evelyne Bernier	         Female	     CM8 1SX	          Witham	  Holst Avenue	        Holst Avenue, CM8 1SX Witham
6071    Non-Binary	       Mx	     Tilden	     Dickens	         tilden.dickens@alvarez.org	      Savion Johns	           Male	     HA1 2RZ	          Harrow  Rosslyn Crescent      Rosslyn Crescent, HA1 2RZ Harrow
6072          Male	       Mr	       Lena	     Kilback	            lena.kilback19@lowe.com	    Rosanne Turner	         Female	    LN13 0AB	          Alford  Christopher Road     Christopher Road, LN13 0AB Alford

[6073 rows × 11 columns]

Each row of this dataset contains personal information about an individual, their partner and their address.

As discussed in the documentation, strict relationships often exist between features in a dataset that describe a single entity. For instance, in this dataset the full address must be consistent with the columns that hold its granular attributes: postcode, city and street. Internal consistency of linked data is often required for downstream data processing tasks.
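The consistency requirement can be made concrete with a small check. The sketch below is not part of the SDK: `is_address_consistent` is a hypothetical helper, and it assumes the full address follows the "street, postcode city" layout seen in the sample rows, optionally preceded by a house number.

```python
def is_address_consistent(row):
    """Return True if full_address agrees with street, postcode and city.

    Assumes the layout "<street>, <postcode> <city>", where the street
    part may start with a house number.
    """
    street_part, separator, rest = row["full_address"].partition(", ")
    if not separator:
        # No ", " found, so the assumed layout does not apply.
        return False
    # Compare the end of the street part so a leading house number is allowed.
    street_ok = street_part.endswith(row["street"])
    rest_ok = rest == f'{row["postcode"]} {row["city"]}'
    return street_ok and rest_ok


# First row of the dataset shown above, written as a plain dict.
row = {
    "street": "Broad Street",
    "postcode": "AB10 1AB",
    "city": "Aberdeen",
    "full_address": "Broad Street, AB10 1AB Aberdeen",
}
print(is_address_consistent(row))  # True
```

Because the helper only indexes the row by column name, it should also work row-wise on a DataFrame, for example with `df.apply(is_address_consistent, axis=1)`.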

Similarly, real postcodes/zipcodes are often linked to geographic areas via their constituent pieces. Using the UK postcode "SW19 5AE" as an example, the structure can be broken down into several components:

  1. "SW" refers to the postcode area, in this case south-west London

  2. "SW19" refers to the district in the area, which in this case covers Wimbledon and Merton

  3. "SW19 5AE" refers to a specific set of addresses in the area

We will refer to these levels of geographic specificity as postcode levels.
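The three levels can be illustrated by splitting a postcode string. The following is a minimal sketch using a simplified regular expression; `postcode_levels` is a hypothetical helper, not an SDK function, and the pattern only captures the general shape of a UK postcode rather than validating it against real postcode data.

```python
import re

# Simplified shape of a UK postcode:
#   area (1-2 letters), rest of the district (digit plus optional digit/letter),
#   then the inward code (digit followed by two letters).
_POSTCODE_RE = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?)\s*(\d[A-Z]{2})$")


def postcode_levels(postcode):
    """Split a postcode into (area, district, full postcode)."""
    match = _POSTCODE_RE.match(postcode.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised postcode format: {postcode!r}")
    area, district_suffix, inward = match.groups()
    district = area + district_suffix
    return area, district, f"{district} {inward}"


print(postcode_levels("SW19 5AE"))  # ('SW', 'SW19', 'SW19 5AE')
```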

By default, the AddressModel is configured for UK postcodes, but it can be configured for other locales.

Using the AddressModel, a user can generate realistic new postcodes and addresses that are consistent with the geographic information specified in the original data.

Entity Annotation

The SDK can generate completely new PII data that is representative of the original data through Entity Annotation. The SDK does not automatically detect these entities in the data; the user must specify them, and this tutorial illustrates how to do so. To label a set of columns as pertaining to a specific entity, this example uses the AddressLabels and PersonLabels classes. Note that there are many more entity types available, a full list of which can be found here.

Consider that we have a dataset with columns relating to a person, their partner and their address. The columns contain the first name, last name, title, gender and email address for the person. For their partner, the columns contain their full name and gender. The address columns contain the postcode, city, street and full address.

We will use the PersonLabels and AddressLabels classes to label these columns as such:

from synthesized import MetaExtractor
from synthesized.config import AddressLabels, PersonLabels
from synthesized.metadata.value import Address, Person

address = Address(name="address", labels=AddressLabels(postcode="postcode", street="street", city="city", full_address="full_address"))
person = Person(name="person", labels=PersonLabels(firstname="first_name", lastname="last_name", title="title", gender="gender", email="email"))
person_partner = Person(name="person_partner", labels=PersonLabels(fullname="name_partner", gender="gender_partner"))

df_meta = MetaExtractor.extract(df, annotations=[address, person, person_partner])
print(list(df_meta.children))
[<Nominal[object]: Address(name=address)>,
 <Nominal[object]: Person(name=person)>,
 <Nominal[object]: Person(name=person_partner)>]

We use the metadata and a HighDimConfig object to create a HighDimSynthesizer object, as usual. In the HighDimConfig object we specify address_locale as "en_GB" and set the boolean flag sample_addresses to False. This flag controls whether addresses are randomly sampled from the original data (sample_addresses=True) or entirely new ones are generated (sample_addresses=False). The flag is False by default, as required by most compliance tasks, but it is written out here explicitly for clarity.

from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    sample_addresses=False,
    address_locale="en_GB",
)

synth = HighDimSynthesizer(df_meta, config=config)
synth._df_model.children
[AddressModel(meta=<Nominal[object]: Address(name=address)>),
 PersonModel(meta=<Nominal[object]: Person(name=person)>),
 PersonModel(meta=<Nominal[object]: Person(name=person_partner)>)]

We then train it:

synth.learn(df)

By default, the AddressModel and PersonModel will generate brand new data:

df_synth = synth.synthesize(100)
df_synth
    gender	title	first_name	last_name	email	name_partner	gender_partner	postcode	city	street	full_address
0	Female	Mrs	Amanda	Davies	amandadavies@hopkins-roberts.com	Josh Rogers	Male	N3B 4XJ	Andersonshire	Barton manor	3 Barton manor, N3B 4XJ Andersonshire
1	Female	Mrs	Charlene	Smith	charlene-smith@simpson-mitchell.biz	Abdul Buckley	Non-Binary	WA8 1YH	Robinfurt	Rees islands	927 Rees islands, WA8 1YH Robinfurt
2	Male	Mr	Bryan	Thompson	bryanthompson@moss.com	Christine Scott	Female	BH44 2JU	East Sam	Davies mission	4 Davies mission, BH44 2JU East Sam
3	Male	Mr	Robert	Sutton	robert_sutton79@evans-baker.com	Robert Jones	Male	B3 6XD	South Marianmouth	Jones coves	28 Jones coves, B3 6XD South Marianmouth
4	Female	Ms	Jasmine	North	jasminenorth@jackson.com	Gail Rogers	Non-Binary	B5A 0YD	Shannonmouth	Green islands	83 Green islands, B5A 0YD Shannonmout
...	...	...	...	...	...	...	...	...
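One simple way to spot-check that generated values are new is to measure how many of them also occur in the original column. The sketch below is an illustration only: `overlap_fraction` is a hypothetical helper rather than an SDK function, and the two lists are copied from the postcode columns of the tables shown above.

```python
def overlap_fraction(original_values, synthetic_values):
    """Fraction of synthetic values that also occur in the original values."""
    original_set = set(original_values)
    synthetic_list = list(synthetic_values)
    if not synthetic_list:
        return 0.0
    hits = sum(1 for value in synthetic_list if value in original_set)
    return hits / len(synthetic_list)


# Postcodes from the first five rows of the original and synthetic tables.
original_postcodes = ["AB10 1AB", "IM1 1AG", "TN34 2EZ", "LA22 9HA", "W9 2BT"]
synthetic_postcodes = ["N3B 4XJ", "WA8 1YH", "BH44 2JU", "B3 6XD", "B5A 0YD"]

print(overlap_fraction(original_postcodes, synthetic_postcodes))  # 0.0
```

The helper accepts any iterable, so whole columns such as `df["full_address"]` and `df_synth["full_address"]` could be passed instead of the short lists used here.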

Note that the email address for each entity is consistent with the first name and last name of the individual, and that the postcode, city and street fields are consistent with the full_address. However, the address data is not realistic: the postcodes are not matched to geographic data such as the city, as described in the introduction.
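The name and email consistency can be spot-checked with a simple heuristic. The sketch below assumes that a consistent email contains both the first name and the last name in the part before the "@", which holds for the rows shown in this tutorial; `email_matches_name` is a hypothetical helper, not part of the SDK.

```python
def email_matches_name(first_name, last_name, email):
    """Heuristic: both names appear in the part of the email before the '@'."""
    local_part = email.split("@", 1)[0].lower()
    return first_name.lower() in local_part and last_name.lower() in local_part


# Row 0 of the synthetic output above: name and email belong together.
print(email_matches_name("Amanda", "Davies", "amandadavies@hopkins-roberts.com"))  # True

# Name from row 0 paired with the email from row 2: no match.
print(email_matches_name("Amanda", "Davies", "bryanthompson@moss.com"))  # False
```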

In the following section we will demonstrate how we can configure the AddressModel to generate postcodes consistent with the geographic constraints present in the original data.

Generating fuzzed postcodes

Fuzzing postcodes means modifying the postcode of the address but maintaining some level of geographical locality. Users can configure the HighDimSynthesizer object using a HighDimConfig to learn portions of postcodes, and then conditionally synthesize data to ensure that geographic consistency is maintained across the generated data.

from synthesized.config import HighDimConfig

config = HighDimConfig(
    learn_postcodes=True,
    address_locale="en_GB",
    postcode_level=0,
)

The learn_postcodes argument must be set to True to ensure that the SDK generates postcodes based on the original data, rather than generating entirely new examples. Currently, address_locale must also be specified. The postcode_level argument can be set to 0, 1 or 2, with the values matching the three postcode levels described in the introduction.

By specifying postcode_level=0 in the HighDimConfig object above, we configure the SDK to produce postcodes that match those seen in the original data only at the first level, with new values generated for the final portion of the postcode.

We can observe this behaviour by training a HighDimSynthesizer

synth = HighDimSynthesizer(df_meta, config=config)
synth.learn(df)

and generating new data

df_synth = synth.synthesize(100)
df_synth
gender	title	first_name	last_name	email	name_partner	gender_partner	postcode	city	street	full_address
0	Female	Ms	Clare	Hill	clare.hill93@davies.com	Dawn Riley	Female	KT32 5YB	Port Daniel	Deborah mission	51 Deborah mission, KT32 5YB Port Daniel
1	Female	Mrs	Rachael	Dean	rachael-dean@shepherd.com	Henry Sims	Male	M3 3EN	Antonychester	Morgan lock	5 Morgan lock, M3 3EN Antonychester
2	Male	Mr	Mark	Martin	mark.martin@smith.com	Martyn Smith	Male	PR62 3SH	Rowleyberg	Shaw trail	807 Shaw trail, PR62 3SH Rowleyberg
3	Female	Ms	Jemma	Warner	jemma-warner@peters-howard.net	Jacqueline Singh	Female	RH2J 5BZ	South June	Barry cliffs	45 Barry cliffs, RH2J 5BZ South June
4	Female	Mrs	Kelly	Mahmood	kelly-mahmood@morgan-cunningham.info	Diane Mills	Female	DT3E 9SQ	Dayview	June pine	63 June pine, DT3E 9SQ Dayview
...	...	...	...	...	...	...	...	...	...	...	...
95	Male	Mr	Harry	Thomas	harry.thomas61@naylor.com	Charlotte Lane	Female	DD8 0PW	Dianaport	Mohamed freeway	987 Mohamed freeway, DD8 0PW Dianaport
96	Non-Binary	Mx	Stephanie	Hill	stephanie-hill@lawson.com	Jasmine Cooper	Non-Binary	DN2J 9PN	Lake Rosston	Robinson field	0 Robinson field, DN2J 9PN Lake Rosston
97	Male	Mr	Leslie	Parker	leslie.parker@evans-jackson.com	Andrew Harrison	Male	NP0 1WX	Leahfurt	Connor stravenue	56 Connor stravenue, NP0 1WX Leahfurt
98	Female	Ms	Rosemary	Vincent	rosemary_vincent@newton.com	Conor Lewis	Male	TA5S 2JY	Port Thomasside	Christine keys	1 Christine keys, TA5S 2JY Port Thomasside
99	Male	Mr	Scott	Roberts	scott.roberts53@willis.org	Joanne Holt	Female	L9N 2QB	East Emilymouth	Rebecca neck	494 Rebecca neck, L9N 2QB East Emilymouth
100 rows × 11 columns

However, brand new cities have also been generated, so the generated postcodes are not consistent with the geographic information in the original data. To ensure consistency between the generated data and the original, a user can conditionally generate address data:

# Using the original data to conditionally generate new data
df[["postcode", "city"]][0:4]
    postcode	city
0	AB10 1AB	Aberdeen
1	IM1 1AG	Isle of Man
2	TN34 2EZ	Hastings
3	LA22 9HA	Ambleside
df_address = synth._df_model.children[0].sample(conditions=df[["postcode", "city"]])
synth._df_model.children[0].meta.convert_df_for_children(df_address)
df_address
postcode	city	street	full_address
0	LA22 8FE	Ambleside	Patel river	978 Patel river, LA22 8FE Ambleside
1	AB10 4ZU	Aberdeen	Atkins ridge	79 Atkins ridge, AB10 4ZU Aberdeen
2	IM1 6BE	Isle of Man	Brian coves	45 Brian coves, IM1 6BE Isle of Man
3	TN34 6SL	Hastings	Francesca forks	3 Francesca forks, TN34 6SL Hastings

The generated postcodes have the same starting portion as those of the original dataset, while the later portions are completely new and synthetic.
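To make the idea of fuzzing concrete, the sketch below keeps the first part of a postcode and draws a random replacement for the rest. It is an illustration of the general technique under simple assumptions, not the SDK's implementation: `fuzz_postcode` is a hypothetical helper, the split is made at the space in the postcode, and no attempt is made to check that the result is a valid or unassigned postcode.

```python
import random
import string


def fuzz_postcode(postcode, rng):
    """Keep the part before the space (e.g. "AB10") and randomise the rest."""
    outward = postcode.strip().upper().split()[0]
    # New inward code: one digit followed by two uppercase letters.
    digit = rng.choice(string.digits)
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(2))
    return f"{outward} {digit}{letters}"


rng = random.Random(0)  # seeded so repeated runs give the same values
for original in ["AB10 1AB", "IM1 1AG", "TN34 2EZ", "LA22 9HA"]:
    fuzzed = fuzz_postcode(original, rng)
    # The starting portion is preserved while the final portion is redrawn.
    print(original, "->", fuzzed)
```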

Extending to different countries

The example above demonstrated how the SDK can be used to generate privacy compliant data for individuals with UK addresses. However, both the PersonModel and AddressModel are extensible and can be used across multiple locales.

For example, the SDK can be used to generate brand new US addresses by specifying the correct locale in the HighDimConfig object:

config = HighDimConfig(
    address_locale="en_US"
)
synth = HighDimSynthesizer(df_meta, config=config)
synth.learn(df)
synth.synthesize(100)

The list of currently supported locales can be found by calling the provided utility method:

from synthesized.model.models.address import all_supported_locales
all_supported_locales()
['az_AZ',
 'cs_CZ',
 'da_DK',
 'de_AT',
 'de_CH',
 'de_DE',
 ...
 ]

Note that if the dataset contains postcodes, only a subset of the locales is currently supported. The list of locales that support postcodes can be found by calling the provided utility method:

from synthesized.model.models.address import all_supported_postcode_locales
all_supported_postcode_locales()
['az_AZ',
 'cs_CZ',
 'da_DK',
 'de_AT',
 'de_CH',
 'de_DE',
 ...
 ]