Annotations

  • The code used in this example is available for download here.

  • The dataset used in this example is available for download here.

Prerequisites

This tutorial assumes that you have already installed the Synthesized package and understand how to use the tabular synthesizer. If you are new to Synthesized, we recommend starting with the quickstart guide and/or the single table synthesis tutorial before working through this one.

Introduction

In this tutorial we will demonstrate how the SDK can be used to annotate linked columns in a dataset. By annotating columns, you can use the SDK to generate synthetic personally identifiable information (PII) that isn’t linked to any entity in the original dataset.

For more information on the techniques used in this tutorial, please refer to the documentation.

PII Dataset

In this tutorial we will use a dataset containing PII in the form of names and addresses.

import pandas as pd

df = pd.read_csv("pii_dataset.csv")
df
          gender  title  first_name   last_name                                  email        name_partner  gender_partner  postcode         city            street                        full_address
0           Male     Mr      Imanol      Kirlin             imanol_kirlin@faulkner.com       Mila Weissnat          Female  AB10 1AB     Aberdeen      Broad Street     Broad Street, AB10 1AB Aberdeen
1         Female     Ms     Claudie   Rodriguez            claudierodriguez91@haas.com      Jorja Schuster          Female   IM1 1AG  Isle of Man     Circular Road  Circular Road, IM1 1AG Isle of Man
2           Male     Mr      Ismael      Zemlak  ismael-zemlak45@jackson-campbell.info        Jalon Glover            Male  TN34 2EZ     Hastings     Baldslow Road    Baldslow Road, TN34 2EZ Hastings
3     Non-Binary     Mx       Jesus  Rutherford           jesus-rutherford61@nunez.com         Martin Kihn            Male  LA22 9HA    Ambleside         Kirkfield       Kirkfield, LA22 9HA Ambleside
4         Female    Mrs      Leslee       Brown              leslee_brown42@mendez.org     Derrell Keebler            Male    W9 2BT       London     Shirland Road        Shirland Road, W9 2BT London
...          ...    ...         ...         ...                                    ...                 ...             ...       ...          ...               ...                                 ...
6068      Female     Ms     Louetta    O'Conner         louetta_o'conner@gallagher.com          Obed Terry            Male   HG4 2QN        Ripon    Bishopton Lane       Bishopton Lane, HG4 2QN Ripon
6069  Non-Binary     Mx       Fleet    Thompson            fleet_thompson@thompson.com  Leeann Stoltenberg      Non-Binary  EH10 4AN    Edinburgh     Falcon Avenue   Falcon Avenue, EH10 4AN Edinburgh
6070        Male     Mr    Pleasant    Kshlerin        pleasant.kshlerin69@leonard.org     Evelyne Bernier          Female   CM8 1SX       Witham      Holst Avenue        Holst Avenue, CM8 1SX Witham
6071  Non-Binary     Mx      Tilden     Dickens             tilden.dickens@alvarez.org        Savion Johns            Male   HA1 2RZ       Harrow  Rosslyn Crescent    Rosslyn Crescent, HA1 2RZ Harrow
6072        Male     Mr        Lena     Kilback                lena.kilback19@lowe.com      Rosanne Turner          Female  LN13 0AB       Alford  Christopher Road   Christopher Road, LN13 0AB Alford

[6073 rows × 11 columns]

This dataset contains PII relating to an individual and their partner in the columns gender, title, first_name, last_name, email, name_partner and gender_partner, and their joint address in the columns postcode, city, street and full_address.

Strict relationships can often exist between columns in a dataset which describe a single entity. For instance, in this dataset the column full_address is consistent with columns postcode, city and street. Similarly, the email column is consistent with the first_name and last_name columns. By using annotations, the SDK can be used to generate new PII data that is consistent across these same columns.
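The consistency described above can be checked directly with pandas. The following is a minimal sketch and not part of the SDK: the helper name check_row_consistency is hypothetical, and the rules it applies (each address component appears verbatim in full_address, and the part of the email before "@" contains the lower-cased first and last names) are assumptions inferred from the sample rows shown above.

```python
import pandas as pd

# Three rows copied from the sample output above.
sample = pd.DataFrame(
    {
        "first_name": ["Imanol", "Claudie", "Louetta"],
        "last_name": ["Kirlin", "Rodriguez", "O'Conner"],
        "email": [
            "imanol_kirlin@faulkner.com",
            "claudierodriguez91@haas.com",
            "louetta_o'conner@gallagher.com",
        ],
        "postcode": ["AB10 1AB", "IM1 1AG", "HG4 2QN"],
        "city": ["Aberdeen", "Isle of Man", "Ripon"],
        "street": ["Broad Street", "Circular Road", "Bishopton Lane"],
        "full_address": [
            "Broad Street, AB10 1AB Aberdeen",
            "Circular Road, IM1 1AG Isle of Man",
            "Bishopton Lane, HG4 2QN Ripon",
        ],
    }
)


def check_row_consistency(row: pd.Series) -> bool:
    """Hypothetical helper: True if the linked columns of one row agree."""
    # Assumed rule: every address component appears verbatim in full_address.
    address_ok = all(
        row[col] in row["full_address"] for col in ("street", "postcode", "city")
    )
    # Assumed rule: the part of the email before "@" mentions both names.
    local_part = row["email"].split("@")[0].lower()
    email_ok = (
        row["first_name"].lower() in local_part
        and row["last_name"].lower() in local_part
    )
    return address_ok and email_ok


consistent = sample.apply(check_row_consistency, axis=1)
print(consistent.all())
```

All three sample rows pass both checks. Because the address rule only tests for substrings, the same helper could also be applied to generated data in which full_address carries extra components.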

More complex relationships can also exist between columns. For example, postcodes/zipcodes are linked geographically to other components of an address such as the street name and city.

Using Annotations, the SDK can be used to generate synthetic PII data that maintains the column consistency and relationships of the original data.

This tutorial will demonstrate how to use the Person annotation and the Address annotation to generate consistent, meaningful and useful synthetic PII data.

Person Annotation

The Person annotation can be used to associate columns that relate to a single person and generate realistic and consistent synthetic PII data. A full description of the Person annotation can be found in the documentation.

In the dataset above, the columns first_name, last_name, title, gender and email are all related to a single person. The columns name_partner and gender_partner are related to another person (the partner of the first person).

We will use the PersonLabels class to label these columns and create two Person objects to use as annotations, one for the person (person_annot) and one for the partner (person_partner_annot).

from synthesized.config import PersonLabels
from synthesized.metadata.value import Person

person_annot = Person(name="person", labels=PersonLabels(firstname="first_name", lastname="last_name", title="title", email="email"))
person_partner_annot = Person(name="person_partner", labels=PersonLabels(fullname="name_partner", gender="gender_partner"))

These two annotations will be used in the Generating Data section to generate consistent and realistic synthetic PII data.

Address Annotation

Using the Address annotation, a user can generate realistic new postcodes and addresses that are consistent with the geographic information specified in the original data.

In the dataset above, the columns postcode, city, street and full_address are all related to a single address.

We will use the AddressLabels class to label these columns and create an Address object to use as an annotation. For this tutorial we will use the Address annotation to generate new synthetic addresses from the UK. We will achieve this by passing locales="en_GB" to the Address annotation.

from synthesized.config import AddressLabels
from synthesized.metadata.value import Address

address_annot = Address(
    name="address",
    labels=AddressLabels(
        postcode="postcode",
        street="street",
        city="city",
        full_address="full_address"
        ),
    locales="en_GB",
    )

This address annotation will be used in the Generating Data section to generate consistent and realistic synthetic PII data.

Generating Data

Using the annotations created in the previous sections, we can now generate synthetic PII data that is consistent with the original data. First we will extract the metadata and pass in the annotations we created earlier.

from synthesized import MetaExtractor

df_meta = MetaExtractor.extract(
    df=df,
    annotations=[person_annot, person_partner_annot, address_annot],
)

We will use the postcode learning feature to ensure that the generated addresses are geographically consistent with the original data. Setting postcode_level=1 keeps each generated address relatively close geographically to the original, with the county and city kept consistent with the original data. The remaining address fields will be generated (see the documentation for more information on postcode levels). These options are set through the HighDimConfig object that is passed in when creating the HighDimSynthesizer.

from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer

config = HighDimConfig(
    learn_postcodes=True,
    postcode_level=1,
)

synth = HighDimSynthesizer(df_meta, config=config)

With the synthesizer configured, we can now train the model on the original data

synth.learn(df)

and generate new synthetic data.

df_synth = synth.synthesize(10)
df_synth
       gender title first_name last_name                                 email       name_partner gender_partner  postcode             city            street                                      full_address
0  Non-Binary    Mx       Anne    Martin     anne.martin@chambers-robinson.com    Kathleen Norman         Female   SO177FS      Southampton    Foster circles   Flat 52q 4 Foster circles, SO177FS Southampton...
1        Male    Mr       Leon    Cooper               leoncooper39@murray.net      Maurice Parry           Male   WS126RP          Cannock       Marion isle    Flat 09T 28 Marion isle, WS126RP Cannock Suffolk
2        Male    Mr      Harry    Graham           harry_graham4@west-hall.com  Jonathan Matthews           Male   DL143HZ  Bishop Auckland  Phillips landing   Studio 90R 5 Phillips landing, DL143HZ Bishop ...
3        Male    Mr     Gareth      Holt               gareth.holt@edwards.org     Garry Williams           Male   PL242XP              Par   Brown stravenue   Flat 85T 825 Brown stravenue, PL242XP Par West...
4      Female    Ms      Linda      Kaur                  linda.kaur@evans.com          Guy Bowen           Male   DT118XZ  Blandford Forum       Declan club   Flat 5 10 Declan club, DT118XZ Blandford Forum...
5        Male    Mr       Paul      King                paul.king87@watson.biz     Sian Gallagher         Female    G111EJ          Glasgow     Brandon knoll     Flat 34f 058 Brandon knoll, G111EJ Glasgow Fife
6  Non-Binary    Mx     Brenda    Murphy            brendamurphy96@baldwin.org      Dennis Morris           Male    CO38HZ       Colchester       Owen common   Studio 7 49 Owen common, CO38HZ Colchester Bre...
7      Female    Ms   Samantha  McCarthy  samantha.mccarthy32@dixon-ingram.biz     Ronald Hawkins           Male    ZE19ZH         Shetland      Hazel skyway   Studio 5 96 Hazel skyway, ZE19ZH Shetland West...
8        Male    Mr       Ryan   Johnson     ryan-johnson55@parker-jenkins.com     Phillip Holden           Male   LL156PX           Ruthin     Parker valley   Studio 4 9 Parker valley, LL156PX Ruthin East ...
9        Male    Mr       Rhys     Smith                 rhyssmith22@north.org    Steven Saunders           Male    NG10EG       Nottingham     Jones passage   Studio 1 308 Jones passage, NG10EG Nottingham ...

[10 rows x 11 columns]
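One simple sanity check on generated data is to count how many synthetic people share a first and last name with someone in the original dataset. The sketch below is an illustration only and not an SDK feature: the helper count_shared_people is hypothetical, and the two small frames stand in for df and df_synth using values from the outputs shown in this tutorial.

```python
import pandas as pd

# Small stand-ins for `df` (original) and `df_synth` (synthetic).
original = pd.DataFrame(
    {
        "first_name": ["Imanol", "Claudie", "Ismael"],
        "last_name": ["Kirlin", "Rodriguez", "Zemlak"],
    }
)
synthetic = pd.DataFrame(
    {
        "first_name": ["Anne", "Leon", "Harry"],
        "last_name": ["Martin", "Cooper", "Graham"],
    }
)


def count_shared_people(original: pd.DataFrame, synthetic: pd.DataFrame) -> int:
    """Hypothetical helper: number of synthetic rows whose
    (first_name, last_name) pair also occurs in the original data."""
    keys = ["first_name", "last_name"]
    # Inner join keeps only synthetic rows with a matching original pair.
    shared = synthetic[keys].merge(
        original[keys].drop_duplicates(), on=keys, how="inner"
    )
    return len(shared)


print(count_shared_people(original, synthetic))
```

A non-zero count is not necessarily a problem, since common names can coincide by chance, but it identifies rows worth inspecting.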

Note that the fields gender, title, first_name, last_name, and email for each person entity are consistent, and that the fields postcode, city and street are consistent with the full_address. The first two parts of each postcode, along with the city field, are also consistent with the original data.
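To compare postcodes between the original and synthetic data, it helps to split each postcode into its leading components. The sketch below assumes that the "parts" referred to above correspond to the standard UK postcode area (the leading letters) and outward code (everything before the final three characters); how postcode_level maps onto these components is described in the documentation and is not assumed here. The helper split_uk_postcode is hypothetical and not part of the SDK.

```python
import re
from typing import Tuple


def split_uk_postcode(postcode: str) -> Tuple[str, str]:
    """Hypothetical helper: return (area, outward_code) for a UK postcode."""
    # Normalise so that "AB10 1AB" and "AB101AB" are treated alike.
    compact = postcode.replace(" ", "").upper()
    # The inward code is always the last three characters (digit + 2 letters).
    outward = compact[:-3]
    match = re.match(r"[A-Z]+", outward)
    if match is None:
        raise ValueError(f"not a UK-style postcode: {postcode!r}")
    return match.group(), outward


# Postcodes taken from the original and synthetic outputs shown above.
for code in ["AB10 1AB", "W9 2BT", "SO177FS", "G111EJ"]:
    print(code, split_uk_postcode(code))
```

Applying this helper to df["postcode"] and df_synth["postcode"] and comparing the resulting sets is one way to inspect how much of the postcode hierarchy the synthetic data shares with the original.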

The Person and Address annotations can also be used in many other scenarios to generate consistent and realistic synthetic data. More details on the different options available can be found in the documentation.