Annotations
Prerequisites
This tutorial assumes that you have already installed the Synthesized package and understand how to use the tabular synthesizer. If you are new to Synthesized, we recommend you start with the quickstart guide and/or the single table synthesis tutorial before working through this one.
Introduction
In this tutorial we will demonstrate how the SDK can be used to annotate linked columns in a dataset. By annotating columns, you can use the SDK to generate synthetic personally identifiable information (PII) that isn’t linked to any entity in the original dataset.
For more information on the techniques used in this tutorial, please refer to the documentation.
PII Dataset
In this tutorial we will use a dataset containing PII in the form of names and addresses.
import pandas as pd
df = pd.read_csv("pii_dataset.csv")
df
|      | gender     | title | first_name | last_name  | email                                 | name_partner       | gender_partner | postcode | city        | street           | full_address                       |
|------|------------|-------|------------|------------|---------------------------------------|--------------------|----------------|----------|-------------|------------------|------------------------------------|
| 0    | Male       | Mr    | Imanol     | Kirlin     | imanol_kirlin@faulkner.com            | Mila Weissnat      | Female         | AB10 1AB | Aberdeen    | Broad Street     | Broad Street, AB10 1AB Aberdeen    |
| 1    | Female     | Ms    | Claudie    | Rodriguez  | claudierodriguez91@haas.com           | Jorja Schuster     | Female         | IM1 1AG  | Isle of Man | Circular Road    | Circular Road, IM1 1AG Isle of Man |
| 2    | Male       | Mr    | Ismael     | Zemlak     | ismael-zemlak45@jackson-campbell.info | Jalon Glover       | Male           | TN34 2EZ | Hastings    | Baldslow Road    | Baldslow Road, TN34 2EZ Hastings   |
| 3    | Non-Binary | Mx    | Jesus      | Rutherford | jesus-rutherford61@nunez.com          | Martin Kihn        | Male           | LA22 9HA | Ambleside   | Kirkfield        | Kirkfield, LA22 9HA Ambleside      |
| 4    | Female     | Mrs   | Leslee     | Brown      | leslee_brown42@mendez.org             | Derrell Keebler    | Male           | W9 2BT   | London      | Shirland Road    | Shirland Road, W9 2BT London       |
| ...  | ...        | ...   | ...        | ...        | ...                                   | ...                | ...            | ...      | ...         | ...              | ...                                |
| 6068 | Female     | Ms    | Louetta    | O'Conner   | louetta_o'conner@gallagher.com        | Obed Terry         | Male           | HG4 2QN  | Ripon       | Bishopton Lane   | Bishopton Lane, HG4 2QN Ripon      |
| 6069 | Non-Binary | Mx    | Fleet      | Thompson   | fleet_thompson@thompson.com           | Leeann Stoltenberg | Non-Binary     | EH10 4AN | Edinburgh   | Falcon Avenue    | Falcon Avenue, EH10 4AN Edinburgh  |
| 6070 | Male       | Mr    | Pleasant   | Kshlerin   | pleasant.kshlerin69@leonard.org       | Evelyne Bernier    | Female         | CM8 1SX  | Witham      | Holst Avenue     | Holst Avenue, CM8 1SX Witham       |
| 6071 | Non-Binary | Mx    | Tilden     | Dickens    | tilden.dickens@alvarez.org            | Savion Johns       | Male           | HA1 2RZ  | Harrow      | Rosslyn Crescent | Rosslyn Crescent, HA1 2RZ Harrow   |
| 6072 | Male       | Mr    | Lena       | Kilback    | lena.kilback19@lowe.com               | Rosanne Turner     | Female         | LN13 0AB | Alford      | Christopher Road | Christopher Road, LN13 0AB Alford  |

[6073 rows × 11 columns]
This dataset contains PII relating to an individual (the columns gender, title, first_name, last_name and email), their partner (the columns name_partner and gender_partner), and their joint address (the columns postcode, city, street and full_address).
Strict relationships can often exist between columns in a dataset which describe a single entity. For instance, in this dataset the column full_address is consistent with the columns postcode, city and street. Similarly, the email column is consistent with the first_name and last_name columns.
With annotations, the SDK can generate new PII data that stays consistent across these same columns.
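The strict relationships described above can be checked directly with pandas before any synthesis is run. The sketch below is a minimal illustration using three rows copied from the dataset preview. The address pattern ("street, postcode city") and the email rule (the part before the @ contains both names in lower case) are assumptions inferred from those preview rows, and the helper names address_is_consistent and email_matches_name are hypothetical, not part of the SDK.

```python
import pandas as pd

# Three rows copied from the dataset preview (subset of columns).
sample = pd.DataFrame(
    {
        "first_name": ["Imanol", "Claudie", "Ismael"],
        "last_name": ["Kirlin", "Rodriguez", "Zemlak"],
        "email": [
            "imanol_kirlin@faulkner.com",
            "claudierodriguez91@haas.com",
            "ismael-zemlak45@jackson-campbell.info",
        ],
        "postcode": ["AB10 1AB", "IM1 1AG", "TN34 2EZ"],
        "city": ["Aberdeen", "Isle of Man", "Hastings"],
        "street": ["Broad Street", "Circular Road", "Baldslow Road"],
        "full_address": [
            "Broad Street, AB10 1AB Aberdeen",
            "Circular Road, IM1 1AG Isle of Man",
            "Baldslow Road, TN34 2EZ Hastings",
        ],
    }
)


def address_is_consistent(frame: pd.DataFrame) -> pd.Series:
    """True where full_address equals '<street>, <postcode> <city>' (assumed pattern)."""
    rebuilt = frame["street"] + ", " + frame["postcode"] + " " + frame["city"]
    return frame["full_address"] == rebuilt


def email_matches_name(frame: pd.DataFrame) -> pd.Series:
    """True where the text before '@' contains both names in lower case (assumed rule)."""
    local_part = frame["email"].str.split("@").str[0]
    return pd.Series(
        [
            first.lower() in local and last.lower() in local
            for first, last, local in zip(
                frame["first_name"], frame["last_name"], local_part
            )
        ],
        index=frame.index,
    )


print(address_is_consistent(sample).all())
print(email_matches_name(sample).all())
```

The same two helpers could be applied to the full DataFrame, or later to the synthetic output, to see whether the relationships hold there as well; synthetic addresses may follow a different pattern, so the address rule would likely need adjusting.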
More complex relationships can also exist between columns. For example, postcodes/zipcodes are linked geographically to other components of an address such as the street name and city.
Using annotations, the SDK can generate synthetic PII data that maintains both the column consistency and the relationships found in the original data.
This tutorial will demonstrate how to use the Person annotation and the Address annotation to generate consistent, meaningful and useful synthetic PII data.
Person Annotation
The Person annotation can be used to associate columns that relate to a single person and generate realistic and consistent synthetic PII data.
A full description of the Person annotation can be found in the documentation.
In the dataset above, the columns first_name, last_name, title, gender and email are all related to a single person.
The columns name_partner and gender_partner are related to another person (the partner of the first person).
We will use the PersonLabels class to label these columns and create two Person objects to use as annotations, one for the person (person_annot) and one for the partner (person_partner_annot).
from synthesized.config import PersonLabels
from synthesized.metadata.value import Person
# Primary person: label every column listed above, including gender.
person_annot = Person(name="person", labels=PersonLabels(gender="gender", title="title", firstname="first_name", lastname="last_name", email="email"))
# Partner: only a full name and a gender column are available.
person_partner_annot = Person(name="person_partner", labels=PersonLabels(fullname="name_partner", gender="gender_partner"))
These two annotations will be used in the Generating Data section to generate consistent and realistic synthetic PII data.
Address Annotation
Using the Address annotation, a user can generate realistic new postcodes and addresses that are consistent with the geographic information specified in the original data.
In the dataset above, the columns postcode, city, street and full_address are all related to a single address.
We will use the AddressLabels class to label these columns and create an Address object to use as an annotation.
For this tutorial we will use the Address annotation to generate new synthetic addresses from the UK. We will achieve this by passing locales="en_GB" to the Address annotation.
from synthesized.config import AddressLabels
from synthesized.metadata.value import Address
address_annot = Address(
    name="address",
    labels=AddressLabels(
        postcode="postcode",
        street="street",
        city="city",
        full_address="full_address",
    ),
    locales="en_GB",
)
This address annotation will be used in the Generating Data section to generate consistent and realistic synthetic PII data.
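Because the annotations refer to columns by name, a mistyped label is easy to miss. The sketch below is one possible pre-flight check written in plain pandas: it confirms that every column the annotations are meant to cover exists in the DataFrame. The helper missing_columns is hypothetical and not part of the Synthesized SDK, and the empty DataFrame stands in for the real df (only the column names from the preview are used).

```python
import pandas as pd

# Column names from the dataset preview; an empty frame is enough for a name check.
df_columns = pd.DataFrame(
    columns=[
        "gender", "title", "first_name", "last_name", "email",
        "name_partner", "gender_partner",
        "postcode", "city", "street", "full_address",
    ]
)

# Columns each annotation in this tutorial is meant to cover,
# keyed by the annotation's name.
annotated = {
    "person": ["gender", "title", "first_name", "last_name", "email"],
    "person_partner": ["name_partner", "gender_partner"],
    "address": ["postcode", "city", "street", "full_address"],
}


def missing_columns(frame: pd.DataFrame, groups: dict) -> dict:
    """Return {annotation name: [columns absent from frame]} for groups with gaps."""
    report = {}
    for name, columns in groups.items():
        absent = [column for column in columns if column not in frame.columns]
        if absent:
            report[name] = absent
    return report


print(missing_columns(df_columns, annotated))
```

An empty dictionary means every labelled column was found. In practice the real df would be passed in place of df_columns.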
Generating Data
Using the annotations created in the previous sections, we can now generate synthetic PII data that is consistent with the original data. First we will extract the metadata and pass in the annotations we created earlier.
from synthesized import MetaExtractor
df_meta = MetaExtractor.extract(
    df=df,
    annotations=[
        person_annot,
        person_partner_annot,
        address_annot,
    ],
)
We will use the postcode learning feature to ensure that the generated addresses are geographically consistent with the original data. Setting postcode_level=1 keeps each generated address relatively close geographically to the original, and keeps the county and city consistent with the original data. The remaining address fields are generated (see the documentation for more information on postcode levels).
These options are configured using the HighDimConfig object when creating the HighDimSynthesizer.
from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    learn_postcodes=True,
    postcode_level=1,
)
synth = HighDimSynthesizer(df_meta, config=config)
With the synthesizer configured, we can now train the model on the original data:
synth.learn(df)
We can then generate new synthetic data:
df_synth = synth.synthesize(10)
df_synth
|   | gender     | title | first_name | last_name | email                                | name_partner      | gender_partner | postcode | city            | street           | full_address                                       |
|---|------------|-------|------------|-----------|--------------------------------------|-------------------|----------------|----------|-----------------|------------------|----------------------------------------------------|
| 0 | Non-Binary | Mx    | Anne       | Martin    | anne.martin@chambers-robinson.com    | Kathleen Norman   | Female         | SO177FS  | Southampton     | Foster circles   | Flat 52q 4 Foster circles, SO177FS Southampton...  |
| 1 | Male       | Mr    | Leon       | Cooper    | leoncooper39@murray.net              | Maurice Parry     | Male           | WS126RP  | Cannock         | Marion isle      | Flat 09T 28 Marion isle, WS126RP Cannock Suffolk   |
| 2 | Male       | Mr    | Harry      | Graham    | harry_graham4@west-hall.com          | Jonathan Matthews | Male           | DL143HZ  | Bishop Auckland | Phillips landing | Studio 90R 5 Phillips landing, DL143HZ Bishop ...  |
| 3 | Male       | Mr    | Gareth     | Holt      | gareth.holt@edwards.org              | Garry Williams    | Male           | PL242XP  | Par             | Brown stravenue  | Flat 85T 825 Brown stravenue, PL242XP Par West...  |
| 4 | Female     | Ms    | Linda      | Kaur      | linda.kaur@evans.com                 | Guy Bowen         | Male           | DT118XZ  | Blandford Forum | Declan club      | Flat 5 10 Declan club, DT118XZ Blandford Forum...  |
| 5 | Male       | Mr    | Paul       | King      | paul.king87@watson.biz               | Sian Gallagher    | Female         | G111EJ   | Glasgow         | Brandon knoll    | Flat 34f 058 Brandon knoll, G111EJ Glasgow Fife    |
| 6 | Non-Binary | Mx    | Brenda     | Murphy    | brendamurphy96@baldwin.org           | Dennis Morris     | Male           | CO38HZ   | Colchester      | Owen common      | Studio 7 49 Owen common, CO38HZ Colchester Bre...  |
| 7 | Female     | Ms    | Samantha   | McCarthy  | samantha.mccarthy32@dixon-ingram.biz | Ronald Hawkins    | Male           | ZE19ZH   | Shetland        | Hazel skyway     | Studio 5 96 Hazel skyway, ZE19ZH Shetland West...  |
| 8 | Male       | Mr    | Ryan       | Johnson   | ryan-johnson55@parker-jenkins.com    | Phillip Holden    | Male           | LL156PX  | Ruthin          | Parker valley    | Studio 4 9 Parker valley, LL156PX Ruthin East ...  |
| 9 | Male       | Mr    | Rhys       | Smith     | rhyssmith22@north.org                | Steven Saunders   | Male           | NG10EG   | Nottingham      | Jones passage    | Studio 1 308 Jones passage, NG10EG Nottingham ...  |

[10 rows x 11 columns]
Note that the fields gender, title, first_name, last_name and email for each person entity are consistent, and that the fields postcode, city and street are consistent with the full_address. The first two parts of the postcode are also consistent with the original data, along with the city field.
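The observations above can be spot-checked in code. The sketch below uses the two rows of the synthetic preview whose full_address is not truncated, and checks that street, postcode and city each appear inside full_address. It also splits a UK postcode into its area letters and district, on the assumption that these are what "the first two parts" refers to; the tutorial does not define the term, so treat that reading as a guess. Both helper names are hypothetical and not part of the SDK.

```python
import re
from typing import Tuple

import pandas as pd

# The two untruncated rows copied from the synthetic preview (subset of columns).
synth_sample = pd.DataFrame(
    {
        "postcode": ["WS126RP", "G111EJ"],
        "city": ["Cannock", "Glasgow"],
        "street": ["Marion isle", "Brandon knoll"],
        "full_address": [
            "Flat 09T 28 Marion isle, WS126RP Cannock Suffolk",
            "Flat 34f 058 Brandon knoll, G111EJ Glasgow Fife",
        ],
    }
)


def parts_in_full_address(frame: pd.DataFrame) -> pd.Series:
    """True where street, postcode and city all occur inside full_address."""
    return pd.Series(
        [
            all(row[col] in row["full_address"] for col in ("street", "postcode", "city"))
            for _, row in frame.iterrows()
        ],
        index=frame.index,
    )


# A UK postcode ends with a three-character inward code. What precedes it is the
# outward code: one or two area letters followed by the district.
_OUTWARD = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?)$")


def split_outward(postcode: str) -> Tuple[str, str]:
    """Return (area, district) for a UK postcode, with or without a space."""
    compact = re.sub(r"\s+", "", postcode).upper()
    match = _OUTWARD.match(compact[:-3])
    if match is None:
        raise ValueError(f"not a recognisable UK postcode: {postcode!r}")
    return match.group(1), match.group(2)


print(parts_in_full_address(synth_sample).all())
for code in ["AB10 1AB", "W9 2BT", "SO177FS", "NG10EG"]:
    print(code, split_outward(code))
```

To compare the two datasets, the set of (area, district) pairs from df_synth["postcode"] could be tested against the set from df["postcode"]. That comparison needs the full original data, which is why it is not run here.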
The Person and Address annotations can also be used in many other scenarios to generate consistent and realistic synthetic data. More details on the different options available can be found in the documentation.