Annotations
Prerequisites
This tutorial assumes that you have already installed the Synthesized package and understand how to use the tabular synthesizer. If you are new to Synthesized, we recommend you start with the quickstart guide and/or the single table synthesis tutorial before jumping into this tutorial.
Introduction
In this tutorial we will demonstrate how the SDK can be used to annotate linked columns in a dataset. This process is required in order to generate production-like data containing fake PII not linked to any entity in the original dataset.
For more information on the techniques used in this tutorial, please refer to the documentation.
PII Dataset
In this tutorial we will utilise a dataset containing personally identifiable information (PII) in the form of names and addresses.
import pandas as pd
df = pd.read_csv("pii_dataset.csv")
df
| | gender | title | first_name | last_name | email | name_partner | gender_partner | postcode | city | street | full_address |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Mr | Imanol | Kirlin | imanol_kirlin@faulkner.com | Mila Weissnat | Female | AB10 1AB | Aberdeen | Broad Street | Broad Street, AB10 1AB Aberdeen |
| 1 | Female | Ms | Claudie | Rodriguez | claudierodriguez91@haas.com | Jorja Schuster | Female | IM1 1AG | Isle of Man | Circular Road | Circular Road, IM1 1AG Isle of Man |
| 2 | Male | Mr | Ismael | Zemlak | ismael-zemlak45@jackson-campbell.info | Jalon Glover | Male | TN34 2EZ | Hastings | Baldslow Road | Baldslow Road, TN34 2EZ Hastings |
| 3 | Non-Binary | Mx | Jesus | Rutherford | jesus-rutherford61@nunez.com | Martin Kihn | Male | LA22 9HA | Ambleside | Kirkfield | Kirkfield, LA22 9HA Ambleside |
| 4 | Female | Mrs | Leslee | Brown | leslee_brown42@mendez.org | Derrell Keebler | Male | W9 2BT | London | Shirland Road | Shirland Road, W9 2BT London |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6068 | Female | Ms | Louetta | O'Conner | louetta_o'conner@gallagher.com | Obed Terry | Male | HG4 2QN | Ripon | Bishopton Lane | Bishopton Lane, HG4 2QN Ripon |
| 6069 | Non-Binary | Mx | Fleet | Thompson | fleet_thompson@thompson.com | Leeann Stoltenberg | Non-Binary | EH10 4AN | Edinburgh | Falcon Avenue | Falcon Avenue, EH10 4AN Edinburgh |
| 6070 | Male | Mr | Pleasant | Kshlerin | pleasant.kshlerin69@leonard.org | Evelyne Bernier | Female | CM8 1SX | Witham | Holst Avenue | Holst Avenue, CM8 1SX Witham |
| 6071 | Non-Binary | Mx | Tilden | Dickens | tilden.dickens@alvarez.org | Savion Johns | Male | HA1 2RZ | Harrow | Rosslyn Crescent | Rosslyn Crescent, HA1 2RZ Harrow |
| 6072 | Male | Mr | Lena | Kilback | lena.kilback19@lowe.com | Rosanne Turner | Female | LN13 0AB | Alford | Christopher Road | Christopher Road, LN13 0AB Alford |

[6073 rows × 11 columns]
This dataset contains personal information about an individual, their partner and their address.
As discussed in the documentation, strict relationships often exist between features in a dataset that describe a single entity. For instance, in this dataset the full address must be consistent with the columns that contain granular attributes such as postcode, city and street. Internal consistency of linked data is often required for downstream data processing tasks.
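The consistency requirement above can be made concrete with a small check. The sketch below is not part of the SDK: `is_address_consistent` is a hypothetical helper, and it assumes the "street, postcode city" layout seen in the sample rows of this dataset. It works on plain dictionaries, such as those returned by `df.to_dict("records")`.

```python
def is_address_consistent(row):
    """Return True if full_address ends with "<street>, <postcode> <city>"."""
    expected_tail = f"{row['street']}, {row['postcode']} {row['city']}"
    return row["full_address"].endswith(expected_tail)


rows = [
    {
        "postcode": "AB10 1AB",
        "city": "Aberdeen",
        "street": "Broad Street",
        "full_address": "Broad Street, AB10 1AB Aberdeen",
    },
    {
        # Deliberately inconsistent: the full address belongs to another row.
        "postcode": "IM1 1AG",
        "city": "Isle of Man",
        "street": "Circular Road",
        "full_address": "Shirland Road, W9 2BT London",
    },
]

consistency = [is_address_consistent(row) for row in rows]
print(consistency)  # [True, False]
```

Using `endswith` rather than strict equality means that a leading house number in the full address does not cause a false alarm.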
Similarly, real postcodes/zipcodes are often linked to geographic areas via their constituent pieces. Using the UK postcode "SW19 5AE" as an example, the structure can be broken down into several components:
- "SW" refers to the postcode area, in this case south-west London
- "SW19" refers to the district in the area, which in this case covers Wimbledon and Merton
- "SW19 5AE" refers to a specific set of addresses in the area
We will refer to these levels of geographic specificity as postcode levels.
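The three levels described above can be sketched as a small helper that splits a UK postcode into its area, district and full form. This is an illustration only: `postcode_levels` is a hypothetical function, not an SDK call, and the regular expression is a simplified pattern rather than a complete UK postcode validator.

```python
import re

# Simplified pattern: area letters, district (digit plus optional character),
# whitespace, then the inward code (digit plus two letters).
_POSTCODE = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?)\s+(\d[A-Z]{2})$")


def postcode_levels(postcode):
    """Return (area, district, full postcode) for a UK-style postcode."""
    match = _POSTCODE.match(postcode.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised postcode: {postcode!r}")
    area, district, inward = match.groups()
    outward = area + district
    return (area, outward, f"{outward} {inward}")


levels = postcode_levels("SW19 5AE")
print(levels)  # ('SW', 'SW19', 'SW19 5AE')
```

Each element of the returned tuple is one step more specific than the previous one, mirroring the list above.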
Using the AddressModel, a user can generate realistic new postcodes and addresses that are consistent with the geographic information specified in the original data.
Entity Annotation
The SDK can be used to generate completely new PII data that is representative of the original data via the use of Entity Annotation. The SDK does not automatically detect these entities in the data; they must be specified by the user, and this tutorial illustrates how to do this. In order to label a set of columns as pertaining to a specific entity, the AddressLabels and PersonLabels classes will be used in this example. Note that there are many more entity types available, a full list of which can be found in the documentation.
Consider that we have a dataset with columns relating to a person, their partner and their address. The columns contain the first name, last name, title, gender and email address for the person. For their partner the columns contain their full name and gender. The address columns contain the postcode, city, street and full address.
We will use the PersonLabels and AddressLabels classes, together with the Person and Address annotations, to label these columns as such:
from synthesized import MetaExtractor
from synthesized.config import AddressLabels, PersonLabels
from synthesized.metadata.value import Address, Person
# Map each attribute of an entity to the column in df that holds it
address = Address(name="address", labels=AddressLabels(postcode="postcode", street="street", city="city", full_address="full_address"))
person = Person(name="person", labels=PersonLabels(firstname="first_name", lastname="last_name", title="title", gender="gender", email="email"))
# The partner is a second person entity, described by a full name instead of separate name parts
person_partner = Person(name="person_partner", labels=PersonLabels(fullname="name_partner", gender="gender_partner"))
# Pass the annotations to the extractor so the linked columns are grouped per entity
df_meta = MetaExtractor.extract(df, annotations=[address, person, person_partner])
print(list(df_meta.children))
[<Nominal[object]: Address(name=address)>, <Nominal[object]: Person(name=person)>, <Nominal[object]: Person(name=person_partner)>]
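Because the entities are specified by hand, a mistyped column name in an annotation is easy to introduce. The sketch below shows one way to catch this before extraction. It is an assumption-laden illustration: `missing_columns` is a hypothetical helper, and the plain dictionaries simply mirror the keyword arguments passed to AddressLabels and PersonLabels above; how the SDK itself reacts to an unknown column is not covered here.

```python
def missing_columns(available, *label_maps):
    """Return the annotated column names that are absent from `available`."""
    known = set(available)
    referenced = {column for labels in label_maps for column in labels.values()}
    return sorted(referenced - known)


# Column names of the PII dataset; list(df.columns) could be used instead.
columns = [
    "gender", "title", "first_name", "last_name", "email", "name_partner",
    "gender_partner", "postcode", "city", "street", "full_address",
]
address_labels = {"postcode": "postcode", "street": "street",
                  "city": "city", "full_address": "full_address"}
person_labels = {"firstname": "first_name", "lastname": "last_name",
                 "title": "title", "gender": "gender", "email": "email"}
partner_labels = {"fullname": "name_partner", "gender": "gender_partner"}

missing = missing_columns(columns, address_labels, person_labels, partner_labels)
print(missing)  # []

# A typo in a label is reported by name.
typo = missing_columns(columns, {"postcode": "post_code"})
print(typo)  # ['post_code']
```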
We use the metadata and a HighDimConfig object to create a HighDimSynthesizer object, as usual. In the HighDimConfig object we will specify the locale as "en_GB" via address_locale and set the boolean flag sample_addresses to False. This flag controls whether addresses are randomly sampled from the original data (sample_addresses=True) or entirely new ones are generated (sample_addresses=False). By default this flag is set to False, which is required by most compliance tasks; it is written out here explicitly for clarity.
from synthesized.config import HighDimConfig
from synthesized import HighDimSynthesizer
config = HighDimConfig(
    sample_addresses=False,
    address_locale="en_GB",
)
synth = HighDimSynthesizer(df_meta, config=config)
synth._df_model.children
[AddressModel(meta=<Nominal[object]: Address(name=address)>), PersonModel(meta=<Nominal[object]: Person(name=person)>), PersonModel(meta=<Nominal[object]: Person(name=person_partner)>)]
We then train the synthesizer on the original data:
synth.learn(df)
By default, the AddressModel and PersonModel will generate brand new data:
df_synth = synth.synthesize(100)
df_synth
| | gender | title | first_name | last_name | email | name_partner | gender_partner | postcode | city | street | full_address |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | Mrs | Amanda | Davies | amandadavies@hopkins-roberts.com | Josh Rogers | Male | N3B 4XJ | Andersonshire | Barton manor | 3 Barton manor, N3B 4XJ Andersonshire |
| 1 | Female | Mrs | Charlene | Smith | charlene-smith@simpson-mitchell.biz | Abdul Buckley | Non-Binary | WA8 1YH | Robinfurt | Rees islands | 927 Rees islands, WA8 1YH Robinfurt |
| 2 | Male | Mr | Bryan | Thompson | bryanthompson@moss.com | Christine Scott | Female | BH44 2JU | East Sam | Davies mission | 4 Davies mission, BH44 2JU East Sam |
| 3 | Male | Mr | Robert | Sutton | robert_sutton79@evans-baker.com | Robert Jones | Male | B3 6XD | South Marianmouth | Jones coves | 28 Jones coves, B3 6XD South Marianmouth |
| 4 | Female | Ms | Jasmine | North | jasminenorth@jackson.com | Gail Rogers | Non-Binary | B5A 0YD | Shannonmouth | Green islands | 83 Green islands, B5A 0YD Shannonmout |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
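Since the goal is fake PII that is not linked to any entity in the original dataset, it can be useful to confirm that no identifying value was carried over. The following is a minimal sketch of such a check, not an SDK feature: `leaked_values` is a hypothetical helper operating on lists of dictionaries (for example `df.to_dict("records")`), and the rows are copied from the tables in this tutorial.

```python
def leaked_values(original_rows, synthetic_rows, columns):
    """Map each column to the synthetic values that also occur in the original."""
    leaks = {}
    for column in columns:
        original = {row[column] for row in original_rows}
        shared = {row[column] for row in synthetic_rows} & original
        if shared:
            leaks[column] = sorted(shared)
    return leaks


original = [
    {"email": "imanol_kirlin@faulkner.com",
     "full_address": "Broad Street, AB10 1AB Aberdeen"},
    {"email": "claudierodriguez91@haas.com",
     "full_address": "Circular Road, IM1 1AG Isle of Man"},
]
synthetic = [
    {"email": "amandadavies@hopkins-roberts.com",
     "full_address": "3 Barton manor, N3B 4XJ Andersonshire"},
    {"email": "charlene-smith@simpson-mitchell.biz",
     "full_address": "927 Rees islands, WA8 1YH Robinfurt"},
]

leaks = leaked_values(original, synthetic, ["email", "full_address"])
print(leaks)  # {}

# If an original row slipped into the output, it would be reported per column.
copied = leaked_values(original, synthetic + [original[0]], ["email", "full_address"])
print(copied)
```

The check is most meaningful for high-cardinality columns such as email and full_address; common first names or titles can legitimately appear in both datasets.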
Note that the email address for each entity is consistent with the first name and last name of the individual, and that the fields postcode, city and street are consistent with the full_address. However, this address data is not realistic in the sense that the postcodes are not matched to geographic data, like the city, as described in the introduction.
In the following section we will demonstrate how we can configure the AddressModel to generate postcodes consistent with the geographic constraints present in the original data.
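The consistency between names and email addresses noted above can be spot-checked with a few lines of Python. This is a rough sketch under a stated assumption: it expects the local part of the email to contain the first name followed by the last name, which matches the sample rows shown in this tutorial but is not a documented guarantee of the SDK. `email_matches_name` is a hypothetical helper.

```python
import re


def _letters(text):
    """Lower-case the text and drop everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def email_matches_name(first_name, last_name, email):
    """Return True if the email's local part contains first name then last name."""
    local_part = email.split("@", 1)[0]
    return _letters(first_name) + _letters(last_name) in _letters(local_part)


checks = [
    email_matches_name("Amanda", "Davies", "amandadavies@hopkins-roberts.com"),
    email_matches_name("Robert", "Sutton", "robert_sutton79@evans-baker.com"),
    email_matches_name("Bryan", "Thompson", "jasminenorth@jackson.com"),
]
print(checks)  # [True, True, False]
```

Separators, digits and apostrophes are stripped before comparing, so "robert_sutton79" and "louetta_o'conner" are both treated as matches for their respective names.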
Generating fuzzed postcodes
Fuzzing postcodes means modifying the postcode of the address while maintaining some level of geographical locality. Users can configure the HighDimSynthesizer object using a HighDimConfig to learn portions of postcodes, and then conditionally synthesize data to ensure that geographic consistency is maintained across the generated data.
from synthesized.config import HighDimConfig
config = HighDimConfig(
    learn_postcodes=True,
    address_locale="en_GB",
    postcode_level=0,
)
The learn_postcodes flag must be set to True to ensure that the SDK generates postcodes based on the original data, rather than generating entirely new examples. Currently, address_locale must also be specified. The postcode_level argument can be set to 0, 1 or 2, with the values matching the three postcode levels described in the introduction.
By specifying postcode_level=0 in the HighDimConfig object above, we are configuring the SDK to produce postcodes that match only the first level of those seen in the original data, with new values for the final portion of the postcode.
We can observe this behaviour by training a HighDimSynthesizer:
synth = HighDimSynthesizer(df_meta, config=config)
synth.learn(df)
and generating new data:
df_synth = synth.synthesize(100)
df_synth
| | gender | title | first_name | last_name | email | name_partner | gender_partner | postcode | city | street | full_address |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | Ms | Clare | Hill | clare.hill93@davies.com | Dawn Riley | Female | KT32 5YB | Port Daniel | Deborah mission | 51 Deborah mission, KT32 5YB Port Daniel |
| 1 | Female | Mrs | Rachael | Dean | rachael-dean@shepherd.com | Henry Sims | Male | M3 3EN | Antonychester | Morgan lock | 5 Morgan lock, M3 3EN Antonychester |
| 2 | Male | Mr | Mark | Martin | mark.martin@smith.com | Martyn Smith | Male | PR62 3SH | Rowleyberg | Shaw trail | 807 Shaw trail, PR62 3SH Rowleyberg |
| 3 | Female | Ms | Jemma | Warner | jemma-warner@peters-howard.net | Jacqueline Singh | Female | RH2J 5BZ | South June | Barry cliffs | 45 Barry cliffs, RH2J 5BZ South June |
| 4 | Female | Mrs | Kelly | Mahmood | kelly-mahmood@morgan-cunningham.info | Diane Mills | Female | DT3E 9SQ | Dayview | June pine | 63 June pine, DT3E 9SQ Dayview |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Male | Mr | Harry | Thomas | harry.thomas61@naylor.com | Charlotte Lane | Female | DD8 0PW | Dianaport | Mohamed freeway | 987 Mohamed freeway, DD8 0PW Dianaport |
| 96 | Non-Binary | Mx | Stephanie | Hill | stephanie-hill@lawson.com | Jasmine Cooper | Non-Binary | DN2J 9PN | Lake Rosston | Robinson field | 0 Robinson field, DN2J 9PN Lake Rosston |
| 97 | Male | Mr | Leslie | Parker | leslie.parker@evans-jackson.com | Andrew Harrison | Male | NP0 1WX | Leahfurt | Connor stravenue | 56 Connor stravenue, NP0 1WX Leahfurt |
| 98 | Female | Ms | Rosemary | Vincent | rosemary_vincent@newton.com | Conor Lewis | Male | TA5S 2JY | Port Thomasside | Christine keys | 1 Christine keys, TA5S 2JY Port Thomasside |
| 99 | Male | Mr | Scott | Roberts | scott.roberts53@willis.org | Joanne Holt | Female | L9N 2QB | East Emilymouth | Rebecca neck | 494 Rebecca neck, L9N 2QB East Emilymouth |

100 rows × 11 columns
However, brand new cities have also been generated, so the postcodes are no longer consistent with the geographic data in the original dataset. To ensure consistency between the generated data and the original, a user can conditionally generate address data:
# Using the original data to conditionally generate new data
df[["postcode", "city"]][0:4]
| | postcode | city |
|---|---|---|
| 0 | AB10 1AB | Aberdeen |
| 1 | IM1 1AG | Isle of Man |
| 2 | TN34 2EZ | Hastings |
| 3 | LA22 9HA | Ambleside |
df_address = synth._df_model.children[0].sample(conditions=df[["postcode", "city"]])
synth._df_model.children[0].meta.convert_df_for_children(df_address)
df_address
| | postcode | city | street | full_address |
|---|---|---|---|---|
| 0 | LA22 8FE | Ambleside | Patel river | 978 Patel river, LA22 8FE Ambleside |
| 1 | AB10 4ZU | Aberdeen | Atkins ridge | 79 Atkins ridge, AB10 4ZU Aberdeen |
| 2 | IM1 6BE | Isle of Man | Brian coves | 45 Brian coves, IM1 6BE Isle of Man |
| 3 | TN34 6SL | Hastings | Francesca forks | 3 Francesca forks, TN34 6SL Hastings |
The generated postcodes share their starting portion with those of the original dataset; the later portions, however, are completely new and synthetic.
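To quantify how much of a postcode has been preserved, an original and a generated postcode can be compared level by level. The sketch below is a hypothetical helper, not an SDK function. It numbers the levels 0 (area), 1 (district) and 2 (full postcode) following the three postcode levels from the introduction; whether these indices correspond exactly to the values of postcode_level is an assumption. It also assumes postcodes of the form "<outward> <inward>" that start with a letter.

```python
import re


def postcode_prefixes(postcode):
    """Return [area, district, full postcode] for a UK-style postcode."""
    outward, inward = postcode.strip().upper().split()
    area = re.match(r"[A-Z]+", outward).group()
    return [area, outward, f"{outward} {inward}"]


def deepest_shared_level(original, synthetic):
    """Return 0, 1 or 2 for the deepest matching level, or -1 if the areas differ."""
    level = -1
    for left, right in zip(postcode_prefixes(original), postcode_prefixes(synthetic)):
        if left != right:
            break
        level += 1
    return level


pairs = [
    ("AB10 1AB", "AB10 4ZU"),  # same district, new final portion
    ("LA22 9HA", "LA22 8FE"),  # same district, new final portion
    ("TN34 2EZ", "KT32 5YB"),  # different area altogether
]
shared = [deepest_shared_level(original, synthetic) for original, synthetic in pairs]
print(shared)  # [1, 1, -1]
```

A result of 2 would mean the full postcode was reproduced unchanged, which is usually what fuzzing is meant to avoid.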
Extending to different countries
The example above demonstrated how the SDK can be used to generate privacy-compliant data for individuals with UK addresses.
However, both the PersonModel and AddressModel are extensible and can be used across multiple locales.
For example, the SDK can be used to generate brand new US addresses by specifying the correct locale in the HighDimConfig object:
config = HighDimConfig(
    address_locale="en_US",
)
synth = HighDimSynthesizer(df_meta, config=config)
synth.learn(df)
synth.synthesize(100)
The list of currently supported locales can be found by calling the provided utility method:
from synthesized.model.models.address import all_supported_locales
all_supported_locales()
['az_AZ', 'cs_CZ', 'da_DK', 'de_AT', 'de_CH', 'de_DE', ... ]
Note that if the dataset contains postcodes, only a subset of these locales is currently supported. The list of locales that support postcodes can be found by calling the provided utility method:
from synthesized.model.models.address import all_supported_postcode_locales
all_supported_postcode_locales()
['az_AZ', 'cs_CZ', 'da_DK', 'de_AT', 'de_CH', 'de_DE', ... ]
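The two lists can be combined into a simple guard that is run before building a configuration. The following is a sketch only: `check_locale` is a hypothetical helper, and the two short lists are made-up placeholders standing in for the return values of all_supported_locales() and all_supported_postcode_locales(); they are not the real contents of those lists.

```python
def check_locale(locale, supported, postcode_supported, has_postcodes):
    """Return `locale` if it is usable for the dataset, otherwise raise ValueError."""
    candidates = postcode_supported if has_postcodes else supported
    if locale not in candidates:
        kind = "postcode-enabled " if has_postcodes else ""
        raise ValueError(f"{locale!r} is not a supported {kind}locale")
    return locale


# Placeholder values for illustration; in practice these would be the lists
# returned by the two utility methods shown above.
supported = ["cs_CZ", "de_DE", "en_GB", "en_US"]
postcode_supported = ["de_DE", "en_GB"]

chosen = check_locale("en_GB", supported, postcode_supported, has_postcodes=True)
print(chosen)  # en_GB

try:
    check_locale("en_US", supported, postcode_supported, has_postcodes=True)
except ValueError as error:
    print(error)
```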