Entity Annotation
Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently recognize fields that contain PII automatically, so by default it will reproduce the original data from such fields.
Tabular datasets often contain fields that, when combined, describe a specific entity, such as a unique person or postal address. For example, consider the dataset below, which contains customer PII:
title | first_name | last_name | gender | amount
---|---|---|---|---
Mr | John | Doe | Male | 101.2
Mrs | Jane | Smith | Female | 28.2
Dr | Albert | Taylor | Male | 98.1
Ms | Alice | Smart | Female | 150.3
The combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person in this data, and there are strict relationships between these attributes. E.g. when "title" is "Mrs" or "Ms", "first_name" will most likely contain a name given to females.
When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
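To make the difference concrete, the following minimal sketch (plain Python, not the Synthesized API) contrasts drawing each column independently with drawing the linked fields as one entity, using the four example customers above. The helper names `sample_independently`, `sample_as_entity` and `is_consistent` are hypothetical. Entity sampling here simply reuses whole rows; Synthesized instead learns to generate new entities, but the consistency property being illustrated is the same.

```python
import random

# The four example customers from the table above.
ROWS = [
    {"title": "Mr", "first_name": "John", "last_name": "Doe", "gender": "Male"},
    {"title": "Mrs", "first_name": "Jane", "last_name": "Smith", "gender": "Female"},
    {"title": "Dr", "first_name": "Albert", "last_name": "Taylor", "gender": "Male"},
    {"title": "Ms", "first_name": "Alice", "last_name": "Smart", "gender": "Female"},
]


def sample_independently(rows, rng):
    """Draw every column on its own, ignoring links between columns."""
    return {col: rng.choice([r[col] for r in rows]) for col in rows[0]}


def sample_as_entity(rows, rng):
    """Draw the linked fields together so they stay mutually consistent."""
    return dict(rng.choice(rows))


def is_consistent(person):
    """Check the title/gender relationship described above."""
    if person["title"] in {"Mrs", "Ms"}:
        return person["gender"] == "Female"
    if person["title"] == "Mr":
        return person["gender"] == "Male"
    return True  # e.g. "Dr" does not imply a gender


rng = random.Random(0)
independent = [sample_independently(ROWS, rng) for _ in range(1000)]
entities = [sample_as_entity(ROWS, rng) for _ in range(1000)]

broken_independent = sum(not is_consistent(p) for p in independent)
broken_entities = sum(not is_consistent(p) for p in entities)
print(f"inconsistent rows, independent sampling: {broken_independent} of 1000")
print(f"inconsistent rows, entity sampling: {broken_entities} of 1000")
```

Independent sampling produces combinations such as a "Mr" labelled "Female", while sampling the entity as a whole never does.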
Currently, Synthesized can handle person, address, bank and generic formatted string entities.
Person
Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation.
from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.metadata.value import Person
from synthesized.config import PersonLabels
The columns of a dataset that relate to the attributes of a person are specified using PersonLabels. This is used to define the Person values that Synthesized can then generate.
person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[person])
It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names and then passed together in the list of annotations.
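As a rough sketch of how two such annotations might be laid out, the snippet below builds one label mapping per person in plain Python and checks that no column is claimed by more than one person. The column prefixes (`customer_`, `agent_`) and the helper `person_labels` are assumptions made for illustration; the keys mirror the PersonLabels arguments used above. Keying the mappings by annotation name keeps the names unique by construction.

```python
# Keys follow the PersonLabels arguments shown above; the prefixed column
# names are hypothetical and stand in for a dataset describing two people.
LABEL_FIELDS = ("gender", "title", "firstname", "lastname", "email")


def person_labels(prefix):
    """Map each PersonLabels argument to a prefixed column name."""
    return {field: f"{prefix}_{field}" for field in LABEL_FIELDS}


# One entry per person: annotation name -> label mapping.
annotation_specs = {
    "customer": person_labels("customer"),
    "agent": person_labels("agent"),
}

# Sanity check: no column should belong to more than one person.
all_columns = [col for labels in annotation_specs.values() for col in labels.values()]
assert len(all_columns) == len(set(all_columns)), "a column is annotated twice"

for name, labels in annotation_specs.items():
    print(name, labels)

# With Synthesized installed, each entry would become its own annotation,
# following the pattern shown above:
#   people = [Person(name=n, labels=PersonLabels(**kw))
#             for n, kw in annotation_specs.items()]
#   df_meta = MetaExtractor.extract(df=data, annotations=people)
```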
synthesizer = HighDimSynthesizer(df_meta=df_meta)
synthesizer.learn(...)
df_synthesized = synthesizer.synthesize(num_rows=...)
PersonModel
PersonModel encapsulates the attributes of a person. When paired with a Person meta, it can learn the attributes that define a person and then generate data from what it has learned.
PersonModel captures gender using a Gender model internally and can be used to create the following attributes:
- gender (orig.)
- title (orig.)
- first_name
- last_name
- email
- username
- password
- home_number
- work_number
- mobile_number
Attributes marked with 'orig.' have values that correspond to the original dataset. The rest are intelligently generated based on the hidden model for the hidden attribute, _gender = {"F", "M", "NB", "A"}.
There are 3 special configuration cases for this model that should be considered:
- The attribute gender is present: In this case, the hidden model for _gender is based directly on the gender attribute. All values in the gender attribute should correspond to "F", "M", "U" or <NA>. In other words, there should be no ambiguous values in the collection "A".
- No gender present but title is present: The hidden model for _gender can be based on the available titles. As this is not a direct correspondence, not all values will correspond to a single collection. In other words, there may be some ambiguous values in the collection "A".
- Neither gender nor title is present: The hidden model for _gender cannot be fitted to the data, so the _gender attribute is assumed to be evenly distributed amongst the genders specified in the config.
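The three cases can be sketched in plain Python as below. This only illustrates the decision logic described above and is not PersonModel's implementation: the function `hidden_gender`, the title lookup table, the first-letter normalisation of gender values, and the default list of configured genders are all assumptions.

```python
import random

# Hypothetical title lookup; "A" marks titles that do not imply one gender.
TITLE_TO_GENDER = {"mr": "M", "mrs": "F", "ms": "F", "miss": "F", "mx": "NB", "dr": "A"}


def hidden_gender(row, config_genders=("F", "M"), rng=random):
    """Return a value for the hidden _gender attribute of one row (a dict)."""
    gender = row.get("gender")
    if gender is not None:
        # Case 1: gender is present, so use it directly
        # (assumed normalisation: first letter, upper-cased).
        return str(gender).strip().upper()[:1]
    title = row.get("title")
    if title is not None:
        # Case 2: only title is present; unknown or neutral titles are ambiguous.
        key = str(title).strip().lower().rstrip(".")
        return TITLE_TO_GENDER.get(key, "A")
    # Case 3: nothing to fit, so draw evenly from the configured genders
    # (the default tuple above is an assumption, not the library default).
    return rng.choice(list(config_genders))


print(hidden_gender({"gender": "Female", "title": "Dr"}))  # case 1
print(hidden_gender({"title": "Mr."}))                     # case 2
print(hidden_gender({"title": "Dr"}))                      # case 2, ambiguous
print(hidden_gender({}))                                   # case 3
```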
The locale of the generated people can be set through person_locale in PersonModelConfig. E.g. person_locale = 'ru_RU' will generate details for people from Russia. This can be useful for synthesizing details of people belonging to a particular locality.
import pandas as pd
import numpy as np
from synthesized.metadata.factory import MetaExtractor
from synthesized.config import PersonModelConfig, PersonLabels
from synthesized.metadata.value import Person
from synthesized.model.models import PersonModel
meta = Person(
    'person',
    labels=PersonLabels(
        title='title', gender='gender', fullname='name',
        firstname='firstname', lastname='lastname'
    )
)
person_model_config = PersonModelConfig()
person_model_config.person_locale = 'zh_CN'
model = PersonModel(meta=meta, config=person_model_config)
df = pd.DataFrame({
    'gender': np.random.choice(['m', 'f', 'u'], size=100),
    'title': np.random.choice(['mr', 'mr.', 'mx', 'miss', 'Mrs'], size=100)
})
df[[c for c in model.params.values() if c not in df.columns]] = 'test'
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(3)
Address
Similarly, an Address annotation allows Synthesized to generate fake address details. Currently, only UK addresses can be generated.
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels
The columns of a dataset that relate to the attributes of an address are specified using AddressLabels.
address = Address(
    name='address',
    labels=AddressLabels(
        postcode='postcode',
        county='county',
        city='city',
        district='district',
        street='street_name',
        house_number='house_number'
    )
)
df_meta = MetaExtractor.extract(df=data, annotations=[address])
AddressModel
AddressModel models addresses. It uses Address meta, which represents columns with different address labels, such as city, house_number, postcode, full_address, etc., to capture all the information needed to recreate similar synthetic data.
AddressModelConfig can also be provided as a part of the initialization. AddressModelConfig contains information such as whether or not an address file is provided, or if the postcodes need to be learned for address synthesis.
E.g. for postcode "EC2A 2DP":
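For reference, a UK postcode such as this one has a standard nested structure: area, district, sector and unit. The sketch below splits a postcode into those parts with plain Python. The function name `split_uk_postcode` is hypothetical, it expects a space between the outward and inward codes, and it is not necessarily how AddressModel handles postcodes internally.

```python
import re


def split_uk_postcode(postcode):
    """Split a UK postcode into its nested area/district/sector/unit parts."""
    # Outward code (e.g. "EC2A") and inward code (e.g. "2DP") are space-separated.
    outward, inward = postcode.strip().upper().split()
    # The area is the leading one or two letters of the outward code.
    area = re.match(r"[A-Z]{1,2}", outward).group()
    return {
        "area": area,
        "district": outward,
        "sector": f"{outward} {inward[0]}",
        "unit": f"{outward} {inward}",
    }


parts = split_uk_postcode("EC2A 2DP")
for level, value in parts.items():
    print(f"{level}: {value}")
```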
import pandas as pd
from synthesized.metadata.factory import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel
# No address file is provided and postcodes are not learned from the data
config = AddressModelConfig(addresses_file=None, learn_postcodes=False)
df = pd.DataFrame({
    'postcode': ["" for _ in range(10)],
    'street': ["" for _ in range(10)],
    'full_address': ["" for _ in range(10)],
    'city': ["" for _ in range(10)]
})
annotations = [Address(
    name='Address',
    nan_freq=0.3,
    labels=AddressLabels(
        postcode='postcode', city='city',
        street='street', full_address='full_address'
    )
)]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)
import pandas as pd
from faker import Faker
from synthesized.metadata.factory import MetaExtractor
from synthesized.metadata.value import Address
from synthesized.config import AddressLabels, AddressModelConfig
from synthesized.model.models import AddressModel
# An address file is provided and postcodes are learned from the data
address_file_path = 'data/addresses.jsonl.gz'
config = AddressModelConfig(addresses_file=address_file_path, learn_postcodes=True)
fkr = Faker('en_GB')
df = pd.DataFrame({
    'postcode': [fkr.postcode() for _ in range(10)],
    'street': [fkr.street_name() for _ in range(10)],
    'full_address': [fkr.address() for _ in range(10)],
    'city': [fkr.city() for _ in range(10)]
})
annotations = [Address(
    name='Address',
    nan_freq=0.3,
    labels=AddressLabels(
        postcode='postcode', city='city',
        street='street', full_address='full_address'
    )
)]
meta = MetaExtractor.extract(df, annotations=annotations)
model = AddressModel(meta['Address'], config=config)
model.meta.revert_df_from_children(df)
model.fit(df)
model.sample(n=3)
Bank
Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.
from synthesized.metadata.value import Bank
from synthesized.config import BankLabels
The columns of a dataset that relate to the bank account attributes are specified using BankLabels.
bank = Bank(
    name='bank',
    labels=BankLabels(
        sort_code='sort_code',
        account='account_number'
    )
)
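To illustrate the two formats, the sketch below produces a random 8-digit account number and a random 6-digit sort code using only the standard library. It is a stand-in for the shape of the values, not Synthesized's generator; `fake_account_number` and `fake_sort_code` are hypothetical helpers.

```python
import random

DIGITS = "0123456789"


def fake_account_number(rng):
    """Return 8 random digits as a string, keeping any leading zeros."""
    return "".join(rng.choice(DIGITS) for _ in range(8))


def fake_sort_code(rng):
    """Return 6 random digits as a string, keeping any leading zeros."""
    return "".join(rng.choice(DIGITS) for _ in range(6))


rng = random.Random(42)  # seeded so repeated runs give the same values
account = fake_account_number(rng)
sort_code = fake_sort_code(rng)
print(account, sort_code)
```

Keeping both values as strings rather than integers preserves leading zeros, which matters when the columns are later compared or joined.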
FormattedString
A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers, or customer account numbers that have a specific format.
from synthesized.metadata.value.categorical import FormattedString
The FormattedString is defined by passing the respective column name and a regex pattern:
regex = r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$"
social_security = FormattedString(name="social_security_number", pattern=regex)
df_meta = MetaExtractor.extract(df=data, annotations=[social_security])
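Before annotating a column, it can help to confirm which values the pattern accepts. The sketch below checks a few hand-written candidates against the same social security number pattern using Python's built-in `re` module; the helper `is_valid_ssn` and the candidate values are illustrative only.

```python
import re

# Same pattern as above, written as a raw string.
SSN_PATTERN = re.compile(r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$")


def is_valid_ssn(value):
    """Return True if the value conforms to the pattern."""
    return SSN_PATTERN.match(value) is not None


candidates = [
    "123-45-6789",  # well-formed
    "666-45-6789",  # first group may not be 666
    "900-12-3456",  # first group may not start with 9
    "123-00-6789",  # second group may not be 00
    "123-45-0000",  # third group may not be 0000
]
for value in candidates:
    print(value, is_valid_ssn(value))
```

The negative lookaheads `(?!...)` are what exclude the reserved groups, so only the first candidate conforms.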