Entity Annotation#

Important

Configuring an entity annotation is necessary to generate realistic fake personally identifiable information (PII) such as customer names and addresses. Synthesized does not currently recognize fields that contain PII automatically, so by default the original values from such fields will be reproduced in the generated data.

Tabular datasets often contain fields that, when combined, describe a specific entity such as a unique person or postal address. For example, consider the dataset below, which contains customer PII:

title  first_name  last_name  gender  email                    amount
Mr     John        Doe        Male    john.doe@gmail.com       101.2
Mrs    Jane        Smith      Female  jane.smith@gmail.com     28.2
Dr     Albert      Taylor     Male    albert.taylor@aol.com    98.1
Ms     Alice       Smart      Female  alice.smart@hotmail.com  150.3

The combination of (“title”, “first_name”, “last_name”, “gender”, “email”) describes a unique person in this data, and there are strict relationships between these attributes. For example, when “title” is “Mrs” or “Ms”, “first_name” will most likely contain a name given to females.

When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset must be annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
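
To see why linking matters, the minimal sketch below (plain Python, not the Synthesized API) contrasts drawing each column independently with drawing the attributes of a person together. The rows come from the example dataset; the FEMALE_TITLES rule and the helper names are assumptions made only for this illustration. Synthesized generates new entities rather than resampling rows; resampling is used here only to keep the sketch short.

```python
import random

# (title, first_name, gender) taken from the example dataset above.
ROWS = [
    ("Mr", "John", "Male"),
    ("Mrs", "Jane", "Female"),
    ("Dr", "Albert", "Male"),
    ("Ms", "Alice", "Female"),
]

# Assumed rule for this illustration only: these titles imply a female gender.
FEMALE_TITLES = {"Mrs", "Ms"}


def is_consistent(title, gender):
    """True unless a female-only title is paired with another gender."""
    return title not in FEMALE_TITLES or gender == "Female"


def sample_independently(rows, n, rng):
    """Draw every column on its own, ignoring links between columns."""
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def sample_as_entity(rows, n, rng):
    """Draw whole rows, so the attributes of one person stay together."""
    return [rng.choice(rows) for _ in range(n)]


rng = random.Random(0)
independent = sample_independently(ROWS, 1000, rng)
linked = sample_as_entity(ROWS, 1000, rng)

broken = sum(not is_consistent(t, g) for t, _, g in independent)
still_broken = sum(not is_consistent(t, g) for t, _, g in linked)
print(f"inconsistent rows, independent sampling: {broken}")
print(f"inconsistent rows, entity sampling: {still_broken}")
```

Independent sampling pairs titles such as “Mrs” with “Male” in a noticeable share of rows, while sampling the entity as a whole never does.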

Currently, Synthesized can handle person, address, bank and generic formatted string entities.

Person#

Generating synthetic PII for individuals in a dataset can be achieved by defining a Person annotation.

In [1]: from synthesized.metadata.value import Person

In [2]: from synthesized.config import PersonLabels

The columns of a dataset that relate to the attributes of a person are specified using PersonLabels. This is used to define the Person values that Synthesized can then generate.

In [3]: person = Person(
   ...:      name='person',
   ...:      labels=PersonLabels(
   ...:          gender_label='gender',
   ...:          title_label='title',
   ...:          firstname_label='first_name',
   ...:          lastname_label='last_name',
   ...:          email_label='email'
   ...:      )
   ...:  )
   ...: 
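In [4]: from synthesized.metadata.factory import MetaExtractor
   ...: 
   ...: # data is the pandas DataFrame containing the dataset shown above
   ...: df_meta = MetaExtractor.extract(df=data, annotations=[person])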

Note

It is possible to define multiple Person annotations if a dataset contains PII columns for more than one person. These must be created as separate Person objects with unique names, and then passed in the list of annotations, e.g. MetaExtractor.extract(df=..., annotations=[person_1, person_2])

In [5]: from synthesized import HighDimSynthesizer
   ...: synthesizer = HighDimSynthesizer(df_meta=df_meta)

In [6]: synthesizer.learn(...)

In [7]: df_synthesized = synthesizer.synthesize(num_rows=...)

PersonModel#

PersonModel encapsulates the attributes of a person. When paired with a Person meta, it learns the attributes that define a person and generates data from what it has learned. PersonModel captures gender internally using a Gender model and can be used to create the following attributes:

  • gender (orig.)

  • title (orig.)

  • first_name

  • last_name

  • email

  • username

  • password

  • home_number

  • work_number

  • mobile_number

Attributes marked with ‘orig.’ have values that correspond to the original dataset. The rest are generated based on a model of the hidden attribute _gender, which takes values in {“F”, “M”, “NB”, “A”}.

There are 3 special configuration cases for this model that should be considered:

1. The attribute gender is present: In this case, the hidden model for _gender is based directly on the gender attribute. All values in the gender attribute should correspond to “F”, “M”, “U” or <NA>. In other words, there should be no ambiguous values in the collection “A”.

2. No gender present but title is present: The hidden model for _gender can be based on the available titles. As this is not a direct correspondence, not all values will correspond to a single collection. In other words, there may be some ambiguous values in the collection “A”.

3. Neither gender nor title are present: The hidden model for gender cannot be fitted to the data and so the _gender attribute is assumed to be evenly distributed amongst the genders specified in the config.
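
The three cases can be summarised as a small decision function. This is a sketch of the logic described above, not the library's implementation: the function names and the title lookup table are hypothetical and chosen only for illustration.

```python
# Hypothetical title lookup, for illustration only. Titles that do not
# identify a gender fall into the ambiguous collection "A".
TITLE_TO_GENDER = {"mr": "M", "mrs": "F", "ms": "F", "miss": "F"}


def gender_source(columns):
    """Pick what the hidden _gender model is based on (cases 1 to 3)."""
    if "gender" in columns:
        return "gender"   # case 1: direct correspondence
    if "title" in columns:
        return "title"    # case 2: inferred, may be ambiguous
    return "uniform"      # case 3: nothing to fit, spread evenly


def gender_from_title(title):
    """Map a title to a gender collection, or "A" when it is ambiguous."""
    key = title.strip().lower().rstrip(".")
    return TITLE_TO_GENDER.get(key, "A")


def uniform_gender_weights(genders):
    """Case 3: equal probability for every configured gender."""
    return {g: 1 / len(genders) for g in genders}


print(gender_source({"title", "first_name"}))  # title
print(gender_from_title("Dr"))                 # A
print(uniform_gender_weights(["F", "M"]))      # {'F': 0.5, 'M': 0.5}
```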

Note

PersonModel can be given a PersonModelConfig during initialization. person_locale is a member variable of the PersonModelConfig class that specifies the locale of the people to generate.

For example, person_locale = 'ru_RU' refers to people from Russia. This is useful for synthesizing details of people from a particular locale.

In [8]: import numpy as np
   ...: import pandas as pd

In [9]: from synthesized.metadata.factory import MetaExtractor

In [10]: from synthesized.config import PersonModelConfig, PersonLabels

In [11]: from synthesized.metadata.value import Person

In [12]: from synthesized.model.models import PersonModel

In [13]: meta = Person('person', labels=PersonLabels(title_label='title', gender_label='gender', fullname_label='name',
   ....:                                firstname_label='firstname', lastname_label='lastname'))
   ....: 

In [14]: person_model_config = PersonModelConfig()

In [15]: person_model_config.person_locale='zh_CN'

In [16]: model = PersonModel(meta=meta, config=person_model_config)

In [17]: df = pd.DataFrame({'gender': np.random.choice(['m', 'f', 'u'], size=100), 'title': np.random.choice(['mr', 'mr.', 'mx', 'miss', 'Mrs'], size=100)})

In [18]: df[[c for c in model.params.values() if c not in df.columns]] = 'test'

In [19]: model.meta.revert_df_from_children(df)

In [20]: model.fit(df)
Out[20]: PersonModel(meta=<Nominal[U]: Person(name=person)>)

In [21]: model.sample(3)
Out[21]: 
  gender title       name firstname lastname
0      u    mx  丽娟 Harris        丽娟   Harris
1      u    mx  华 DuBuque         华  DuBuque
2      m   mr.   阳 Bednar         阳   Bednar

Address#

Similarly, an Address annotation allows Synthesized to generate fake address details. Currently, only UK addresses can be generated.

In [22]: from synthesized.metadata.value import Address

In [23]: from synthesized.config import AddressLabels

The columns of a dataset that relate to the attributes of an address are specified using AddressLabels.

In [24]: address = Address(
   ....:      name='address',
   ....:      labels=AddressLabels(
   ....:          postcode_label='postcode',
   ....:          county_label='county',
   ....:          city_label='city',
   ....:          district_label='district',
   ....:          street_label='street_name',
   ....:          house_number_label='house_number'
   ....:      )
   ....:  )
   ....: 
In [25]: df_meta = MetaExtractor.extract(df=data, annotations=[address])

AddressModel#

AddressModel models addresses. It uses an Address meta, which represents columns with different address labels such as city, house_number, postcode and full_address, to capture all the information needed to recreate similar synthetic data.

AddressModelConfig can also be provided at initialization. It specifies, for example, whether an address file is provided and whether postcodes need to be learned for address synthesis.

Note

AddressModel uses PostcodeModel to learn and synthesize addresses. If an address file is provided, the addresses corresponding to the learned postcodes are sampled from the file. If no address file is provided, Faker is used to generate addresses.

Tip

The AddressModel class has a member variable postcode_level, which allows a partial or full postcode to be used for fitting and sampling.

E.g. for postcode “EC2A 2DP”:
postcode_level=0 will signify “EC”
postcode_level=1 will signify “EC2A”
postcode_level=2 will signify “EC2A 2DP”
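
The levels above can be illustrated with a small standalone helper. truncate_postcode is a hypothetical function written for this example and is not part of Synthesized; it assumes a well-formed UK postcode with a space between the outward and inward parts.

```python
import re


def truncate_postcode(postcode, postcode_level):
    """Return the part of a UK postcode selected by postcode_level.

    0 -> area (leading letters), 1 -> outward code, 2 -> full postcode.
    """
    # Normalise case and collapse repeated whitespace.
    full = " ".join(postcode.upper().split())
    outward = full.split(" ")[0]
    if postcode_level == 0:
        return re.match(r"[A-Z]+", outward).group(0)
    if postcode_level == 1:
        return outward
    if postcode_level == 2:
        return full
    raise ValueError("postcode_level must be 0, 1 or 2")


for level in (0, 1, 2):
    print(level, truncate_postcode("EC2A 2DP", level))
```

A lower level groups more addresses under the same key, which may help when the data contains few rows per full postcode.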

Without address file#

In [26]: from synthesized.metadata.value import Address

In [27]: from synthesized.config import AddressLabels, AddressModelConfig

In [28]: from synthesized.model.models import AddressModel

In [29]: config = AddressModelConfig(addresses_file=None, learn_postcodes=False)

In [30]: df = pd.DataFrame({
   ....:     'postcode': ["" for _ in range(10)],
   ....:     'street': ["" for _ in range(10)],
   ....:     'full_address': ["" for _ in range(10)],
   ....:     'city': ["" for _ in range(10)]
   ....: })
   ....: 

In [31]: annotations = [Address(name='Address', nan_freq=0.3,
   ....:                labels=AddressLabels(postcode_label='postcode', city_label='city',
   ....:                                     street_label='street', full_address_label='full_address'))]
   ....: 

In [32]: meta = MetaExtractor.extract(df, annotations=annotations)

In [33]: model = AddressModel(meta['Address'], config=config)

In [34]: model.meta.revert_df_from_children(df)

In [35]: model.fit(df)
Out[35]: AddressModel(meta=<Nominal[U]: Address(name=Address)>)

In [36]: model.sample(n=3)
Out[36]: 
  postcode  ...                                       full_address
0  DA9 0TX  ...  Flat 39m 79 Walker drive, Callum manor, West G...
1  HS1 6GH  ...  Flat 18 31 Watkins alley, Glenn park, Davishav...
2  N90 8FF  ...  Studio 31 01 Roberts station, Chelsea greens, ...

[3 rows x 4 columns]

With address file#

In [37]: from faker import Faker

In [38]: address_file_path = 'data/addresses.jsonl.gz'

In [39]: config = AddressModelConfig(addresses_file=address_file_path, learn_postcodes=True)

In [40]: fkr = Faker('en_GB')

In [41]: df = pd.DataFrame({
   ....:     'postcode': [fkr.postcode() for _ in range(10)],
   ....:     'street': [fkr.street_name() for _ in range(10)],
   ....:     'full_address': [fkr.address() for _ in range(10)],
   ....:     'city': [fkr.city() for _ in range(10)]
   ....: })
   ....: 

In [42]: annotations = [Address(name='Address', nan_freq=0.3,
   ....:                labels=AddressLabels(postcode_label='postcode', city_label='city',
   ....:                                     street_label='street', full_address_label='full_address'))]
   ....: 

In [43]: meta = MetaExtractor.extract(df, annotations=annotations)

In [44]: model = AddressModel(meta['Address'], config=config)

In [45]: model.meta.revert_df_from_children(df)

In [46]: model.fit(df)
Out[46]: AddressModel(meta=<Nominal[U]: Address(name=Address)>)

In [47]: model.sample(n=3)
Out[47]: 
   postcode  ...                                    full_address
0  BR0M 7BT  ...         Flat 6\nLindsey neck\nGaryhaven\nE0 9YB
1   RH0 3XT  ...  3 Shaw throughway\nNorth Terencestad\nKY01 5WY
2   B93 3NR  ...       Flat 07\nJessica alley\nEllisbury\nG3 9UR

[3 rows x 4 columns]

Bank#

Defining a Bank annotation allows Synthesized to generate fake bank account numbers and sort codes. Currently, Synthesized can only generate 8-digit account numbers and 6-digit sort codes.

In [48]: from synthesized.metadata.value import Bank

In [49]: from synthesized.config import BankLabels

The columns of a dataset that relate to the bank account attributes are specified using BankLabels.

In [50]: bank = Bank(
   ....:      name='bank',
   ....:      labels=BankLabels(
   ....:         sort_code_label='sort_code',
   ....:         account_label='account_number'
   ....:      )
   ....:  )
   ....: 
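
For reference, the two supported formats can be illustrated without Synthesized. The helper below is a hypothetical standalone sketch that draws random digit strings of the stated lengths; it makes no claim about how Synthesized generates or validates these values. The dictionary keys reuse the column names from the annotation above.

```python
import random

DIGITS = "0123456789"


def fake_bank_details(rng):
    """Return a random 6-digit sort code and 8-digit account number."""
    sort_code = "".join(rng.choice(DIGITS) for _ in range(6))
    account_number = "".join(rng.choice(DIGITS) for _ in range(8))
    return {"sort_code": sort_code, "account_number": account_number}


details = fake_bank_details(random.Random(42))
print(details)
```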

FormattedString#

A FormattedString annotation can be used to generate synthetic data that conforms to a given regular expression, e.g. social security numbers, or customer account numbers that have a specific format.

In [51]: from synthesized.metadata.value.categorical import FormattedString

The FormattedString is defined by passing the respective column name, and a regex pattern:

In [52]: regex = r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$"

In [53]: social_security = FormattedString(name="social_security_number", pattern=regex)

In [54]: df_meta = MetaExtractor.extract(df=data, type_overrides=[social_security])
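
Before passing a pattern to FormattedString, it can help to check it against a few known values with Python's standard re module. The sketch below does this for the social security pattern; the sample strings are made up for this check.

```python
import re

# Same pattern as above, written as a raw string.
SSN_PATTERN = re.compile(r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$")

samples = [
    "123-45-6789",  # well formed
    "666-45-6789",  # area number 666 is excluded
    "900-45-6789",  # area numbers starting with 9 are excluded
    "123-00-6789",  # group number 00 is excluded
    "123-45-0000",  # serial number 0000 is excluded
]

for value in samples:
    print(value, bool(SSN_PATTERN.match(value)))
```

Only the first sample matches; the negative lookaheads reject the other four.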