Privacy Masks

Synthesized supports attribute-level data anonymization, in which information relating to a data subject (e.g. a client's name) is removed, thereby eliminating the possibility of identifying the data subject. The process is irreversible and is achieved through data obfuscation. If needed, the user can obfuscate any data with any of the following techniques:

  1. Random strings. Generate random strings with a similar format to the input values; for example, "490GH830L" could be transformed into "L3N8O3H2M".

  2. Hashing. Using an HMAC-SHA256 hashing algorithm, the input values are hashed into a fixed-length string. A secret string (which is either configured or randomly generated) is used to vary the hashed output.

  3. Nulling. The contents of a column can be completely removed, so that the output dataset contains an empty column.

  4. Binning. Individual attribute values are replaced with a broader category. For example, the value "19" of the attribute "Age" may be replaced by "Age ≤ 20", the value "23" by "20 < Age ≤ 30", etc.

  5. Date shifting. Shift dates by a random number of days.

  6. Time extraction. Extract/preserve a component of a datetime column as specified by a given config.

  7. Typos. Generate typos in the input values.

  8. Whitespace. Add or remove whitespace from the input values.

  9. Acronym. Replace the input values with their acronyms.
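As an illustration of the random-strings technique, here is a minimal sketch using only the Python standard library. It is not the Synthesized implementation: the function name `random_like` is hypothetical, and it assumes that "similar format" means keeping the character class (digit, upper case, lower case) at each position, which is stricter than the example above requires.

```python
import random
import string


def random_like(value, rng):
    """Return a random string with the same character classes as `value`.

    Hypothetical helper for illustration; not part of the synthesized package.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators and punctuation unchanged
    return "".join(out)


rng = random.Random(0)  # seeded so the run is repeatable
masked = random_like("490GH830L", rng)
print(masked)
```

The output has the same length as the input, with digits where the input had digits and upper-case letters where it had upper-case letters.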

FormatPreservingMask

FormatPreservingMask can apply generic format-preserving hashing transformations for a given regex pattern.

from synthesized.privacy import FormatPreservingMask
import pandas as pd

df = pd.DataFrame({'Id': ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']})
transformer = FormatPreservingMask('Id', pattern=r'[ABC]{3}\d{3}')
transformer.fit(df)
transformer.transform(df)

IDs that are the same in the original DataFrame are also the same in the transformed DataFrame.

Table 1. format_preserving_mask.csv

Original    Masked
AAA001      AAB474
BBB001      BCC715
AAA002      BBA282
AAA001      CAA225
BBB001      ACA623

Properties

  • name: Name of the column to mask.

  • pattern: Regex pattern of the string to be generated.

  • seed (optional): Random seed for the transformer.
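The idea of a deterministic, format-preserving transformation can be sketched with the standard library as follows. This is only an illustration under stated assumptions, not how FormatPreservingMask is implemented: the helper `mask_id` is hypothetical, it handles only the fixed pattern `[ABC]{3}\d{3}` rather than arbitrary regexes, and different inputs may collide on the same output.

```python
import hashlib
import hmac
import re

PATTERN = re.compile(r"[ABC]{3}\d{3}")
LETTERS = "ABC"
DIGITS = "0123456789"


def mask_id(value, secret):
    """Deterministically map `value` to a string that matches PATTERN.

    Hypothetical helper for illustration; not part of the synthesized package.
    """
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    # One digest byte selects each output character.
    letters = "".join(LETTERS[b % len(LETTERS)] for b in digest[:3])
    digits = "".join(DIGITS[b % len(DIGITS)] for b in digest[3:6])
    return letters + digits


secret = b"example-secret"  # assumed key, plays the role of the seed
ids = ["AAA001", "BBB001", "AAA002", "AAA001", "BBB001"]
masked = [mask_id(v, secret) for v in ids]
print(masked)
```

Because the output depends only on the input value and the key, repeated IDs map to the same masked value, and every masked value still matches the pattern.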

HashingMask

HashingMask applies an HMAC-SHA256 hashing transformation to the values of a given column. An optional secret string can be provided as the seed for the hashing function. If none is provided, a random string is generated and used as the seed.

Two HashingMask objects with the same seed produce the same output for the same input. HashingMask objects with different seeds produce different outputs for the same input.

from synthesized.privacy import HashingMask
import pandas as pd

df = pd.DataFrame({'Id': ['AAPL', 'AMZN', 'GOOG', 'AAPL', 'NTFX']})
transformer = HashingMask('Id', seed='some_secret')
transformer.fit(df)
transformer.transform(df.copy())

transformer2 = HashingMask('Id', seed='other_secret')
transformer2.fit_transform(df)  # different output to the first transformer.

IDs that are the same in the original DataFrame are encoded to the same value in the transformed DataFrame.

Table 2. hashing_mask.csv

Original    Masked
AAPL        b1af07c139ce64efd19aff23ab605acb
AMZN        4b0038c8af4cc6bde3ba98e29044a5e9
GOOG        e53afe2ef7b5b0dfc0ed5fa88ab2bce9
AAPL        b1af07c139ce64efd19aff23ab605acb
NTFX        145e575dc0aafa07671afaa3cf1b988h

Properties

  • name: Name of the column to mask.

  • seed (optional): Secret string used as the seed for the hashing function. If not provided, a random string is generated.
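The role of the secret in keyed hashing can be seen directly with Python's standard `hmac` module. The sketch below is independent of the synthesized package; the helper name `hmac_sha256_hex` is hypothetical, and it returns the full 64-character hex digest, which may differ from the output encoding HashingMask uses.

```python
import hashlib
import hmac


def hmac_sha256_hex(value, secret):
    """Return the HMAC-SHA256 of `value` under `secret` as a hex string."""
    return hmac.new(
        secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256
    ).hexdigest()


a = hmac_sha256_hex("AAPL", "some_secret")
b = hmac_sha256_hex("AAPL", "some_secret")   # same secret, same input
c = hmac_sha256_hex("AAPL", "other_secret")  # different secret

print(a == b)  # equal: same secret and input
print(a == c)  # not equal: the secret changes the digest
```

Without the secret, an attacker cannot simply hash candidate values and compare them with the masked column, which is the reason for using a keyed hash rather than plain SHA-256.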

NanMask

NanMask masks the data by nulling out a given column.

The following example illustrates it:

from synthesized.privacy import NanMask
import pandas as pd

df = pd.DataFrame({'card_no': ['490 508 10L', 'ff4sff4', 'jdj DFj 34', '123POFjd33', '2334 fgg4 223']})
transformer = NanMask(name='card_no')
transformer.fit(df)
transformer.transform(df)

Table 3. nan_mask.csv

Original         Masked
490 508 10L      Null
ff4sff4          Null
jdj DFj 34       Null
123POFjd33       Null
2334 fgg4 223    Null

Properties

  • name: Name of the column to mask.

RoundingMask

RoundingMask masks a numerical column by binning its values into N bins. The bins argument sets the number of bins used to divide the column's value range; the default value is 20.

The following example illustrates it:

from synthesized.privacy import RoundingMask
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': np.random.randint(1, 97, size=(5000,))})
transformer = RoundingMask(name='age', bins=10)
transformer.fit(df)
transformer.transform(df)

Table 4. rounding_mask.csv

Original    Masked
5           (0.999, 11.0]
72          (69.0, 78.0]
67          (40.0, 50.0]
59          (59.0, 69.0]
60          (59.0, 69.0]

Properties

  • name: Name of the column to mask.

  • bins (optional): Number of bins to split the data into. Default 20.
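To make the binning step concrete, here is a minimal standard-library sketch that replaces each value with the label of the interval it falls into. It assumes equal-frequency (quantile) bins with right-closed intervals; whether RoundingMask uses quantile or equal-width bins is not stated here, and the function name `quantile_bin_labels` is hypothetical.

```python
import bisect
import statistics


def quantile_bin_labels(values, bins):
    """Label each value with the quantile interval it falls into.

    Hypothetical helper for illustration; not part of the synthesized package.
    """
    # `bins - 1` interior cut points split the data into `bins` groups.
    cuts = statistics.quantiles(values, n=bins, method="inclusive")
    edges = [min(values)] + cuts + [max(values)]
    labels = []
    for v in values:
        # bisect_left gives right-closed intervals, edges[i] < v <= edges[i + 1];
        # the minimum value is placed in the first interval.
        i = bisect.bisect_left(cuts, v)
        labels.append(f"({edges[i]}, {edges[i + 1]}]")
    return labels


ages = list(range(1, 101))
labels = quantile_bin_labels(ages, bins=4)
print(labels[0], labels[-1])
```

Each original value is now only known up to its interval, so fewer bins give stronger masking at the cost of less detail.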

DateShiftMask

DateShiftMask shifts datetime fields by adding or subtracting days. The shift can be a random number of days or the same number of days, and it can be applied to all of the fields or to groups of values (specified by entity_col).

The following example illustrates a use case with the DateShiftMask:

from synthesized.privacy import DateShiftMask
import pandas as pd
df = pd.DataFrame({'date': ['5/06/2014', '10/06/2015', '18/06/2015', '7/07/2015', '14/07/2016']})

transformer = DateShiftMask(name='date', upper_bound=10, lower_bound=10)
transformer.fit(df)
transformer.transform(df)

Table 5. date_shift_mask.csv

Original      Masked
5/06/2014     29/05/2014
10/06/2015    12/06/2015
18/06/2015    17/06/2015
7/07/2015     10/07/2015
14/07/2016    15/07/2016

Properties

  • name: Name of the column to mask.

  • upper_bound (optional): Range of shift days forward. For example, 5 means dates are shifted at most 5 days into the future. Default 5.

  • lower_bound (optional): Range of shift days back. For example, 5 means dates are shifted at most 5 days into the past. Default 0.

  • maintain_diff (optional): Whether to maintain the time interval between events in a sequence. Default False.

  • maintain_order (optional): Whether to maintain the ordering of the sequence. Default True. Note: If maintain_diff=True then maintain_order is also set to True regardless of input.

  • entity_col (optional): Column identifying the unique entities to group by when maintaining the order/diff of events in a sequence. Default None.
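The difference between shifting each date independently and shifting a whole sequence by one shared offset can be sketched with the standard library. This is an illustration only: the function names are hypothetical, the day-first date format is assumed from the example above, and the real mask's handling of maintain_order and entity_col is not reproduced.

```python
import random
from datetime import datetime, timedelta

FMT = "%d/%m/%Y"  # assumed day/month/year format


def shift_dates(dates, lower_bound, upper_bound, rng):
    """Shift every date independently by a random number of days."""
    shifted = []
    for text in dates:
        # At most `lower_bound` days back and `upper_bound` days forward.
        offset = rng.randint(-lower_bound, upper_bound)
        shifted.append(datetime.strptime(text, FMT) + timedelta(days=offset))
    return shifted


def shift_dates_together(dates, lower_bound, upper_bound, rng):
    """Shift all dates by one shared offset, keeping their intervals intact."""
    offset = timedelta(days=rng.randint(-lower_bound, upper_bound))
    return [datetime.strptime(text, FMT) + offset for text in dates]


rng = random.Random(42)
dates = ["5/06/2014", "10/06/2015", "18/06/2015", "7/07/2015", "14/07/2016"]
independent = shift_dates(dates, lower_bound=10, upper_bound=10, rng=rng)
together = shift_dates_together(dates, lower_bound=10, upper_bound=10, rng=rng)
print([d.strftime(FMT) for d in independent])
print([d.strftime(FMT) for d in together])
```

A shared offset corresponds to the idea behind maintain_diff: the gaps between consecutive events are unchanged, whereas independent shifts can change them.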

TimeExtractionMask

TimeExtractionMask allows for the extraction of a specific section of a date or datetime field.

The following example illustrates it:

from synthesized.privacy import TimeExtractionMask

import pandas as pd
df = pd.DataFrame({'date': ['5/06/2014', '10/06/2015', '18/06/2015', '7/07/2015', '14/07/2016']})

transformer = TimeExtractionMask(name='date', portion='year')
transformer.fit(df)
transformer.transform(df)

Table 6. time_extraction_mask.csv

Original      Masked
5/06/2014     2014
10/06/2015    2015
18/06/2015    2015
7/07/2015     2015
14/07/2016    2016

Properties

  • name: Name of the column to mask.

  • portion (optional): The portion of the datetime to extract. Options include: "date", "dayofweek", "dayofyear", "hour", "microsecond", "minute", "month", "nanosecond", "quarter", "second", "time", "weekday", "weekofyear", and "year". Default "year".
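Extracting one component of a date can be sketched with the standard `datetime` module. The helper `extract_portion` is hypothetical, assumes day/month/year input strings as in the example above, and supports only those portions that are plain `datetime` attributes.

```python
from datetime import datetime

# Portions from the list above that exist as plain datetime attributes.
SUPPORTED = {"year", "month", "hour", "minute", "second", "microsecond"}


def extract_portion(text, portion="year", fmt="%d/%m/%Y"):
    """Parse `text` and return only the requested component.

    Hypothetical helper for illustration; not part of the synthesized package.
    """
    if portion not in SUPPORTED:
        raise ValueError(f"unsupported portion in this sketch: {portion}")
    return getattr(datetime.strptime(text, fmt), portion)


dates = ["5/06/2014", "10/06/2015", "18/06/2015", "7/07/2015", "14/07/2016"]
years = [extract_portion(d, "year") for d in dates]
months = [extract_portion(d, "month") for d in dates]
print(years)   # [2014, 2015, 2015, 2015, 2016]
print(months)  # [6, 6, 6, 7, 7]
```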

TypoMask

TypoMask can apply generic and random typo transformations for a given input string.

The typos introduced are random and can be any of the following:

Types of typos

  • Missing character

  • Character swap

  • Nearby character (assuming a QWERTY keyboard)

  • Extra character

  • Similar character

  • Repeated character (repeating an already existing character)

  • Random space

from synthesized.privacy import TypoMask
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Doe', 'Jane', 'Smith', 'John']})
transformer = TypoMask(name='Name', typo_rate=0.5)
transformer.fit(df)
transformer.transform(df)

Random typos are generated for the input values at the specified rate.

Table 7. typo_mask.csv

Original    Masked
John        Jonh
Doe         Doe
Jane        Janee
Smith       Smith
John        Johm

Properties

  • name: Name of the column to mask.

  • typo_rate: Rate of typos to introduce. Default 0.1.

  • seed (optional): Random seed for the transformer.
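A few of the typo types listed above can be sketched in plain Python as follows. This is a simplified illustration: the function names are hypothetical, only three typo types are covered, and it assumes that typo_rate is the probability that a given value receives one typo, which may not match how TypoMask interprets the rate.

```python
import random


def drop_char(s, rng):
    """Missing character: remove one character."""
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]


def swap_chars(s, rng):
    """Character swap: exchange two adjacent characters."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]


def repeat_char(s, rng):
    """Repeated character: duplicate an existing character."""
    i = rng.randrange(len(s))
    return s[:i] + s[i] + s[i:]


TYPO_FUNCS = (drop_char, swap_chars, repeat_char)


def add_typos(values, typo_rate, rng):
    """Apply one random typo to each non-empty value with probability `typo_rate`."""
    out = []
    for v in values:
        if v and rng.random() < typo_rate:
            v = rng.choice(TYPO_FUNCS)(v, rng)
        out.append(v)
    return out


names = ["John", "Doe", "Jane", "Smith", "John"]
print(add_typos(names, typo_rate=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the result repeatable, which is the role the seed property plays for the mask.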

WhiteSpaceMask

WhiteSpaceMask can apply generic and random whitespace transformations for a given input string.

from synthesized.privacy import WhiteSpaceMask
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Doe', 'Jane', 'Smith', 'John']})
transformer = WhiteSpaceMask(name='Name', whitespace_rate=0.5)
transformer.fit(df)
transformer.transform(df)

Whitespace is randomly added to or removed from the input values at the specified rate.

Table 8. whitespace_mask.csv

Original    Masked
John        Jo hn
Doe         Doe
Jane        J ane
Smith       Smith
John        John

Properties

  • name: Name of the column to mask.

  • whitespace_rate: Rate of whitespace to modify. Default 0.1.

  • seed (optional): Random seed for the transformer.
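The whitespace transformation can be sketched in a few lines of plain Python. The helper `perturb_whitespace` is hypothetical, and it assumes that whitespace_rate is the probability that a value is modified by inserting one space or removing one existing space; the mask's actual behavior may differ.

```python
import random


def perturb_whitespace(value, whitespace_rate, rng):
    """With probability `whitespace_rate`, insert or remove a single space.

    Hypothetical helper for illustration; not part of the synthesized package.
    """
    if len(value) < 2 or rng.random() >= whitespace_rate:
        return value
    spaces = [i for i, ch in enumerate(value) if ch == " "]
    if spaces and rng.random() < 0.5:
        i = rng.choice(spaces)  # remove one existing space
        return value[:i] + value[i + 1:]
    i = rng.randrange(1, len(value))  # insert at an interior position
    return value[:i] + " " + value[i:]


rng = random.Random(1)
names = ["John", "Doe", "Jane", "Smith", "John"]
masked = [perturb_whitespace(n, 1.0, rng) for n in names]
print(masked)
```

With a rate of 1.0 every name gains exactly one interior space, since none of the inputs contains a space to remove.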

AcronymMask

AcronymMask can apply generic acronym transformations for a given input string. The acronym is generated by taking the first letter of each word in the input string. Delimiters can be specified, through the delimeters argument, to separate the acronym letters. If more than one delimiter is specified, a random delimiter is chosen for each word from the list of supplied delimiters.

from synthesized.privacy import AcronymMask
import pandas as pd

df = pd.DataFrame({'Company': ["A Good Company", "A Bad Company", "A Great Company", "A Terrible Company", "A Wonderful Company"]})
transformer = AcronymMask(name='Company', delimeters=["", "-", "."])
transformer.fit(df)
transformer.transform(df)

Delimiters are chosen at random from the specified list for the input values.

Table 9. acronym_mask.csv

Original               Masked
A Good Company         A.G.C
A Bad Company          ABC
A Great Company        A.G.C
A Terrible Company     A-T-C
A Wonderful Company    AWC

Properties

  • name: Name of the column to mask.

  • delimeters: List or string of delimiters used to separate the acronym letters. Default "".

  • seed (optional): Random seed for the transformer.
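Building an acronym with a randomly chosen delimiter can be sketched as below. The helper `acronym` is hypothetical, and it assumes one delimiter is picked per input value (as the rows of Table 9 suggest); picking a delimiter per gap between letters would be a small change.

```python
import random


def acronym(text, delimiters, rng):
    """Join the first letter of each word with a randomly chosen delimiter.

    Hypothetical helper for illustration; not part of the synthesized package.
    """
    if isinstance(delimiters, str):
        delimiters = [delimiters]  # accept a single string as well as a list
    initials = [word[0] for word in text.split()]
    return rng.choice(delimiters).join(initials)


rng = random.Random(0)
companies = ["A Good Company", "A Bad Company", "A Great Company"]
print([acronym(c, ["", "-", "."], rng) for c in companies])
print(acronym("A Good Company", ".", rng))  # A.G.C
print(acronym("A Good Company", "", rng))   # AGC
```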

MaskingFactory

MaskingFactory can be used to create a set of data masking transformers, as described above, to transform multiple columns of a DataFrame in the same function call. To demonstrate this, we will use the following example DataFrame:

from synthesized.privacy import MaskingFactory
from faker import Faker
import pandas as pd

fkr = Faker()
df = pd.DataFrame({'Username': [fkr.user_name() for _ in range(1000)],
                    'Name': [fkr.name() for _ in range(1000)],
                    'Password': [fkr.password() for _ in range(1000)],
                    'CreditCardNo': [fkr.credit_card_number() for _ in range(1000)],
                    'Age': [fkr.pyint(min_value=10, max_value=78) for _ in range(1000)],
                    'MonthlyIncome': [fkr.pyint(min_value=1000, max_value=10000) for _ in range(1000)]})

To create a set of transformers to act on this DataFrame, a MaskingFactory object is first created. A config dictionary is then supplied to the create_masks() method of this factory to specify the masking technique to use on specific columns of the DataFrame:

masking_factory = MaskingFactory()
config = {
    "rounding": [
        {"name": "Age",
        "bins": "20"},
        {"name": "MonthlyIncome",
        "bins": "3"}
    ],
    "nan": [
        {"name": "Password"},
        {"name" : "CreditCardNo"}
    ]
}
privacy_masks = masking_factory.create_masks(df, config)
privacy_masks.fit(df)
privacy_masks.transform(df, inplace=True)

The possible keys for the config dictionary are the appropriate keywords for each masking technique, as described above. The values are lists of dictionaries, each of which specifies the name of the column to be masked, using the name keyword, as well as any additional arguments to be passed to the appropriate mask. For instance, in the above example, the columns Age and MonthlyIncome are masked using the RoundingMask with bins set for each column independently. The columns Password and CreditCardNo are masked using the NanMask, which requires no additional arguments.
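The config-driven dispatch described above can be sketched in plain Python. This is a toy model, not the MaskingFactory implementation: the names `MASKS`, `nan_mask`, `rounding_mask` and `apply_masks` are hypothetical, the table is a dictionary of lists rather than a DataFrame, and the rounding mask here returns equal-width bin indices rather than interval labels.

```python
def nan_mask(values, **_):
    """Null out every value in the column."""
    return [None] * len(values)


def rounding_mask(values, bins=20, **_):
    """Replace each value with the index of its equal-width bin."""
    bins = int(bins)  # the config may supply the number as a string
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Keyword in the config -> masking function (hypothetical registry).
MASKS = {"nan": nan_mask, "rounding": rounding_mask}


def apply_masks(table, config):
    """Return a copy of `table` with the configured columns masked."""
    masked = dict(table)
    for keyword, entries in config.items():
        mask = MASKS[keyword]
        for entry in entries:
            # Everything except "name" is passed on as mask arguments.
            options = {k: v for k, v in entry.items() if k != "name"}
            masked[entry["name"]] = mask(table[entry["name"]], **options)
    return masked


table = {"Name": ["x", "y", "z"], "Age": [10, 20, 78], "Password": ["a", "b", "c"]}
config = {"rounding": [{"name": "Age", "bins": "2"}], "nan": [{"name": "Password"}]}
result = apply_masks(table, config)
print(result)
```

Columns that are not named in the config, such as Name here, pass through unchanged.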