Privacy Masks

Synthesized supports attribute-level data anonymization, in which information relating to a data subject (e.g. a client's name) is removed so that the data subject can no longer be identified. The process is irreversible and is achieved through data obfuscation. The user can obfuscate any data with any of the following techniques:

  1. Random strings. Generate random strings with similar format to input values, for example "490GH830L" could be transformed into "L3N8O3H2M".

  2. Hashing. The input values are hashed into fixed-length strings using the HMAC-SHA256 algorithm. A secret string (either configured or randomly generated) is used to vary the hashed output.

  3. Nulling. The contents of a column are removed completely, so the output dataset contains an empty column.

  4. Binning. Individual attribute values are replaced with a broader category. For example, the value "19" of the attribute "Age" may be replaced by "Age ≤ 20", the value "23" by "20 < Age ≤ 30", etc.
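
As a rough illustration of the first technique, the sketch below replaces each character with a random character of the same class (digit, uppercase letter or lowercase letter) and leaves separators untouched. It uses only the Python standard library. The helper name random_like and the per-character rule are assumptions made for this example, not the Synthesized implementation, which may interpret "similar format" more loosely.

```python
import random
import string


def random_like(value: str, rng: random.Random) -> str:
    """Return a random string with the same length and character classes as value.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators such as spaces or dashes
    return "".join(out)


rng = random.Random(0)  # fixed seed only to make the example repeatable
masked = random_like("490GH830L", rng)
print(masked)
```

Because the replacement is random, the same input value generally maps to a different output each time, so this technique does not preserve equality between rows.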

FormatPreservingMask

FormatPreservingMask can apply generic format-preserving hashing transformations for a given regex pattern.

from synthesized.privacy import FormatPreservingMask
import pandas as pd

df = pd.DataFrame({'Id': ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']})
transformer = FormatPreservingMask('Id', pattern=r'[abc]-\d{3}')
transformer.fit(df)
transformer.transform(df)

IDs that are the same in the original DataFrame are also the same in the transformed DataFrame.

HashingMask

HashingMask can apply generic HMAC-SHA256 hashing transformations for a given regex pattern. An optional secret string can be provided as a seed for the hashing function. If this is not provided, a random string is generated and used as the seed.

Two HashingMask objects with the same seed produce the same output for the same input. Masks with different seeds produce different outputs for the same input.

from synthesized.privacy import HashingMask
import pandas as pd

df = pd.DataFrame({'Id': ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']})
transformer = HashingMask('Id', seed='some_secret')
transformer.fit(df)
transformer.transform(df.copy())

transformer2 = HashingMask('Id', seed='other_secret')
transformer2.fit_transform(df)  # different output to the first transformer.

IDs that are the same in the original DataFrame are also the same in the transformed DataFrame.
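
The seed behaviour described above follows from how HMAC-SHA256 works in general. The sketch below reproduces it with Python's standard hmac and hashlib modules. The helper name hmac_sha256_hex is an assumption made for this example, and the hex encoding of the digest is a choice of the sketch; the output format used by HashingMask is not stated here.

```python
import hashlib
import hmac


def hmac_sha256_hex(value: str, secret: str) -> str:
    """Return the HMAC-SHA256 digest of value, keyed by secret, as 64 hex characters.

    Hypothetical helper for illustration only.
    """
    return hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()


ids = ["AAA001", "BBB001", "AAA002", "AAA001", "BBB001"]

# Same secret: equal inputs give equal outputs.
hashed_a = [hmac_sha256_hex(v, "some_secret") for v in ids]

# Different secret: the same inputs give different outputs.
hashed_b = [hmac_sha256_hex(v, "other_secret") for v in ids]

print(hashed_a[0] == hashed_a[3])
print(hashed_a[0] == hashed_b[0])
```

Keeping the secret private matters: anyone who knows it can hash candidate values and compare them against the masked column.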

NanMask

NanMask masks the data by nulling out a given column.

The following example illustrates it:

from synthesized.privacy import NanMask
import pandas as pd

df = pd.DataFrame({'card_no': ['490 508 10L', 'ff4sff4', 'jdj DFj 34', '123POFjd33', '2334 fgg4 223', 'djdjjf 83838jd83', '123 453']})
transformer = NanMask(name='card_no')
transformer.fit(df)
transformer.transform(df)

RoundingMask

RoundingMask masks a numerical column by binning its values into N bins. The bins argument sets the number of bins into which the column's value range is divided; the default is 20.

The following example illustrates it:

from synthesized.privacy import RoundingMask
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': np.random.randint(1, 97, size=(5000,))})
transformer = RoundingMask(name='age', bins=10)
transformer.fit(df)
transformer.transform(df)
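
To make the idea of binning concrete, here is a minimal sketch of equal-width binning in plain Python. It assumes that the value range is split into bins of equal width; whether RoundingMask uses equal-width bins, quantiles or another scheme is not stated in this section, and the function name equal_width_bin_index is hypothetical.

```python
from typing import List, Sequence


def equal_width_bin_index(values: Sequence[float], bins: int) -> List[int]:
    """Return, for each value, the index (0 .. bins - 1) of its equal-width bin.

    Hypothetical helper for illustration only.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant column falls into a single bin
    width = (hi - lo) / bins
    # Clamp so that the maximum value lands in the last bin, not one past it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [19, 23, 35, 61, 78]
age_bins = equal_width_bin_index(ages, bins=3)
print(age_bins)
```

With three bins over the range 19 to 78, the values 19, 23 and 35 fall into the first bin and 61 and 78 into the last, so the exact ages are no longer recoverable from the output.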

MaskingFactory

MaskingFactory can be used to create a set of data masking transformers, as described above, to transform multiple columns of a DataFrame in the same function call. To demonstrate this, we will use the following example DataFrame:

from synthesized.privacy import MaskingFactory
from faker import Faker
import pandas as pd

fkr = Faker()
df = pd.DataFrame({'Username': [fkr.user_name() for _ in range(1000)],
                    'Name': [fkr.name() for _ in range(1000)],
                    'Password': [fkr.password() for _ in range(1000)],
                    'CreditCardNo': [fkr.credit_card_number() for _ in range(1000)],
                    'Age': [fkr.pyint(min_value=10, max_value=78) for _ in range(1000)],
                    'MonthlyIncome': [fkr.pyint(min_value=1000, max_value=10000) for _ in range(1000)]})

To create a set of transformers for this DataFrame, first create a MaskingFactory object. Then pass a config dictionary to the factory's create_masks() method to specify which masking technique to use on which columns:

masking_factory = MaskingFactory()
config = {
    "rounding": [
        {"name": "Age", "bins": "20"},
        {"name": "MonthlyIncome", "bins": "3"}
    ],
    "nan": [
        {"name": "Password"},
        {"name": "CreditCardNo"}
    ]
}
privacy_masks = masking_factory.create_masks(df, config)
privacy_masks.fit(df)
privacy_masks.transform(df, inplace=True)

The possible keys of the config dictionary are the keywords of the masking techniques, such as "rounding" and "nan" in the example above. Each value is a list of dictionaries; each dictionary names the column to be masked with the name keyword and supplies any additional arguments to pass to the corresponding mask. In the example above, the columns Age and MonthlyIncome are masked using RoundingMask, with bins set for each column independently. The columns Password and CreditCardNo are masked using NanMask, which requires no additional arguments.
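
The config layout can be read as a small dispatch table: each top-level keyword selects a masking function, and each entry supplies a column name plus keyword arguments. The sketch below shows that pattern in plain Python on a dictionary of columns. Every name in it (HANDLERS, nan_mask, rounding_mask, apply_config) is hypothetical, and it is not how MaskingFactory is implemented; it only illustrates how a config of this shape can be interpreted.

```python
from typing import Any, Callable, Dict, List

Column = List[Any]


def nan_mask(values: Column) -> Column:
    """Null out every value in the column."""
    return [None] * len(values)


def rounding_mask(values: Column, bins: Any = 20) -> Column:
    """Replace each value with the index of its equal-width bin (an assumption)."""
    n = int(bins)  # accept "3" as well as 3, as in the config above
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n
    return [min(int((v - lo) / width), n - 1) for v in values]


# Hypothetical keyword-to-function table.
HANDLERS: Dict[str, Callable[..., Column]] = {"nan": nan_mask, "rounding": rounding_mask}


def apply_config(table: Dict[str, Column],
                 config: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Column]:
    """Return a masked copy of table; columns not named in config are kept as-is."""
    masked = dict(table)
    for keyword, entries in config.items():
        handler = HANDLERS[keyword]
        for entry in entries:
            # Everything except "name" is passed through as a keyword argument.
            kwargs = {k: v for k, v in entry.items() if k != "name"}
            masked[entry["name"]] = handler(table[entry["name"]], **kwargs)
    return masked


table = {"Age": [19, 23, 35, 61, 78], "Password": ["a", "b", "c", "d", "e"]}
config = {"rounding": [{"name": "Age", "bins": "3"}], "nan": [{"name": "Password"}]}
result = apply_config(table, config)
print(result)
```

An unknown keyword raises a KeyError in this sketch, which surfaces typos in the config early instead of silently leaving a column unmasked.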