Privacy Masks
Synthesized provides a variety of masks to anonymize parts of data for privacy purposes. The privacy masks replace the most identifying fields within a data record with an artificial pseudonym.
Synthesized enables data masking through the following transformers:
FormatPreservingMask
FormatPreservingMask can apply generic format-preserving hashing transformations for a given regex pattern.
from synthesized.privacy import FormatPreservingMask
import pandas as pd
df = pd.DataFrame({'Id': ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']})
transformer = FormatPreservingMask('Id', pattern=r'[abc]-\d{3}')
transformer.fit(df)
transformer.transform(df)
IDs that are the same in the original DataFrame are also the same in the transformed DataFrame.
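To make the determinism property concrete, here is a minimal standard-library sketch of one way a format-preserving hash could behave. It is not Synthesized's implementation: the function name `mask_preserving_format`, the `secret` parameter, and the character-class mapping are all assumptions made for illustration. Each letter or digit is replaced with another of the same class, chosen from a hash of the whole value, so equal inputs always give equal outputs.

```python
import hashlib
import string


def mask_preserving_format(value: str, secret: str = "demo-secret") -> str:
    """Hypothetical helper: swap each character for a hash-derived one of the same class."""
    # Hash the whole value so the result depends only on (secret, value).
    digest = hashlib.sha256((secret + value).encode("utf-8")).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            # Separators and other symbols are kept so the layout is unchanged.
            out.append(ch)
    return "".join(out)


ids = ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']
masked = [mask_preserving_format(v) for v in ids]
```

In this sketch the first and fourth entries of `masked` are identical, as are the second and fifth, and every masked ID still consists of three uppercase letters followed by three digits.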
NanMask
NanMask masks data by nulling out every value in a given column.
The following example illustrates this:
from synthesized.privacy import NanMask
import pandas as pd
df = pd.DataFrame({'card_no': ['490 508 10L', 'ff4sff4', 'jdj DFj 34', '123POFjd33', '2334 fgg4 223', 'djdjjf 83838jd83', '123 453']})
transformer = NanMask(name='card_no')
transformer.fit(df)
transformer.transform(df)
RoundingMask
RoundingMask masks a numerical column by binning its values into N bins.
The n_bins argument sets the number of bins used to cover the column's value range; the default is 20.
The following example illustrates this:
from synthesized.privacy import RoundingMask
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': np.random.randint(1, 97, size=(5000,))})
transformer = RoundingMask(name='age', n_bins=10)
transformer.fit(df)
transformer.transform(df)
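The idea of binning can be sketched in plain Python. The text above does not say whether RoundingMask uses equal-width or quantile bins, so this sketch assumes equal-width bins; the helper name `bin_values` and its return format (a `(lower, upper)` tuple per value) are illustrative and not part of the library.

```python
def bin_values(values, n_bins=20):
    """Hypothetical helper: replace each number with the bounds of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so there is only one bin.
        return [(lo, hi)] * len(values)
    masked = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        masked.append((lo + idx * width, lo + (idx + 1) * width))
    return masked


ages = [1, 12, 25, 37, 48, 60, 73, 96]
binned = bin_values(ages, n_bins=10)
```

With `n_bins=10` the range 1 to 96 is split into bins of width 9.5, so an exact age such as 12 is reported only as falling between 10.5 and 20.0. Fewer bins give coarser values and stronger masking.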
MaskingFactory
MaskingFactory can be used to create a set of the data masking transformers described above, so that multiple columns of a DataFrame are transformed in the same function call. To demonstrate this, we will use the following example DataFrame:
from synthesized.privacy import MaskingFactory
from faker import Faker
import pandas as pd
fkr = Faker()
df = pd.DataFrame({'Username': [fkr.user_name() for _ in range(1000)],
                   'Name': [fkr.name() for _ in range(1000)],
                   'Password': [fkr.password() for _ in range(1000)],
                   'CreditCardNo': [fkr.credit_card_number() for _ in range(1000)],
                   'Age': [fkr.pyint(min_value=10, max_value=78) for _ in range(1000)],
                   'MonthlyIncome': [fkr.pyint(min_value=1000, max_value=10000) for _ in range(1000)]})
To create a set of transformers that act on this DataFrame, first create a MaskingFactory object. Then pass a config dictionary to the factory's create_masks() method to specify which masking technique to use on which columns of the DataFrame:
masking_factory = MaskingFactory()
config = {
    "rounding": [
        {"name": "Age",
         "nbins": "20"},
        {"name": "MonthlyIncome",
         "nbins": "3"}
    ],
    "nan": [
        {"name": "Password"},
        {"name": "CreditCardNo"}
    ]
}
privacy_masks = masking_factory.create_masks(df, config)
privacy_masks.fit(df)
privacy_masks.transform(df, inplace=True)
The keys of the config dictionary are the keywords for the masking techniques, such as "rounding" and "nan" in the example above. Each value is a list of dictionaries; each dictionary names the column to be masked under the name key and carries any additional arguments to be passed to the corresponding mask. In the example above, the columns Age and MonthlyIncome are masked using the RoundingMask, with nbins set independently for each column. The columns Password and CreditCardNo are masked using the NanMask, which requires no additional arguments.
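The config structure described above can be read as a simple dispatch table. The following sketch shows that reading in plain Python on a dict of columns; it makes no claim about how MaskingFactory works internally. The names `MASKS`, `apply_masks`, `nan_mask`, and `rounding_mask` are hypothetical, and the rounding step assumes equal-width bins.

```python
def nan_mask(column, **_):
    """Null out every value in the column."""
    return [None] * len(column)


def rounding_mask(column, nbins=20, **_):
    """Replace each value with the lower edge of its equal-width bin (an assumption)."""
    nbins = int(nbins)  # the config may carry numbers as strings, e.g. "20"
    lo, hi = min(column), max(column)
    width = (hi - lo) / nbins
    if width == 0:
        return list(column)
    return [lo + min(int((v - lo) / width), nbins - 1) * width for v in column]


# Hypothetical registry: config keyword -> masking function.
MASKS = {"nan": nan_mask, "rounding": rounding_mask}


def apply_masks(table, config):
    """Apply each configured mask to its named column; the input table is not modified."""
    masked = dict(table)
    for keyword, entries in config.items():
        mask = MASKS[keyword]
        for entry in entries:
            # Everything except "name" is forwarded to the mask as an argument.
            kwargs = {k: v for k, v in entry.items() if k != "name"}
            masked[entry["name"]] = mask(table[entry["name"]], **kwargs)
    return masked


table = {"Age": [10, 30, 50, 78], "Password": ["a", "b", "c", "d"]}
config = {"rounding": [{"name": "Age", "nbins": "2"}],
          "nan": [{"name": "Password"}]}
result = apply_masks(table, config)
```

Here `Age` is reduced to two bin edges and `Password` is nulled out, while columns not mentioned in the config would pass through unchanged.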