Privacy Masks
Synthesized provides a variety of masks to anonymize parts of data for privacy purposes. The privacy masks replace the most identifying fields within a data record with an artificial pseudonym.
Synthesized enables data masking through the following transformers:
FormatPreservingTransformer
FormatPreservingTransformer applies generic format-preserving hashing transformations for a given regex pattern.
from synthesized.privacy import FormatPreservingTransformer
import pandas as pd
df = pd.DataFrame({'Id': ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']})
transformer = FormatPreservingTransformer('Id', pattern=r'[abc]-\d{3}')
transformer.fit(df)
transformer.transform(df)
IDs that are identical in the original DataFrame are mapped to identical values in the transformed DataFrame.
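The consistency property described here can be illustrated without the library. The following is a minimal sketch in plain Python, not Synthesized's implementation: the `pseudonymize` helper and the `demo-key` secret are hypothetical, and a keyed hash (HMAC-SHA256) stands in for whatever hashing the transformer uses internally. Each character is replaced by another character of the same class, so equal inputs always yield equal outputs. Unlike true format-preserving encryption, this sketch does not guarantee that distinct inputs get distinct outputs.

```python
import hashlib
import hmac
import string


def pseudonymize(value: str, key: bytes = b"demo-key") -> str:
    """Deterministically replace each character with one of the same class.

    Hypothetical helper for illustration only; the key is a placeholder.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]  # one digest byte per character position
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)  # keep separators and other characters as they are
    return "".join(out)


ids = ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']
masked = [pseudonymize(v) for v in ids]
print(masked)
```

Because the output depends only on the input value and the key, repeated IDs stay linked after masking, which keeps joins and group-bys on the masked column meaningful.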
NullTransformer
NullTransformer masks the data by nulling out a given column.
The following example illustrates this:
from synthesized.privacy import NullTransformer
import pandas as pd
df = pd.DataFrame({'card_no': ['490 508 10L', 'ff4sff4', 'jdj DFj 34', '123POFjd33', '2334 fgg4 223', 'djdjjf 83838jd83', '123 453']})
transformer = NullTransformer(name='card_no')
transformer.fit(df)
transformer.transform(df)
PartialTransformer
PartialTransformer masks the first 75% of each value in the given column, or another proportion if one is specified.
The masking_proportion argument determines what proportion of each value is masked, expressed as a fraction (for example, 0.8 masks the first 80%).
The following example illustrates this:
from synthesized.privacy import PartialTransformer
import pandas as pd
df = pd.DataFrame({'account_num': ['49050810L', 'ff4sff4', 'jdjjdjDFj34', '123POFjd33', 'djB88ndjK93', '2234dr',
                                   'DER44', '2334 fgg4 223', 'djdjjf 83838jd83', 'djjdjd093k']})
transformer = PartialTransformer(name='account_num', masking_proportion=0.8)
transformer.fit(df)
transformer.transform(df)
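To make the effect of masking_proportion concrete, here is a minimal plain-Python sketch of prefix masking. It is not the library's code: the `mask_prefix` helper, the `x` mask character, and the choice to round the masked length up are all assumptions made for illustration.

```python
import math


def mask_prefix(value: str, masking_proportion: float = 0.75, mask_char: str = "x") -> str:
    """Replace the leading share of a string with a mask character.

    Hypothetical helper; rounding up and the mask character are assumptions.
    """
    n_masked = math.ceil(len(value) * masking_proportion)
    return mask_char * n_masked + value[n_masked:]


print(mask_prefix('49050810L', masking_proportion=0.8))  # xxxxxxxxL
print(mask_prefix('abcd'))                               # xxxd
```

The masked value keeps its original length, and only the trailing characters remain readable.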
RandomTransformer
RandomTransformer masks a column by replacing its values with random strings that loosely preserve the original format.
The str_length argument determines the length of each generated random string.
The generated values contain upper-case, lower-case, or numeric characters depending on whether the original column values contain characters of that kind.
from synthesized.privacy import RandomTransformer
import pandas as pd
df = pd.DataFrame({'Id': ['49050810L', 'D44J322K', 'FK53MDK3', '9FNF43MD', 'SJ42KDK4']})
transformer = RandomTransformer(name='Id', str_length=7)
transformer.fit(df)
transformer.transform(df)
Because the 'Id' column values contain numeric and upper-case characters, the transformed values are also made up of numeric and upper-case characters.
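The character-class behaviour can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's implementation: `build_alphabet` and `random_mask` are hypothetical helpers, and the standard `random` module is used for brevity even though it is not suitable where cryptographic unpredictability is required.

```python
import random
import string


def build_alphabet(values):
    """Collect the character classes (upper, lower, digit) present in the column."""
    joined = "".join(values)
    alphabet = ""
    if any(c in string.ascii_uppercase for c in joined):
        alphabet += string.ascii_uppercase
    if any(c in string.ascii_lowercase for c in joined):
        alphabet += string.ascii_lowercase
    if any(c in string.digits for c in joined):
        alphabet += string.digits
    return alphabet


def random_mask(values, str_length=10, seed=None):
    """Replace every value with a random string drawn from the observed classes."""
    rng = random.Random(seed)
    # Fall back to letters if the column contains none of the three classes.
    alphabet = build_alphabet(values) or string.ascii_letters
    return ["".join(rng.choice(alphabet) for _ in range(str_length)) for _ in values]


ids = ['49050810L', 'D44J322K', 'FK53MDK3', '9FNF43MD', 'SJ42KDK4']
masked_ids = random_mask(ids, str_length=7, seed=0)
print(masked_ids)
```

Unlike a hashing-based mask, this approach is not consistent: two equal input values will generally receive different random strings.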
RoundingTransformer
RoundingTransformer masks a numerical column by binning its values into N bins.
The n_bins argument determines the number of bins into which the column's value range is divided. The default value is 20.
The following example illustrates this:
from synthesized.privacy import RoundingTransformer
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': np.random.randint(1, 97, size=(5000,))})
transformer = RoundingTransformer(name='age', n_bins=10)
transformer.fit(df)
transformer.transform(df)
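As a rough picture of what binning does to the values, the sketch below uses plain Python. It assumes equal-width bins over the observed range and replaces each value with the lower edge of its bin. Both choices are assumptions: the library may pick bin edges or representative values differently, and `round_to_bins` is a hypothetical helper.

```python
def round_to_bins(values, n_bins=20):
    """Replace each number with the lower edge of its equal-width bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return list(values)  # a constant column has nothing to bin
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the final bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        binned.append(lo + idx * width)
    return binned


ages = [1, 5, 12, 48, 96]
print(round_to_bins(ages, n_bins=10))  # [1.0, 1.0, 10.5, 39.0, 86.5]
```

Fewer bins give coarser values and therefore stronger masking, at the cost of less numerical detail in the transformed column.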
SwappingTransformer
SwappingTransformer masks a categorical column by shuffling its categories.
The boolean uniform argument determines whether the categories are distributed uniformly or the existing proportions of categories in the column are maintained.
The following example shows this:
from synthesized.privacy import SwappingTransformer
import pandas as pd
import numpy as np
df = pd.DataFrame({'wday': np.random.choice(['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun'],
                                            size=100)})
# With uniform=True, the weekdays are distributed uniformly in the transformed column.
transformer = SwappingTransformer(name='wday', uniform=True)
transformer.fit(df)
transformer.transform(df)
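The difference between the two settings of uniform can be sketched in plain Python. This is one possible reading, not the library's implementation: the hypothetical `swap_categories` helper draws every category with equal probability when uniform is true, and otherwise permutes the existing values so that the category counts stay exactly as they were.

```python
import random
from collections import Counter


def swap_categories(values, uniform=False, seed=None):
    """Shuffle a categorical column (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    if uniform:
        # Draw each output independently, every category equally likely.
        categories = sorted(set(values))
        return [rng.choice(categories) for _ in values]
    # Otherwise permute the column, which preserves the category counts.
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


days = ['mon'] * 6 + ['tues'] * 3 + ['wed'] * 1
kept = swap_categories(days, uniform=False, seed=1)
flat = swap_categories(days, uniform=True, seed=1)
print(Counter(kept))
print(Counter(flat))
```

In both modes the link between a row and its original category is broken, but only the non-uniform mode keeps the column's distribution intact for downstream analysis.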
MaskingTransformerFactory
MaskingTransformerFactory can be used to apply the above data masking transformers to one or more columns of a DataFrame.
The following example illustrates this:
from synthesized.privacy import MaskingTransformerFactory
from faker import Faker
import pandas as pd
fkr = Faker()
df = pd.DataFrame({'Username': [fkr.user_name() for _ in range(1000)],
                   'Name': [fkr.name() for _ in range(1000)],
                   'Password': [fkr.password() for _ in range(1000)],
                   'CreditCardNo': [fkr.credit_card_number() for _ in range(1000)],
                   'Age': [fkr.pyint(min_value=10, max_value=78) for _ in range(1000)],
                   'MonthlyIncome': [fkr.pyint(min_value=1000, max_value=10000) for _ in range(1000)]})
Next, create a config dictionary whose keys are column names and whose values are the names of the transformations to apply to those columns. An argument for a transformer can be appended to its name after a '|' separator.
The config dictionary is passed to the create_transformers method of the MaskingTransformerFactory object. This method returns a DataFrameTransformer, which can then be used to fit and transform the dataset.
config = dict(
    Age='rounding',
    MonthlyIncome='rounding|3',
    Username='partial_masking|0.25',
    CreditCardNo='partial_masking',
    Name='random',
    Password='null'
)
mt_factory = MaskingTransformerFactory()
dfm_transformer = mt_factory.create_transformers(config)
dfm_transformer.fit(df)
dfm_transformer.transform(df, inplace=True)
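For readers unfamiliar with the 'name|argument' convention used in the config values, the sketch below shows how such strings could be split into a transformation name and an optional numeric argument. It is a plain-Python illustration only: `parse_masking_spec` is a hypothetical helper, and the factory's actual parsing logic may differ.

```python
def parse_masking_spec(spec: str):
    """Split 'name|arg' into (name, arg); arg is None when no '|' part is given."""
    name, _, arg = spec.partition("|")
    return name, (float(arg) if arg else None)


config = dict(
    Age='rounding',
    MonthlyIncome='rounding|3',
    Username='partial_masking|0.25',
    Password='null',
)
for column, spec in config.items():
    print(column, parse_masking_spec(spec))
```

Entries without a '|' part, such as 'rounding' or 'null', carry no argument, so the corresponding transformer would fall back to its default settings.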