Privacy Masks

Synthesized provides a variety of masks to anonymize parts of data for privacy purposes. The privacy masks replace the most identifying fields within a data record with an artificial pseudonym.

Synthesized enables data masking through the following transformers. The examples below assume that pandas and NumPy have been imported as pd and np.

FormatPreservingTransformer

FormatPreservingTransformer applies a generic format-preserving hashing transformation: each value is replaced with a pseudonym that matches a given regex pattern.

In [1]: from synthesized.privacy import FormatPreservingTransformer

In [2]: df = pd.DataFrame({'Id': ['AAA001', 'BBB001', 'AAA002', 'AAA001', 'BBB001']})

In [3]: transformer = FormatPreservingTransformer('Id', pattern=r'[abc]-\d{3}')

In [4]: transformer.fit(df)
Out[4]: FormatPreservingTransformer(name="Id")

In [5]: transformer.transform(df)
Out[5]: 
      Id
0  c-369
1  b-144
2  a-493
3  c-369
4  b-144

IDs that are the same in the original DataFrame are also the same in the transformed DataFrame.
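The consistency property can be illustrated with a small standalone sketch. It is not the Synthesized implementation: the helper name `pseudonymize`, the use of SHA-256 to seed a random generator, and the hard-coded `[abc]-\d{3}` format are all assumptions made for illustration.

```python
import hashlib
import random
import re

# Same shape as the pattern used in the example above.
PATTERN = re.compile(r"[abc]-\d{3}")


def pseudonymize(value: str) -> str:
    """Map a value to a pseudonym of the form [abc]-ddd (hypothetical helper)."""
    # Derive the seed from the input so that equal inputs give equal outputs.
    seed = int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest(), "big")
    rng = random.Random(seed)
    letter = rng.choice("abc")
    digits = "".join(rng.choice("0123456789") for _ in range(3))
    return f"{letter}-{digits}"


ids = ["AAA001", "BBB001", "AAA002", "AAA001", "BBB001"]
masked = [pseudonymize(v) for v in ids]
```

In this sketch, repeated IDs always receive the same pseudonym, but two different IDs may collide because the target format has only 3000 possible values.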

NullTransformer

NullTransformer masks the data by nulling out a given column.

The following example illustrates this:

In [6]: import pandas as pd

In [7]: from synthesized.privacy import NullTransformer

In [8]: df = pd.DataFrame({'card_no': ['490 508 10L', 'ff4sff4', 'jdj DFj 34', '123POFjd33', '2334 fgg4 223', 'djdjjf 83838jd83', '123 453']})

In [9]: transformer = NullTransformer(name='card_no')

In [10]: transformer.fit(df)
Out[10]: NullTransformer(name="card_no")

In [11]: transformer.transform(df)
Out[11]: 
  card_no
0        
1        
2        
3        
4        
5        
6        

PartialTransformer

PartialTransformer masks out the leading part of each value in the given column (the first 75% unless specified otherwise). The masking_proportion argument sets the fraction of each value that is masked; for example, 0.8 masks the first 80%.

The following example illustrates this:

In [12]: from synthesized.privacy import PartialTransformer

In [13]: df = pd.DataFrame({'account_num': ['49050810L', 'ff4sff4', 'jdjjdjDFj34', '123POFjd33', 'djB88ndjK93', '2234dr',
   ....:                   'DER44', '2334 fgg4 223', 'djdjjf 83838jd83', 'djjdjd093k']})
   ....: 

In [14]: transformer = PartialTransformer(name='account_num', masking_proportion=0.8)

In [15]: transformer.fit(df)
Out[15]: PartialTransformer(name="account_num")

In [16]: transformer.transform(df)
Out[16]: 
        account_num
0         xxxxxxxxL
1           xxxxxx4
2       xxxxxxxxx34
3        xxxxxxxx33
4       xxxxxxxxx93
5            xxxxxr
6             xxxx4
7     xxxxxxxxxxx23
8  xxxxxxxxxxxxxd83
9        xxxxxxxx3k
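The masking rule can be sketched in a few lines of plain Python. This is a hedged illustration rather than the library's code: the function name `partial_mask` and the `mask_char` parameter are hypothetical, and rounding the masked length up is an assumption inferred from the output shown above.

```python
import math


def partial_mask(value: str, masking_proportion: float = 0.75, mask_char: str = "x") -> str:
    """Hide the leading part of a string (hypothetical helper)."""
    # Number of leading characters to hide, rounded up (assumption based on
    # the example output) and capped at the length of the value.
    n_masked = min(len(value), math.ceil(len(value) * masking_proportion))
    return mask_char * n_masked + value[n_masked:]


samples = ["49050810L", "ff4sff4", "DER44"]
masked = [partial_mask(s, masking_proportion=0.8) for s in samples]
# masked == ['xxxxxxxxL', 'xxxxxx4', 'xxxx4']
```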

RandomTransformer

RandomTransformer masks a column by replacing each value with a random string that loosely follows the format of the original values. The str_length argument sets the length of the generated strings.

Note

The generated values contain upper case, lower case, and numeric characters only if the original column values contain characters of that kind.

In [17]: from synthesized.privacy import RandomTransformer

In [18]: df = pd.DataFrame({'Id': ['49050810L', 'D44J322K', 'FK53MDK3', '9FNF43MD', 'SJ42KDK4']})

In [19]: transformer = RandomTransformer(name='Id', str_length=7)

In [20]: transformer.fit(df)
Out[20]: RandomTransformer(name="Id")

In [21]: transformer.transform(df)
Out[21]: 
        Id
0  OSZW51B
1  TRX4QHM
2  F99OGPB
3  Y2JGIC6
4  JA2VQBK

Since the ‘Id’ column values contain numeric and upper case characters, the transformed values also contain numeric and upper case characters.
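One way to obtain this kind of format consistency is to build the output alphabet from the character classes found in the column. The sketch below assumes that approach; the names `build_alphabet` and `random_mask`, the `seed` parameter, and the default length of 10 are hypothetical and are not taken from the Synthesized API.

```python
import random
import string


def build_alphabet(values):
    """Collect the character classes (upper, lower, digit) seen in the values."""
    joined = "".join(values)
    alphabet = ""
    if any(c.isupper() for c in joined):
        alphabet += string.ascii_uppercase
    if any(c.islower() for c in joined):
        alphabet += string.ascii_lowercase
    if any(c.isdigit() for c in joined):
        alphabet += string.digits
    # Fall back to letters if none of the three classes was detected.
    return alphabet or string.ascii_letters


def random_mask(values, str_length=10, seed=None):
    """Replace every value with a random string drawn from the detected alphabet."""
    rng = random.Random(seed)
    alphabet = build_alphabet(values)
    return ["".join(rng.choice(alphabet) for _ in range(str_length)) for _ in values]


ids = ["49050810L", "D44J322K", "FK53MDK3", "9FNF43MD", "SJ42KDK4"]
masked = random_mask(ids, str_length=7, seed=0)
```

Because the input contains no lower case characters, none appear in the generated strings.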

RoundingTransformer

RoundingTransformer masks a numerical column by binning its values into N bins. The n_bins argument sets the number of bins that the column's value range is divided into; the default is 20.

The following example illustrates this:

In [22]: from synthesized.privacy import RoundingTransformer

In [23]: df = pd.DataFrame({'age': np.random.randint(1, 97, size=(5000,))})

In [24]: transformer = RoundingTransformer(name='age', n_bins=10)

In [25]: transformer.fit(df)
Out[25]: RoundingTransformer(name="age")

In [26]: transformer.transform(df)
Out[26]: 
                age
0      (38.0, 47.0]
1     (0.999, 10.0]
2      (57.0, 67.0]
3      (20.0, 29.0]
4      (38.0, 47.0]
...             ...
4995   (10.0, 20.0]
4996   (77.0, 87.0]
4997   (47.0, 57.0]
4998   (29.0, 38.0]
4999   (67.0, 77.0]

[5000 rows x 1 columns]
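The intervals in this output have slightly different widths, which is consistent with equal-frequency (quantile) binning, although the source does not state which binning method is used. The following sketch shows quantile binning with the standard library only. The function name `quantile_bins` is hypothetical, and the edge handling is a simplification in which the lowest bin includes the minimum value.

```python
import bisect
import statistics


def quantile_bins(values, n_bins=20):
    """Replace each number with the (low, high) edges of its quantile bin."""
    data = sorted(values)
    # Inner cut points that split the data into n_bins equal-frequency groups.
    inner = statistics.quantiles(data, n=n_bins, method="inclusive")
    edges = [data[0]] + inner + [data[-1]]
    labels = []
    for v in values:
        # bisect_left makes each interval closed on the right: low < v <= high.
        i = bisect.bisect_left(inner, v)
        labels.append((edges[i], edges[i + 1]))
    return labels


ages = list(range(1, 101))
binned = quantile_bins(ages, n_bins=4)
```

Each original value is replaced by an interval, so the exact value can no longer be read from the masked column.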

SwappingTransformer

SwappingTransformer masks a categorical column by shuffling its categories. The boolean uniform argument determines whether the categories are distributed uniformly or whether the existing proportions of the categories in the column are maintained.

The following example illustrates this:

In [27]: from synthesized.privacy import SwappingTransformer

In [28]: df = pd.DataFrame({'wday': np.random.choice(['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun'],
   ....:                   size=100)})
   ....: 

In [29]: transformer = SwappingTransformer(name='wday', uniform=True) # for uniform=True, the weekdays will be distributed uniformly in the transformed column

In [30]: transformer.fit(df)
Out[30]: SwappingTransformer(name=wday, dtypes=None)

In [31]: transformer.transform(df)
Out[31]: 
    wday
0   tues
1    wed
2    sat
3    wed
4    fri
..   ...
95  tues
96  tues
97   sat
98  tues
99   wed

[100 rows x 1 columns]
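The difference between the two modes can be sketched as follows. This is an assumption about how such a mask could work, not a description of the library internals: the function name `swap_categories` and the `seed` parameter are hypothetical, and using a permutation to preserve proportions is only one possible choice.

```python
import random
from collections import Counter


def swap_categories(values, uniform=False, seed=None):
    """Shuffle categorical values (hypothetical helper)."""
    rng = random.Random(seed)
    if uniform:
        # Draw every entry with equal probability from the distinct categories.
        categories = sorted(set(values))
        return [rng.choice(categories) for _ in values]
    # Otherwise permute the column, which keeps the original category counts.
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


days = ["mon"] * 6 + ["tues"] * 3 + ["wed"]
kept = swap_categories(days, uniform=False, seed=1)
flat = swap_categories(days, uniform=True, seed=1)
```

With `uniform=False` the category counts in `kept` match those in `days`; with `uniform=True` each category is equally likely regardless of how often it occurred originally.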

MaskingTransformerFactory

MaskingTransformerFactory can be used to transform one or more columns of a DataFrame with the data masking transformers described above.

The following example illustrates this:

In [32]: from faker import Faker

In [33]: from synthesized.privacy import MaskingTransformerFactory

In [34]: fkr = Faker()

In [35]: df = pd.DataFrame({'Username': [fkr.user_name() for _ in range(1000)],
   ....:                     'Name': [fkr.name() for _ in range(1000)],
   ....:                     'Password': [fkr.password() for _ in range(1000)],
   ....:                     'CreditCardNo': [fkr.credit_card_number() for _ in range(1000)],
   ....:                     'Age': [fkr.pyint(min_value=10, max_value=78) for _ in range(1000)],
   ....:                     'MonthlyIncome': [fkr.pyint(min_value=1000, max_value=10000) for _ in range(1000)]})
   ....: 

In [36]: df.head()
Out[36]: 
         Username                Name  ... Age MonthlyIncome
0     parkerjames         Adrian West  ...  20          1449
1  pattersonjacob         Laura Wolfe  ...  33          6952
2    laceyjohnson  Rachael Livingston  ...  22          3515
3         megan12      Manuel Mullins  ...  23          5529
4         qlowery    Gregory Lawrence  ...  64          8544

[5 rows x 6 columns]

Next, create a config dictionary in which each key is a column name and each value is the name of the transformation to apply to that column. Arguments to a transformer can be appended to its name with the ‘|’ separator, as in 'rounding|3'.

The config dictionary is passed to the create_transformers method of the MaskingTransformerFactory object. This method returns a DataFrameTransformer, which can then be used to fit and transform the dataset.

In [37]: config = dict(Age='rounding',
   ....:               MonthlyIncome='rounding|3',
   ....:               Username='partial_masking|0.25',
   ....:               CreditCardNo='partial_masking',
   ....:               Name='random',
   ....:               Password='null')
   ....: 

In [38]: mt_factory = MaskingTransformerFactory()

In [39]: dfm_transformer = mt_factory.create_transformers(config)

In [40]: dfm_transformer.fit(df)
Out[40]: DataFrameTransformer(name="df", dtypes=None, transformers=[RoundingTransformer(name="Age"), RoundingTransformer(name="MonthlyIncome"), PartialTransformer(name="Username"), PartialTransformer(name="CreditCardNo"), RandomTransformer(name="Name"), NullTransformer(name="Password")])

In [41]: dfm_transformer.transform(df, inplace=True)
Out[41]: 
              Age       MonthlyIncome  ...        Name Password
0    (19.0, 23.0]  (1009.999, 3932.0]  ...  ytUlGbkRPK         
1    (31.0, 34.0]    (3932.0, 7124.0]  ...  nKkiaoYYwN         
2    (19.0, 23.0]  (1009.999, 3932.0]  ...  BDIqziJTER         
3    (19.0, 23.0]    (3932.0, 7124.0]  ...  CflrggImgk         
4    (61.0, 65.0]    (7124.0, 9991.0]  ...  BDzFIqNyDY         
..            ...                 ...  ...         ...      ...
995  (61.0, 65.0]    (3932.0, 7124.0]  ...  VPZBGBDLHQ         
996  (15.0, 19.0]  (1009.999, 3932.0]  ...  muJkzTMujy         
997  (68.0, 71.0]    (7124.0, 9991.0]  ...  LBBnXQdLhm         
998  (23.0, 26.0]    (3932.0, 7124.0]  ...  CNABqsrouj         
999  (68.0, 71.0]    (3932.0, 7124.0]  ...  nQjwrnONog         

[1000 rows x 6 columns]
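The 'name|argument' strings used in the config can be read as a transformation name followed by optional arguments. The sketch below shows one way to split them. The function `parse_spec` is hypothetical and is not part of the Synthesized API; how the library converts the argument strings to numbers is not covered here.

```python
def parse_spec(spec: str):
    """Split 'name|arg1|arg2' into the transformation name and its arguments."""
    name, *args = spec.split("|")
    return name.strip(), [a.strip() for a in args]


config = dict(Age="rounding",
              MonthlyIncome="rounding|3",
              Username="partial_masking|0.25")

# Map each column to its (name, arguments) pair.
parsed = {column: parse_spec(spec) for column, spec in config.items()}
# parsed["MonthlyIncome"] == ("rounding", ["3"])
```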