Privacy Masks

Synthesized provides a variety of masks that anonymize parts of a dataset for privacy purposes. The privacy masks obscure the most identifying fields within a data record, for example by replacing them with an artificial pseudonym.

Synthesized enables data masking through the following transformers:

NullTransformer

NullTransformer masks the data by nulling out a given column.

The following example illustrates this:

In [1]: import pandas as pd

In [2]: from synthesized.privacy import NullTransformer

In [3]: df = pd.DataFrame({'card_no': ['490 508 10L', 'ff4sff4', 'jdj DFj 34', '123POFjd33', '2334 fgg4 223', 'djdjjf 83838jd83', '123 453']})

In [4]: transformer = NullTransformer(name='card_no')

In [5]: transformer.fit(df)
Out[5]: NullTransformer(name="card_no")

In [6]: df_transformed = transformer.transform(df.copy())

In [7]: df_transformed.head()
Out[7]: 
  card_no
0        
1        
2        
3        
4        

PartialTransformer

PartialTransformer masks the leading part of each value in the given column, by default the first 75%. The masking_proportion argument sets the fraction of each value to mask, given as a number between 0 and 1.

The following example masks the first 80% of each value:

In [8]: from synthesized.privacy import PartialTransformer

In [9]: df = pd.DataFrame({'account_num': ['49050810L', 'ff4sff4', 'jdjjdjDFj34', '123POFjd33', 'djB88ndjK93', '2234dr',
   ...:                   'DER44', '2334 fgg4 223', 'djdjjf 83838jd83', 'djjdjd093k']})
   ...: 

In [10]: transformer = PartialTransformer(name='account_num', masking_proportion=0.8)

In [11]: transformer.fit(df)
Out[11]: PartialTransformer(name="account_num")

In [12]: df_transformed = transformer.transform(df.copy())

In [13]: df_transformed.head()
Out[13]: 
   account_num
0    xxxxxxxxL
1      xxxxxx4
2  xxxxxxxxx34
3   xxxxxxxx33
4  xxxxxxxxx93
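
As a rough illustration of the masking rule, the sketch below reproduces the output above using only pandas and the standard library. It is not the PartialTransformer implementation: the helper name mask_prefix is hypothetical, and rounding the number of masked characters up is an assumption inferred from the sample output.

```python
import math

import pandas as pd


def mask_prefix(value: str, masking_proportion: float = 0.75) -> str:
    """Replace the leading share of `value` with 'x' (hypothetical helper)."""
    # Assumption: the number of masked characters is rounded up.
    n_masked = math.ceil(len(value) * masking_proportion)
    return "x" * n_masked + value[n_masked:]


df = pd.DataFrame({"account_num": ["49050810L", "ff4sff4", "jdjjdjDFj34", "123POFjd33"]})
masked = df["account_num"].map(lambda v: mask_prefix(v, 0.8))
print(masked.tolist())
```

With a proportion of 0.8, a 9-character value keeps only its last character, which matches the first row of the output above.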

RandomTransformer

RandomTransformer masks a column by replacing each value with a random string that loosely preserves the format of the original values. The str_length argument sets the length of the generated strings.

Note

The generated values contain upper-case letters, lower-case letters, and digits only if the original column values contain characters of the corresponding class.

In [14]: from synthesized.privacy import RandomTransformer

In [15]: df = pd.DataFrame({'Id': ['49050810L', 'D44J322K', 'FK53MDK3', '9FNF43MD', 'SJ42KDK4']})

In [16]: transformer = RandomTransformer(name='Id', str_length=7)

In [17]: transformer.fit(df)
Out[17]: RandomTransformer(name="Id")

In [18]: df_transformed = transformer.transform(df.copy())

In [19]: df_transformed.head()
Out[19]: 
        Id
0  FGR1YIS
1  U3MZOX2
2  ZNHXZJT
3  RCUJ2ZR
4  WDS0L2F

Because the 'Id' column values contain digits and upper-case letters but no lower-case letters, the transformed values also consist of digits and upper-case letters.
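
The format-consistency rule can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the RandomTransformer implementation: random_like is a hypothetical helper, and the fallback alphabet used when no character class is detected is an arbitrary choice.

```python
import random
import string


def random_like(values, str_length, seed=None):
    """Return one random string per input value, drawn only from the
    character classes seen in `values` (hypothetical helper)."""
    rng = random.Random(seed)
    joined = "".join(values)
    alphabet = ""
    if any(c.isupper() for c in joined):
        alphabet += string.ascii_uppercase
    if any(c.islower() for c in joined):
        alphabet += string.ascii_lowercase
    if any(c.isdigit() for c in joined):
        alphabet += string.digits
    # Assumption: fall back to upper-case letters when no class is detected.
    alphabet = alphabet or string.ascii_uppercase
    return ["".join(rng.choice(alphabet) for _ in range(str_length)) for _ in values]


ids = ["49050810L", "D44J322K", "FK53MDK3", "9FNF43MD", "SJ42KDK4"]
masked_ids = random_like(ids, str_length=7, seed=0)
print(masked_ids)
```

Because the input contains no lower-case letters, the generated strings are drawn from upper-case letters and digits only.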

RoundingTransformer

RoundingTransformer masks a numerical column by binning its values into N bins. The n_bins argument sets the number of bins into which the column's value range is divided; the default is 20.

The following example illustrates this:

In [20]: import numpy as np
   ....: from synthesized.privacy import RoundingTransformer
   ....: 

In [21]: df = pd.DataFrame({'age': np.random.randint(1, 97, size=(5000,))})

In [22]: transformer = RoundingTransformer(name='age', n_bins=10)

In [23]: transformer.fit(df)
Out[23]: RoundingTransformer(name="age")

In [24]: df_transformed = transformer.transform(df.copy())

In [25]: df_transformed.head()
Out[25]: 
            age
0   (9.0, 19.0]
1  (77.0, 87.0]
2  (0.999, 9.0]
3  (87.0, 96.0]
4  (67.0, 77.0]
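
The interval labels above resemble those produced by quantile binning in pandas. The sketch below produces comparable output with pandas.qcut; whether RoundingTransformer uses quantile bins internally is an assumption and is not confirmed by the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(1, 97, size=5000), name="age")

# Assumption: equal-frequency (quantile) bins. duplicates="drop" merges
# bins whose edges coincide instead of raising an error.
binned = pd.qcut(ages, q=10, duplicates="drop")
print(binned.head())
```

Each value is replaced by the interval it falls into, so exact ages are no longer recoverable from the masked column.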

SwappingTransformer

SwappingTransformer masks a categorical column by shuffling its categories. The Boolean uniform argument determines whether the categories are distributed uniformly in the transformed column or whether the existing category proportions are preserved.

The following example illustrates this:

In [26]: from synthesized.privacy import SwappingTransformer

In [27]: df = pd.DataFrame({'wday': np.random.choice(['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun'],
   ....:                   size=100)})
   ....: 

In [28]: transformer = SwappingTransformer(name='wday', uniform=True) # for uniform=True, the weekdays will be distributed uniformly in the transformed column

In [29]: transformer.fit(df)
Out[29]: SwappingTransformer(name=wday, dtypes=None)

In [30]: df_transformed = transformer.transform(df.copy())

In [31]: df_transformed.head()
Out[31]: 
   wday
0   mon
1   wed
2   fri
3   fri
4  thur
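
The difference between the two settings of uniform can be sketched with pandas and NumPy alone. This is an illustration of the two sampling strategies described above, not the SwappingTransformer implementation; in particular, using a permutation for the proportion-preserving case is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = ["mon", "tues", "wed", "thur", "fri", "sat", "sun"]
wday = pd.Series(rng.choice(days, size=100), name="wday")

# Proportion-preserving variant: permute the existing values, so every
# category keeps its original count.
shuffled = wday.sample(frac=1, random_state=0).reset_index(drop=True)

# Uniform variant: draw each value with equal probability from the
# categories observed in the column.
categories = sorted(set(wday))
uniform = pd.Series(rng.choice(categories, size=len(wday)), name="wday")

print(shuffled.value_counts())
print(uniform.value_counts())
```

In both variants the link between a row and its original category is broken; they differ only in the category frequencies of the result.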

MaskingTransformerFactory

MaskingTransformerFactory applies the data masking transformers described above to one or more columns of a DataFrame.

The following example illustrates this:

In [32]: from faker import Faker

In [33]: from synthesized.privacy import MaskingTransformerFactory

In [34]: fkr = Faker()

In [35]: df = pd.DataFrame({'Username': [fkr.user_name() for _ in range(1000)],
   ....:                     'Name': [fkr.name() for _ in range(1000)],
   ....:                     'Password': [fkr.password() for _ in range(1000)],
   ....:                     'CreditCardNo': [fkr.credit_card_number() for _ in range(1000)],
   ....:                     'Age': [fkr.pyint(min_value=10, max_value=78) for _ in range(1000)],
   ....:                     'MonthlyIncome': [fkr.pyint(min_value=1000, max_value=10000) for _ in range(1000)]})
   ....: 

In [36]: df.head()
Out[36]: 
      Username             Name  ... Age MonthlyIncome
0     austin53     Gary Simpson  ...  71          6865
1     philip46      Chase Burke  ...  74          7147
2  colemangary       Alicia Lee  ...  11          3294
3     morgan52  Yolanda Carlson  ...  63          8490
4      maria46   Angela Johnson  ...  64          6621

[5 rows x 6 columns]

Next, create a config dictionary in which each key is the name of a column to transform and each value is the name of the transformation to apply. Arguments to the transformer can be appended after a '|' separator, as in 'rounding|3'.

The config dictionary is passed to the create_transformers method of the MaskingTransformerFactory object. This method returns a DataFrameTransformer, which can then be used to fit and transform the dataset.

In [37]: config = dict(Age='rounding',
   ....:               MonthlyIncome='rounding|3',
   ....:               Username='partial_masking|0.25',
   ....:               CreditCardNo='partial_masking',
   ....:               Name='random',
   ....:               Password='null')
   ....: 

In [38]: mt_factory = MaskingTransformerFactory()

In [39]: dfm_transformer = mt_factory.create_transformers(config)

In [40]: dfm_transformer.fit(df)
Out[40]: DataFrameTransformer(name="df", dtypes=None, transformers=[RoundingTransformer(name="Age"), RoundingTransformer(name="MonthlyIncome"), PartialTransformer(name="Username"), PartialTransformer(name="CreditCardNo"), RandomTransformer(name="Name"), NullTransformer(name="Password")])

In [41]: masked_df = dfm_transformer.transform(df)

In [42]: masked_df.head()
Out[42]: 
             Age       MonthlyIncome  ...        Name Password
0   (67.0, 72.0]    (4181.0, 7087.0]  ...  FXqHLKKVjr         
1   (72.0, 75.0]    (7087.0, 9990.0]  ...  GBTgEDhXww         
2  (9.999, 13.0]  (1019.999, 4181.0]  ...  rdlAgtIUZd         
3   (60.0, 64.0]    (7087.0, 9990.0]  ...  WVCauEJYCl         
4   (60.0, 64.0]    (4181.0, 7087.0]  ...  ANDqajDjCd         

[5 rows x 6 columns]
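
The '|' convention in the config values can be read as a transformer name followed by optional arguments. The sketch below shows one way such a string could be split; parse_spec is a hypothetical helper for illustration only and is not part of the synthesized API.

```python
def parse_spec(spec: str):
    """Split 'name|arg1|arg2' into the name and a list of string arguments
    (hypothetical helper, not part of the synthesized API)."""
    name, *args = spec.split("|")
    return name, args


config = dict(Age="rounding", MonthlyIncome="rounding|3", Username="partial_masking|0.25")
parsed = {column: parse_spec(spec) for column, spec in config.items()}
print(parsed)
```

A value without a separator, such as 'rounding', yields an empty argument list, which corresponds to using the transformer's defaults.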