Masking

The SDK provides a variety of masks that can be used to anonymize data. This tutorial walks through an example of how to use the masks, then lists the masks available and their options. All the masks are available for both Spark and pandas, and the example shows both.

All SDK masks are classes, and their usage is straightforward. Once a mask is initialised, masking is applied to the data by calling transform(), which returns a new dataset with the specified column masked. Some of the masks are also invertible; for these, the method inverse_transform() performs the inverse transformation on the specified column. The usage with Spark and pandas dataframes is identical.
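The transform() and inverse_transform() contract described above can be illustrated with a toy class. This is a minimal sketch in plain Python and not the SDK's implementation: the class name ToyOffsetMask, its offset parameter and the list-of-dicts dataset are hypothetical, chosen only to show that transform() returns a new dataset and that inverse_transform() undoes it.

```python
class ToyOffsetMask:
    """Hypothetical invertible mask: adds a fixed offset to a numeric column."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows, col):
        # Build new row dicts so the input dataset is left untouched.
        return [{**row, col: row[col] + self.offset} for row in rows]

    def inverse_transform(self, rows, col):
        # Undo transform() by subtracting the same offset.
        return [{**row, col: row[col] - self.offset} for row in rows]


data = [{"age": 19, "charges": 16884}, {"age": 18, "charges": 1725}]
mask = ToyOffsetMask(offset=100)
masked = mask.transform(data, col="charges")
restored = mask.inverse_transform(masked, col="charges")
```

Here `data` is unchanged after the call, `masked` holds the shifted values, and `restored` equals the original rows.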

Basic Usage

Dataset

The example dataset is 'claim_prediction', a dataset of insurance claims with attributes of the claimants, available in the synthesized_datasets package. The mask will be used to anonymize the 'charges' column of the dataset.

import synthesized_datasets as sd
data = sd.ALL.claim_prediction.load()
data.head()
....
   age  sex     bmi  children  smoker  region      charges  insuranceclaim
0   19    0  27.900         0       1       3  16884.92400               1
1   18    1  33.770         1       0       2   1725.55230               1
2   28    1  33.000         3       0       2   4449.46200               0
3   33    1  22.705         0       0       1  21984.47061               0
4   32    1  28.880         0       0       1   3866.85520               1
....

Masking Usage

As an example, let’s use the BucketingMask which buckets numerical data in order to de-identify columns where the precise value of the numerical data could be personally identifiable.

The BucketingMask can bucket the data into buckets of equal size within a specified range. To use this mode of operation, users should pass the arguments bucket_size, lower_bound and upper_bound when initialising the mask.
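To make the equal-size mode concrete, the following sketch computes a bucket label for a single value in plain Python. It is an illustration under assumptions, not the SDK's code: the function name bucket_label is hypothetical, the intervals are assumed to be half-open (lower edge included), and the label used for values at or above upper_bound is a guess.

```python
import math


def bucket_label(value, bucket_size, lower_bound, upper_bound):
    """Return a label for the equal-sized bucket that contains `value`."""
    if value < lower_bound:
        return f"<{float(lower_bound)}"
    if value >= upper_bound:
        # Assumption: the SDK's label for this case is not documented here.
        return f">={float(upper_bound)}"
    # Index of the bucket counted from lower_bound.
    index = math.floor((value - lower_bound) / bucket_size)
    start = lower_bound + index * bucket_size
    return f"{float(start)}:{float(start + bucket_size)}"


labels = [
    bucket_label(v, bucket_size=1000, lower_bound=5000, upper_bound=50000)
    for v in (16884.924, 1725.5523, 21984.47061)
]
```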

The same example is shown below, first in pandas and then in Spark.
from synthesized3.mask import BucketingMask

# Create a mask that will bucket the data into buckets of size 1000 between 5000 and 50000
mask_equal_buckets = BucketingMask(bucket_size=1000, lower_bound=5000, upper_bound=50000)

# The masked data is returned by calling `transform()`. The original data is not modified.
mask_equal_buckets.transform(data, col='charges')
....
      age  sex     bmi  children  smoker  region          charges  insuranceclaim
0      19    0  27.900         0       1       3  16000.0:17000.0               1
1      18    1  33.770         1       0       2          <5000.0               1
2      28    1  33.000         3       0       2          <5000.0               0
3      33    1  22.705         0       0       1  21000.0:22000.0               0
4      32    1  28.880         0       0       1          <5000.0               1
...   ...  ...     ...       ...     ...     ...              ...             ...
....

For the Spark example, we will convert the pandas dataframe to a Spark dataframe, working with a local Spark session.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[4]") \
    .appName("synthesized") \
    .getOrCreate()

# Convert the pandas dataframe to a spark dataframe:
data = spark.createDataFrame(data)
data.show(5)
....
+---+---+------+--------+------+------+-----------+--------------+
|age|sex|   bmi|children|smoker|region|    charges|insuranceclaim|
+---+---+------+--------+------+------+-----------+--------------+
| 19|  0|  27.9|       0|     1|     3|  16884.924|             1|
| 18|  1| 33.77|       1|     0|     2|  1725.5523|             1|
| 28|  1|  33.0|       3|     0|     2|   4449.462|             0|
| 33|  1|22.705|       0|     0|     1|21984.47061|             0|
| 32|  1| 28.88|       0|     0|     1|  3866.8552|             1|
+---+---+------+--------+------+------+-----------+--------------+
only showing top 5 rows
....

# Create a mask that will bucket the data into buckets of size 1000 between 5000 and 50000
mask_equal_buckets = BucketingMask(bucket_size=1000, lower_bound=5000, upper_bound=50000)

# The masked data is returned by calling `transform()`. The original data is not modified.
masked_data = mask_equal_buckets.transform(data, col='charges')
masked_data.show(5)
....
+---+---+------+--------+------+------+---------------+--------------+
|age|sex|   bmi|children|smoker|region|        charges|insuranceclaim|
+---+---+------+--------+------+------+---------------+--------------+
| 19|  0|  27.9|       0|     1|     3|16000.0:17000.0|             1|
| 18|  1| 33.77|       1|     0|     2|        <5000.0|             1|
| 28|  1|  33.0|       3|     0|     2|        <5000.0|             0|
| 33|  1|22.705|       0|     0|     1|21000.0:22000.0|             0|
| 32|  1| 28.88|       0|     0|     1|        <5000.0|             1|
+---+---+------+--------+------+------+---------------+--------------+
only showing top 5 rows
....

Available Masks

The following masks are available in the SDK:

Bucketing

Masks exact numerical values in a column by bucketing numerical data into buckets of equal size or user-specified buckets.

Details

This can be used in two modes:

  1. Bucketing into equal-sized buckets: In this mode, users should pass the arguments bucket_size, lower_bound and upper_bound when initialising the mask. The data will be bucketed into buckets of size bucket_size between lower_bound and upper_bound.

  2. User-specified bucketing: In this mode, users can define specific bucket ranges and names for the buckets via the bucket_config argument. The bucket_config is a list of dictionaries. Each dictionary specifies a bucket, and must have the following keys:

    • 'min': The minimum value of the bucket

    • 'max': The maximum value of the bucket

    • 'replacement_value': The value to replace values in the range with.

All values in the column being masked must fall into one of the buckets specified in the bucket_config. Examples for both modes of operation are shown below.
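Because each bucket_config entry must carry the three keys listed above, it can be useful to check a configuration before building a mask. The helper below is a hypothetical sketch in plain Python (validate_bucket_config is not an SDK function), and the extra check that 'min' is smaller than 'max' is an assumption about what a sensible bucket looks like.

```python
REQUIRED_KEYS = {"min", "max", "replacement_value"}


def validate_bucket_config(bucket_config):
    """Raise ValueError if a bucket lacks a required key or has min >= max."""
    for i, bucket in enumerate(bucket_config):
        missing = REQUIRED_KEYS - bucket.keys()
        if missing:
            raise ValueError(f"bucket {i} is missing keys: {sorted(missing)}")
        if bucket["min"] >= bucket["max"]:
            # Assumption: an empty or inverted range is treated as an error.
            raise ValueError(f"bucket {i} has min >= max")
    return True


bucket_config = [
    {"min": 2000, "max": 10000, "replacement_value": "Low"},
    {"min": 10000, "max": 30000, "replacement_value": "Medium"},
    {"min": 30000, "max": 80000, "replacement_value": "High"},
]
is_valid = validate_bucket_config(bucket_config)
```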

Example:

from synthesized3.mask import BucketingMask
import synthesized_datasets as sd

data = sd.ALL.claim_prediction.load()
data.head()
....
   age  sex     bmi  children  smoker  region      charges  insuranceclaim
0   19    0  27.900         0       1       3  16884.92400               1
1   18    1  33.770         1       0       2   1725.55230               1
2   28    1  33.000         3       0       2   4449.46200               0
3   33    1  22.705         0       0       1  21984.47061               0
4   32    1  28.880         0       0       1   3866.85520               1
....

# Create a mask that will bucket the data into buckets of size 1000 between 5000 and 50000
mask_equal_buckets = BucketingMask(bucket_size=1000, lower_bound=5000, upper_bound=50000)

# The masked data is returned by calling `transform()`. The original data is not modified.
mask_equal_buckets.transform(data, col='charges')
....
      age  sex     bmi  children  smoker  region          charges  insuranceclaim
0      19    0  27.900         0       1       3  16000.0:17000.0               1
1      18    1  33.770         1       0       2          <5000.0               1
2      28    1  33.000         3       0       2          <5000.0               0
3      33    1  22.705         0       0       1  21000.0:22000.0               0
4      32    1  28.880         0       0       1          <5000.0               1
...   ...  ...     ...       ...     ...     ...              ...             ...
....

Example:

bucket_config = [
    {"min": 2000, "max": 10000, "replacement_value": "Low"},
    {"min": 10000, "max": 30000, "replacement_value": "Medium"},
    {"min": 30000, "max": 80000, "replacement_value": "High"},
]
mask_config_buckets = BucketingMask(bucket_config=bucket_config)

mask_config_buckets.transform(data, col='charges')
....
      age  sex     bmi  children  smoker  region charges  insuranceclaim
0      19    0  27.900         0       1       3  Medium               1
1      18    1  33.770         1       0       2     Low               1
2      28    1  33.000         3       0       2     Low               0
3      33    1  22.705         0       0       1  Medium               0
4      32    1  28.880         0       0       1     Low               1
...   ...  ...     ...       ...     ...     ...     ...             ...
....

DateShift

Masks exact date values by shifting dates by a random amount within a specified range. This has 3 modes of operation:

  • Shift each date in the column independently by a random number of days. This is the default mode of operation.

  • Shift all dates together by a random amount to maintain intervals between all dates in the column. Toggled by setting maintain_diff=True.

  • Shift all dates within a group by a random amount, where the group is defined by an 'entity column' of the dataset. Toggled by setting the entity_column argument.
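The three modes above differ only in how many random offsets are drawn. The sketch below shows this with the standard library; it is an illustration under assumptions rather than the SDK's implementation. The function shift_dates, its entities and seed parameters, and the use of inclusive bounds are all hypothetical.

```python
import random
from datetime import date, timedelta


def shift_dates(dates, lower_bound_days, upper_bound_days,
                maintain_diff=False, entities=None, seed=None):
    """Shift dates by random offsets drawn from [lower, upper] days."""
    rng = random.Random(seed)

    def draw():
        return timedelta(days=rng.randint(lower_bound_days, upper_bound_days))

    if entities is not None:
        # Mode 3: one offset per entity, shared by all dates of that entity.
        offsets = {}
        shifted = []
        for d, entity in zip(dates, entities):
            if entity not in offsets:
                offsets[entity] = draw()
            shifted.append(d + offsets[entity])
        return shifted
    if maintain_diff:
        # Mode 2: a single offset for the whole column.
        offset = draw()
        return [d + offset for d in dates]
    # Mode 1: an independent offset for every date.
    return [d + draw() for d in dates]


dates = [date(2013, 2, 8), date(2013, 2, 11), date(2018, 2, 1), date(2018, 2, 2)]
names = ["AAL", "AAL", "ZTS", "ZTS"]
independent = shift_dates(dates, -30, 30, seed=0)
together = shift_dates(dates, -30, 30, maintain_diff=True, seed=0)
per_entity = shift_dates(dates, -30, 30, entities=names, seed=0)
```

In `together` every interval between dates is preserved; in `per_entity` intervals are preserved only within each name.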

Details

Example:

Load example data:

import pandas as pd
from synthesized3.mask import DateShiftMask
import synthesized_datasets as sd

data = sd.ALL.s_and_p_500_5yr.load()
data.head()
....
              date   open   high    low  close    volume Name
0       2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1       2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2       2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3       2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4       2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
...            ...    ...    ...    ...    ...       ...  ...
618096  2018-02-01  76.84  78.27  76.69  77.82   2982259  ZTS
618097  2018-02-02  77.53  78.12  76.73  76.78   2595187  ZTS
618098  2018-02-05  76.64  76.92  73.18  73.83   2962031  ZTS
618099  2018-02-06  72.74  74.56  72.13  73.27   4924323  ZTS
618100  2018-02-07  72.70  75.00  72.69  73.86   4534912  ZTS
....

# Ensure that the date column is of type 'datetime'
data['date'] = pd.to_datetime(data['date'])

Using mode 1: Randomly shift the date by a random number of days for all dates in the column.

dateshiftmask = DateShiftMask(lower_bound_days=-30, upper_bound_days=30)
dateshiftmask.transform(data, col='date').head()
....
        date   open   high    low  close    volume Name
0 2013-02-21  15.07  15.12  14.63  14.75   8407500  AAL
1 2013-01-30  14.89  15.01  14.26  14.46   8882000  AAL
2 2013-02-16  14.45  14.51  14.10  14.27   8126000  AAL
3 2013-01-16  14.30  14.94  14.25  14.66  10259500  AAL
4 2013-01-20  14.94  14.96  13.16  13.99  31879900  AAL
....

Using mode 2: Shift all dates together by a random amount to maintain intervals between all dates in the column.

dateshiftmask = DateShiftMask(lower_bound_days=-30, upper_bound_days=30, maintain_diff=True)
dateshiftmask.transform(data, col='date').head()
....
        date   open   high    low  close    volume Name
0 2013-01-26  15.07  15.12  14.63  14.75   8407500  AAL
1 2013-01-29  14.89  15.01  14.26  14.46   8882000  AAL
2 2013-01-30  14.45  14.51  14.10  14.27   8126000  AAL
3 2013-01-31  14.30  14.94  14.25  14.66  10259500  AAL
4 2013-02-01  14.94  14.96  13.16  13.99  31879900  AAL
....

Using mode 3: Shift all dates within a group by a random amount where the group is defined by an 'entity column' of the dataset.

dateshiftmask = DateShiftMask(lower_bound_days=-30, upper_bound_days=30, maintain_diff=True, entity_column='Name')

# Notice how the dates are shifted by the same amount within each group, but the amounts are different between groups.
dateshiftmask.transform(data, col='date')
....
             date   open   high    low  close    volume Name
0      2013-02-07  15.07  15.12  14.63  14.75   8407500  AAL
1      2013-02-10  14.89  15.01  14.26  14.46   8882000  AAL
2      2013-02-11  14.45  14.51  14.10  14.27   8126000  AAL
3      2013-02-12  14.30  14.94  14.25  14.66  10259500  AAL
4      2013-02-13  14.94  14.96  13.16  13.99  31879900  AAL
...           ...    ...    ...    ...    ...       ...  ...
618096 2018-01-31  76.84  78.27  76.69  77.82   2982259  ZTS
618097 2018-02-01  77.53  78.12  76.73  76.78   2595187  ZTS
618098 2018-02-04  76.64  76.92  73.18  73.83   2962031  ZTS
618099 2018-02-05  72.74  74.56  72.13  73.27   4924323  ZTS
618100 2018-02-06  72.70  75.00  72.69  73.86   4534912  ZTS
....

Deterministic Encryption

Uses the AES (Advanced Encryption Standard) algorithm to encrypt data in a column. The encryption is deterministic, meaning that the same value will always be encrypted to the same value, preserving the referential integrity of the data. This means that the encrypted data can be used for joins and other operations that require the encrypted data to be comparable. This masking is reversible using the key.

Details

Users must provide a key when initialising the mask. The key must be a 16, 24 or 32 byte string. Users may also supply an additional 16 byte 'tweak' alongside the key, which enhances security. The encrypted data is returned by calling transform().
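The key length requirement is easy to get wrong, because secrets.token_hex(n) returns 2n hexadecimal characters rather than n. The sketch below checks a candidate key before use. The helper check_key is hypothetical, and it assumes the length is measured on the UTF-8 encoding of the string.

```python
import secrets

VALID_KEY_LENGTHS = (16, 24, 32)


def check_key(key):
    """Return the key length in bytes, or raise ValueError if unsupported."""
    # Assumption: the length is counted on the UTF-8 encoded string.
    n_bytes = len(key.encode("utf-8"))
    if n_bytes not in VALID_KEY_LENGTHS:
        raise ValueError(f"key is {n_bytes} bytes; expected one of {VALID_KEY_LENGTHS}")
    return n_bytes


# token_hex(16) draws 16 random bytes and renders them as 32 hex characters.
key = secrets.token_hex(16)
key_length = check_key(key)
```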

Example:

Load example data:

import synthesized_datasets as sd
data = sd.ALL.healthcare.load()
data.head()
....
   Unnamed: 0      gender first_name  last_name  weight     NHS_number  ... postcode  synchronous_tumour_indicator pathology_investigation_type lesion_size number_of_lesions  outcome
0           0      Female      Maude    Jackson      50  256 3138 8154  ...  NP132JL                      0.116682                            0    0.852172                11        0
1           1      Female     Leanne     Potter      44  640 0311 3044  ...  EH146AE                      0.916627                            1    0.709176                 6        0
2           2        Male     Johnie     Carney      82  113 2535 9715  ...  TW7 6LG                      0.312163                            1    1.533713                11        1
3           3      Female    Susanne     Joseph      60  807 8602 0184  ...  SW130EH                      0.040120                            0    0.400135                16        0
4           4  Non-binary       Cora  Blackburn      53  657 6533 0112  ...  PO318HA                      0.143843                            0    0.116941                24        0
....

Apply masking:

from synthesized3.mask import DeterministicEncryptionMask
import secrets

# Generate a random key
key = secrets.token_hex(16)
deterministic_encryption_mask = DeterministicEncryptionMask(key=key)
masked_data = deterministic_encryption_mask.transform(data, col='first_name')
masked_data.head()
....
   Unnamed: 0      gender                            first_name  last_name  weight  ... synchronous_tumour_indicator pathology_investigation_type  lesion_size number_of_lesions outcome
0           0      Female      3qonb1A=P2ew8fUFcmyUISbLZyiyiA==    Jackson      50  ...                     0.116682                            0     0.852172                11       0
1           1      Female      37bmyVSDX8pZTaodaUF3rx0qBAxXEg==     Potter      44  ...                     0.916627                            1     0.709176                 6       0
2           2        Male      6TRz6LmfEnuIREDhFGu4cG+5A71rBw==     Carney      82  ...                     0.312163                            1     1.533713                11       1
3           3      Female  YYesI7S7Ag==7lobPz+yMKwTU5qdzv/wew==     Joseph      60  ...                     0.040120                            0     0.400135                16       0
4           4  Non-binary      cU3qyg==iP/YYnl6ethAzOH7JYbT3Q==  Blackburn      53  ...                     0.143843                            0     0.116941                24       0

[5 rows x 16 columns]
....

Reverse masking:

deterministic_encryption_mask.inverse_transform(masked_data, col='first_name').head()
....
   Unnamed: 0      gender first_name  last_name  weight     NHS_number  ... postcode  synchronous_tumour_indicator pathology_investigation_type lesion_size number_of_lesions  outcome
0           0      Female      Maude    Jackson      50  256 3138 8154  ...  NP132JL                      0.116682                            0    0.852172                11        0
1           1      Female     Leanne     Potter      44  640 0311 3044  ...  EH146AE                      0.916627                            1    0.709176                 6        0
2           2        Male     Johnie     Carney      82  113 2535 9715  ...  TW7 6LG                      0.312163                            1    1.533713                11        1
3           3      Female    Susanne     Joseph      60  807 8602 0184  ...  SW130EH                      0.040120                            0    0.400135                16        0
4           4  Non-binary       Cora  Blackburn      53  657 6533 0112  ...  PO318HA                      0.143843                            0    0.116941                24        0
....

Format Preserving Hashing

Hashes values using the SHA256 algorithm whilst preserving the format of the original values. The format is preserved by specifying an alphabet of characters that the hashed values can contain. The alphabet can be any string of characters. The alphabet must be at least 2 characters long and must not contain any duplicate characters.
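The idea of hashing into a fixed alphabet can be sketched with the standard library. This is a toy illustration and not the SDK's algorithm: the function names are hypothetical, the per-position hashing scheme is invented for the example, and leaving characters outside the alphabet unchanged is an assumption. It does apply the two alphabet rules stated above.

```python
import hashlib
import string


def validate_alphabet(alphabet):
    """Enforce the documented rules: at least 2 characters, no duplicates."""
    if len(alphabet) < 2:
        raise ValueError("alphabet must contain at least 2 characters")
    if len(set(alphabet)) != len(alphabet):
        raise ValueError("alphabet must not contain duplicate characters")


def toy_format_preserving_hash(value, alphabet):
    """Map each alphabet character to another one derived from SHA-256."""
    validate_alphabet(alphabet)
    out = []
    for i, ch in enumerate(value):
        if ch not in alphabet:
            # Assumption: characters outside the alphabet are kept as-is.
            out.append(ch)
            continue
        digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
        out.append(alphabet[int.from_bytes(digest, "big") % len(alphabet)])
    return "".join(out)


hashed = toy_format_preserving_hash("Maude", string.ascii_letters)
```

The result has the same length as the input and is the same on every call, but unlike encryption it cannot be reversed.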

Details

Example:

Load example data:

import synthesized_datasets as sd
data = sd.ALL.healthcare.load()
data = data[['first_name','last_name', 'NHS_number']]
data.head()
....
  first_name  last_name     NHS_number
0      Maude    Jackson  256 3138 8154
1     Leanne     Potter  640 0311 3044
2     Johnie     Carney  113 2535 9715
3    Susanne     Joseph  807 8602 0184
4       Cora  Blackburn  657 6533 0112
....

Apply masking:

from synthesized3.mask import FormatPreservingHashingMask

# `string.ascii_letters` is a string containing all ascii letters
import string
format_preserving_hashing_mask = FormatPreservingHashingMask(alphabet=string.ascii_letters)

# Masking both first and last name
masked_data = format_preserving_hashing_mask.transform(data, col='first_name')
format_preserving_hashing_mask.transform(masked_data, col='last_name').head()
....
  first_name  last_name     NHS_number
0      EKjQa    DsBSXaV  256 3138 8154
1     PsQrzl     MldqlS  640 0311 3044
2     kbczST     xetAKw  113 2535 9715
3    ZdZoamD     uFAAaN  807 8602 0184
4       bBmE  FLnKanShx  657 6533 0112
....

Format Preserving Encryption

Encrypts values whilst preserving their format. The masking uses the FF3-1 algorithm. The format is preserved by specifying an alphabet of characters that the encrypted values can contain. The alphabet can be any string of characters. The alphabet must be at least 2 characters long and must not contain any duplicate characters. The FPE mask can be inverted by using the same key, tweak and alphabet and calling the inverse_transform() method.

Details

Example:

Load example data:

import synthesized_datasets as sd
data = sd.ALL.healthcare.load()
data = data[['first_name','last_name', 'NHS_number']]
data.head()
....
  first_name  last_name     NHS_number
0      Maude    Jackson  256 3138 8154
1     Leanne     Potter  640 0311 3044
2     Johnie     Carney  113 2535 9715
3    Susanne     Joseph  807 8602 0184
4       Cora  Blackburn  657 6533 0112
....

Apply masking:

from synthesized3.mask import FormatPreservingEncryptionMask

# Build a key
import secrets
key = secrets.token_hex(16)
tweak = secrets.token_hex(8)


# `string.ascii_letters` is a string containing all ascii letters
import string
format_preserving_encryption_mask = FormatPreservingEncryptionMask(key=key, tweak=tweak, alphabet=string.ascii_letters)

masked_data = format_preserving_encryption_mask.transform(data, col='first_name')
format_preserving_encryption_mask.transform(masked_data, col='last_name').head()
....
  first_name  last_name     NHS_number
0      ZtRcF    koIMzpj  256 3138 8154
1     WqMngQ     ryDhPl  640 0311 3044
2     lahXhO     HKCGDp  113 2535 9715
3    TGQaFOm     KKNZNz  807 8602 0184
4       fhGv  FISWzrfpB  657 6533 0112
....

Reverse masking using the same key, tweak and alphabet:

new_format_preserving_encryption_mask = FormatPreservingEncryptionMask(key=key, tweak=tweak, alphabet=string.ascii_letters)
new_format_preserving_encryption_mask.inverse_transform(masked_data, col='first_name').head()
....
  first_name  last_name     NHS_number
0      Maude    Jackson  256 3138 8154
1     Leanne     Potter  640 0311 3044
2     Johnie     Carney  113 2535 9715
3    Susanne     Joseph  807 8602 0184
4       Cora  Blackburn  657 6533 0112
....

Nullify

Replaces values in a column with null values.

Details

Example:

Load example data:

import synthesized_datasets as sd
data = sd.ALL.healthcare.load()
data = data[['first_name','last_name', 'NHS_number']]
data.head()
....
  first_name  last_name     NHS_number
0      Maude    Jackson  256 3138 8154
1     Leanne     Potter  640 0311 3044
2     Johnie     Carney  113 2535 9715
3    Susanne     Joseph  807 8602 0184
4       Cora  Blackburn  657 6533 0112
....

Apply masking:

from synthesized3.mask import NullMask

null_mask = NullMask()
null_mask.transform(data, col='NHS_number').head()
....
  first_name  last_name NHS_number
0      Maude    Jackson       None
1     Leanne     Potter       None
2     Johnie     Carney       None
3    Susanne     Joseph       None
4       Cora  Blackburn       None
....

Redact

Removes values in a column. There are several modes of operation for this mask:

  1. The default mode redacts the whole value.

  2. Users can redact only a portion of each value by specifying the portion argument. By default the redaction starts from the beginning of the value, but users can choose to redact from the end by setting mask_start=False.

  3. Users can specify a regular expression to match the text to redact by setting the pattern argument.
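The three redaction modes can be sketched for a single string as follows. This is a plain-Python illustration, not the SDK's code: the function redact is hypothetical, and rounding the redacted length up with math.ceil is an assumption, chosen because it is consistent with the 'NP132JL' to 'NP13' result shown in the example output in this section.

```python
import math
import re


def redact(value, portion=None, mask_start=True, pattern=None):
    """Redact a whole string, a leading/trailing portion, or regex matches."""
    if pattern is not None:
        # Mode 3: remove only the text matched by the regular expression.
        return re.sub(pattern, "", value)
    if portion is None:
        # Mode 1: remove the whole value.
        return ""
    # Mode 2: remove a fraction of the characters (rounding up is assumed).
    n = math.ceil(portion * len(value))
    return value[n:] if mask_start else value[:len(value) - n]


whole = redact("NP132JL")
tail = redact("NP132JL", portion=0.3, mask_start=False)
head = redact("NP132JL", portion=0.3)
by_regex = redact("LONDON SW13", pattern=r"\s\w+$")
```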

Details

Example:

Load example data:

import synthesized_datasets as sd
data = sd.ALL.healthcare.load()
data = data[['city','postcode']]
data.head()
....
          city postcode
0  ABERTILLERY  NP132JL
1       CURRIE  EH146AE
2    ISLEWORTH  TW7 6LG
3  LONDON SW13  SW130EH
4        COWES  PO318HA
....

Apply masking:

from synthesized3.mask import RedactionMask
# Default mode of operation
redaction_mask = RedactionMask()
redaction_mask.transform(data, col='postcode').head()
....
          city postcode
0  ABERTILLERY
1       CURRIE
2    ISLEWORTH
3  LONDON SW13
4        COWES
....

Redact only a portion of the data:

redaction_mask_portion = RedactionMask(portion=0.3, mask_start=False)
redaction_mask_portion.transform(data, col='postcode').head()
....
          city postcode
0  ABERTILLERY     NP13
1       CURRIE     EH14
2    ISLEWORTH      TW7
3  LONDON SW13     SW13
4        COWES     PO31
....

Notice that part of the postcode is included in the city column. Let’s fix that using a regular expression:

redaction_mask_regex = RedactionMask(pattern=r'\s\w+$')
redaction_mask_regex.transform(data, col='city').head()
....
          city postcode
0  ABERTILLERY  NP132JL
1       CURRIE  EH146AE
2    ISLEWORTH  TW7 6LG
3       LONDON  SW130EH
4        COWES  PO318HA
....

Replace

Replaces values in a column with a specified value. The value, or the portion of a value, to be replaced can be defined by passing a regex pattern. The replacement can be specified either as a single string or as a list of strings. If a list is passed, the replacement values will be chosen at random from the list.
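The behaviour described above maps naturally onto Python's re.sub. The sketch below is an illustration under assumptions, not the SDK's implementation: the function replace and its rng parameter are hypothetical, and treating a missing pattern as "replace the whole value once" is a simplification, so the SDK's default matching may behave differently.

```python
import random
import re


def replace(value, replacement_value, pattern=None, rng=random):
    """Replace the whole value, or only regex matches, with a replacement."""

    def pick(_match=None):
        # A single string is used as-is; a list is sampled at random.
        if isinstance(replacement_value, str):
            return replacement_value
        return rng.choice(replacement_value)

    if pattern is None:
        # Assumption: without a pattern the entire value is replaced once.
        return pick()
    # A callable replacement avoids backslash escapes being interpreted.
    return re.sub(pattern, pick, value)


ones_masked = replace("256 3138 8154", "*", pattern="1")
names = ["FIRST_NAME_1", "FIRST_NAME_2", "FIRST_NAME_3"]
sampled = replace("Maude", names, rng=random.Random(0))
```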

Details

Example:

Load example data:

import synthesized_datasets as sd
from synthesized3.mask import ReplacementMask

data = sd.ALL.healthcare.load()
data = data[['first_name','last_name','NHS_number']]
data.head()
....
  first_name  last_name     NHS_number
0      Maude    Jackson  256 3138 8154
1     Leanne     Potter  640 0311 3044
2     Johnie     Carney  113 2535 9715
3    Susanne     Joseph  807 8602 0184
4       Cora  Blackburn  657 6533 0112
....

# Default behaviour is to match everything
replacement_mask = ReplacementMask(replacement_value="*")
replacement_mask.transform(data, col='NHS_number').head()
....
  first_name  last_name NHS_number
0      Maude    Jackson         **
1     Leanne     Potter         **
2     Johnie     Carney         **
3    Susanne     Joseph         **
4       Cora  Blackburn         **
....

# Replacing with a sample of values from a list
replacement_name_mask = ReplacementMask(replacement_value=['FIRST_NAME_1', 'FIRST_NAME_2', 'FIRST_NAME_3'])
replacement_name_mask.transform(data, col="first_name").head()
....
     first_name  last_name     NHS_number
0  FIRST_NAME_3    Jackson  256 3138 8154
1  FIRST_NAME_2     Potter  640 0311 3044
2  FIRST_NAME_1     Carney  113 2535 9715
3  FIRST_NAME_3     Joseph  807 8602 0184
4  FIRST_NAME_1  Blackburn  657 6533 0112
....

# Replacing specific parts of the data via regex matching
replace_ones_mask = ReplacementMask(replacement_value="*", pattern="1")
replace_ones_mask.transform(data, col="NHS_number").head()
....
  first_name  last_name     NHS_number
0      Maude    Jackson  256 3*38 8*54
1     Leanne     Potter  640 03** 3044
2     Johnie     Carney  **3 2535 97*5
3    Susanne     Joseph  807 8602 0*84
4       Cora  Blackburn  657 6533 0**2
....

Time Extraction

Extracts a single component of the time information from a datetime column. The following components can be extracted, shown with their value ranges:

  • YEAR: [0-9999]

  • MONTH: [1-12]

  • DAY_OF_MONTH: [1-31]

  • DAY_OF_WEEK: [1-7]

  • WEEK_OF_YEAR: [1-53]

  • HOUR_OF_DAY: [0-23]

  • MINUTE_OF_HOUR: [0-59]

  • SECOND_OF_MINUTE: [0-59]

  • MICROSECOND_OF_SECOND: [0-999999]

Select a component by passing its name as the value of the 'time_part' argument upon initialisation.
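For reference, each of the listed components has a direct counterpart on Python's datetime objects. The mapping below is a sketch, not the SDK's code: extract_time_part is a hypothetical helper, and the conventions chosen for DAY_OF_WEEK (1 = Monday) and WEEK_OF_YEAR (ISO week number) are assumptions, since the numbering used by the mask is not stated here.

```python
from datetime import datetime

TIME_PARTS = {
    "YEAR": lambda dt: dt.year,
    "MONTH": lambda dt: dt.month,
    "DAY_OF_MONTH": lambda dt: dt.day,
    "DAY_OF_WEEK": lambda dt: dt.isoweekday(),       # assumption: 1 = Monday
    "WEEK_OF_YEAR": lambda dt: dt.isocalendar()[1],  # assumption: ISO week
    "HOUR_OF_DAY": lambda dt: dt.hour,
    "MINUTE_OF_HOUR": lambda dt: dt.minute,
    "SECOND_OF_MINUTE": lambda dt: dt.second,
    "MICROSECOND_OF_SECOND": lambda dt: dt.microsecond,
}


def extract_time_part(dt, time_part):
    """Return one component of a datetime, selected by name."""
    if time_part not in TIME_PARTS:
        raise ValueError(f"unknown time_part: {time_part!r}")
    return TIME_PARTS[time_part](dt)


stamp = datetime(2019, 4, 2, 17, 55, 0)
month = extract_time_part(stamp, "MONTH")
minute = extract_time_part(stamp, "MINUTE_OF_HOUR")
```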

Details

Example:

Load example data:

import synthesized_datasets as sd
data = sd.ALL.noaa_isd_weather_additional_dtypes_small.load()
data = data[['datetime', 'longitude','latitude']]
data.head()
....
              datetime  longitude  latitude
0  2019-04-02 17:55:00   -170.212    57.158
1  2019-04-02 14:30:00   -170.212    57.158
2  2019-04-02 08:00:00   -170.212    57.158
3  2019-04-02 09:30:00   -102.774    33.956
4  2019-04-02 14:20:00   -117.526    47.417
....

# Ensure that the date column is of type 'datetime'
import pandas as pd
from synthesized3.mask import TimeExtractionMask

data['datetime'] = pd.to_datetime(data['datetime'])

# Extract only the month
time_extraction_mask = TimeExtractionMask(time_part='MONTH')
time_extraction_mask.transform(data, col='datetime').head()
....
   datetime  longitude  latitude
0         4   -170.212    57.158
1         4   -170.212    57.158
2         4   -170.212    57.158
3         4   -102.774    33.956
4         4   -117.526    47.417
....

# Extract only the minute of the hour
time_extraction_mask = TimeExtractionMask(time_part='MINUTE_OF_HOUR')
time_extraction_mask.transform(data, col='datetime').head()
....
   datetime  longitude  latitude
0        55   -170.212    57.158
1        30   -170.212    57.158
2         0   -170.212    57.158
3        30   -102.774    33.956
4        20   -117.526    47.417
....