Masking
Software prerequisites
To run the following examples on your machine, you should first:
-
Install TDK CLI
-
Download tdk-tutorial-databases repository
git clone https://github.com/synthesized-io/tdk-tutorial-databases
-
Start PostgreSQL instances and pgAdmin using the following commands:
cd tdk-tutorial-databases/ docker-compose run databases
pgAdmin will be available at http://localhost:8888 with the password postgres
.
In order to run TDK with the configurations provided in this tutorial, save them to config.yaml
file and execute the following command:
tdk \
--input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
--output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
--config-file config_masking.tdk.yaml
If we run TDK with an empty configuration file, that would be the equivalent of using:
default_config:
mode: KEEP
In KEEP
mode, TDK copies data without any modification.
When processing production data, we need to ensure that no sensitive information left the environment. We can achieve this by using the MASKING
mode. Let’s start with a simple configuration:
default_config:
mode: MASKING
safety_mode: RELAXED
We can add TRUNCATE
and CREATE_IF_NOT_EXIST
options to truncate tables in the output database and create a schema only if it does not exist:
default_config:
mode: MASKING
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
For more schema management options, refer to the YAML configuration part of the documentation.
In the example, we have tables customer
and address
linked by the address_id
field:
Here are samples of data from these tables in the source database:
address_id | address | district | city_id | postal_code | phone |
---|---|---|---|---|---|
5 |
1913 Hanoi Way |
Nagasaki |
463 |
35200 |
28303384290 |
6 |
1121 Loja Avenue |
California |
449 |
17886 |
838635286649 |
7 |
692 Joliet Street |
Attika |
38 |
83579 |
448477190408 |
8 |
1566 Inegl Manor |
Mandalay |
349 |
53561 |
705814003527 |
customer_id |
store_id |
first_name |
last_name |
address_id |
|
1 |
1 |
MARY |
SMITH |
5 |
|
2 |
1 |
PATRICIA |
JOHNSON |
6 |
|
3 |
1 |
LINDA |
WILLIAMS |
7 |
|
4 |
2 |
BARBARA |
JONES |
8 |
To run the masking transformation, use the following command:
tdk \
--input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
--output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
--config-file config_masking.tdk.yaml
TDK chooses the best masking mechanism based on the data type. You can see what transformations applied by using the dry-run
command. For more arguments, refer to the CLI reference).
Here is the result of default masking applied to address
table:
address_id | address | district | city_id | postal_code | phone |
---|---|---|---|---|---|
15183026 |
5314 Laypz Acm |
Plxwuyle |
1693424342 |
79518 |
18846059374 |
695573888 |
0383 Wmwq Cckxcg |
Iyzmcmbynx |
2076144198 |
76495 |
339497461347 |
716836103 |
306 Kqeywo Hqeekv |
Qfogmd |
1505222168 |
13429 |
277221410948 |
1546062460 |
3754 Bfgub Jijoy |
Twwndezz |
854724718 |
38281 |
640270266870 |
As you can see, all strings were masked with random characters. You may also notice that the format of strings, such as telephone numbers, has been preserved. This approach is called Format Preserving Hashing
. Read more about transformations here.
If we take a look at the customer
table, it has also been changed:
customer_id | store_id | first_name | last_name | address_id | |
---|---|---|---|---|---|
2013831174 |
2013831174 |
WQEJ |
YKDPW |
15183026 |
|
1843733460 |
2013831174 |
XJMVRIKE |
DEJECWQ |
695573888 |
|
589262816 |
2013831174 |
GBBQA |
FJENIEYU |
716836103 |
|
376640671 |
1843733460 |
HTIXRVE |
CURAO |
1546062460 |
Note that although numeric ids like address_id
are masked with random numbers, in linked tables they are masked synchronously so that foreign key relations persist. This approach is called Unique Hashing.
Sometimes we want to keep some tables as is. For that, we can use table settings overrides:
default_config:
mode: MASKING
tables:
- table_name_with_schema: "public.actor"
mode: KEEP
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
Please note that the mode defined in default_config
is applied automatically to all tables, and
tables
work as override for that global default.
Also, we can customise our masking for others tables. Let’s imagine we need more realistic names, addresses, and phone numbers in customer
and address
tables. For this, we can use person_generator, address_generator and formatted_string_generator :
default_config:
mode: MASKING
tables:
- table_name_with_schema: "public.customer"
transformations:
- columns: ["first_name", "last_name", "email"]
params:
type: "person_generator"
column_templates: ["${first_name}", "${last_name}", "${email}"]
- table_name_with_schema: "public.address"
transformations:
- columns: ["address", "district", "postal_code"]
params:
type: "address_generator"
column_templates: ["${house_number} ${street_name}", "${city}", "${zip_code}"]
- columns: ["phone"]
params:
type: "formatted_string_generator"
pattern: "\\+44[0-9]{4} [0-9]{6}"
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
Let’s see the result:
address_id | address | district | city_id | postal_code | phone |
---|---|---|---|---|---|
15183026 |
215 Feest Stream |
Benitamouth |
1693424342 |
NB2A 9AA |
+449537 136610 |
695573888 |
172 Frances Run |
Port Nedland |
2076144198 |
ZX9R 4LW |
+444078 471011 |
716836103 |
245 Hartmann Creek |
West Kaylene |
1505222168 |
UQ2W 7JW |
+447869 036453 |
1546062460 |
281 Barton Underpass |
Sethton |
854724718 |
SL8H 5HN |
+442518 225195 |
customer_id | store_id | first_name | last_name | address_id | |
---|---|---|---|---|---|
2013831174 |
2013831174 |
Casie |
Dooley |
15183026 |
|
1843733460 |
2013831174 |
Rickie |
Trantow |
695573888 |
|
589262816 |
2013831174 |
Lynn |
Kub |
716836103 |
|
376640671 |
1843733460 |
Raymonde |
Crona |
1546062460 |
More transformation parameters can be found here.
On the transformations page, you can find transformation compatibility in modes column. All "generators" can be used in both MASKING and GENERATION modes, but there are some purely "masking" transformations, such as noising , that can only be used in MASKING mode.
|
Let’s combine all examples above:
default_config:
mode: MASKING
tables:
- table_name_with_schema: "public.actor"
mode: KEEP
- table_name_with_schema: "public.customer"
transformations:
- columns: ["first_name", "last_name", "email"]
params:
type: "person_generator"
column_templates: ["${first_name}", "${last_name}", "${email}"]
- table_name_with_schema: "public.address"
transformations:
- columns: ["address", "district", "postal_code"]
params:
type: "address_generator"
column_templates: ["${house_number} ${street_name}", "${city}", "${zip_code}"]
- columns: ["phone"]
params:
type: "formatted_string_generator"
pattern: "\\+44[0-9]{4} [0-9]{6}"
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED