Masking

Software prerequisites

To run the following examples on your machine, you should first:

git clone https://github.com/synthesized-io/tdk-tutorial-databases
  • Start PostgreSQL instances and pgAdmin using the following commands:

cd tdk-tutorial-databases/
docker-compose run databases

pgAdmin will be available at http://localhost:8888 with the password postgres.

In order to run TDK with the configurations provided in this tutorial, save them to config.yaml file and execute the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config_masking.tdk.yaml

If we run TDK with an empty configuration file, that would be the equivalent of using:

default_config:
  mode: KEEP

In KEEP mode, TDK copies data without any modification.

When processing production data, we need to ensure that no sensitive information left the environment. We can achieve this by using the MASKING mode. Let’s start with a simple configuration:

default_config:
  mode: MASKING
safety_mode: RELAXED

We can add TRUNCATE and CREATE_IF_NOT_EXIST options to truncate tables in the output database and create a schema only if it does not exist:

default_config:
  mode: MASKING

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED

For more schema management options, refer to the YAML configuration part of the documentation.

In the example, we have tables customer and address linked by the address_id field:

Diagram

Here are samples of data from these tables in the source database:

Table 1. address table before masking
address_id address district city_id postal_code phone

5

1913 Hanoi Way

Nagasaki

463

35200

28303384290

6

1121 Loja Avenue

California

449

17886

838635286649

7

692 Joliet Street

Attika

38

83579

448477190408

8

1566 Inegl Manor

Mandalay

349

53561

705814003527

Table 2. customer table before masking

customer_id

store_id

first_name

last_name

email

address_id

1

1

MARY

SMITH

MARY.SMITH@sakilacustomer.org

5

2

1

PATRICIA

JOHNSON

PATRICIA.JOHNSON@sakilacustomer.org

6

3

1

LINDA

WILLIAMS

LINDA.WILLIAMS@sakilacustomer.org

7

4

2

BARBARA

JONES

BARBARA.JONES@sakilacustomer.org

8

To run the masking transformation, use the following command:

tdk \
    --input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
    --output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
    --config-file config_masking.tdk.yaml

TDK chooses the best masking mechanism based on the data type. You can see what transformations applied by using the dry-run command. For more arguments, refer to the CLI reference).

Here is the result of default masking applied to address table:

Table 3. address table after masking
address_id address district city_id postal_code phone

15183026

5314 Laypz Acm

Plxwuyle

1693424342

79518

18846059374

695573888

0383 Wmwq Cckxcg

Iyzmcmbynx

2076144198

76495

339497461347

716836103

306 Kqeywo Hqeekv

Qfogmd

1505222168

13429

277221410948

1546062460

3754 Bfgub Jijoy

Twwndezz

854724718

38281

640270266870

As you can see, all strings were masked with random characters. You may also notice that the format of strings, such as telephone numbers, has been preserved. This approach is called Format Preserving Hashing. Read more about transformations here.

If we take a look at the customer table, it has also been changed:

Table 4. customer table after masking
customer_id store_id first_name last_name email address_id

2013831174

2013831174

WQEJ

YKDPW

ZBHY.OYHYA@fgnwkhodyxxgkk.lxc

15183026

1843733460

2013831174

XJMVRIKE

DEJECWQ

QRODGFBA.BYBTDSM@ywjxsmntkpvozf.xyu

695573888

589262816

2013831174

GBBQA

FJENIEYU

QOIZU.FFMEWGOU@zmrogyqpgdsapk.itw

716836103

376640671

1843733460

HTIXRVE

CURAO

EYPTGBK.TPICV@puabcqlglyjttb.qmw

1546062460

Note that although numeric ids like address_id are masked with random numbers, in linked tables they are masked synchronously so that foreign key relations persist. This approach is called Unique Hashing.

Sometimes we want to keep some tables as is. For that, we can use table settings overrides:

default_config:
  mode: MASKING

tables:
  - table_name_with_schema: "public.actor"
    mode: KEEP

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED

Please note that the mode defined in default_config is applied automatically to all tables, and tables work as override for that global default.

Also, we can customise our masking for others tables. Let’s imagine we need more realistic names, addresses, and phone numbers in customer and address tables. For this, we can use person_generator, address_generator and formatted_string_generator :

default_config:
  mode: MASKING

tables:
  - table_name_with_schema: "public.customer"
    transformations:
      - columns: ["first_name", "last_name", "email"]
        params:
          type: "person_generator"
          column_templates: ["${first_name}", "${last_name}", "${email}"]
  - table_name_with_schema: "public.address"
    transformations:
      - columns: ["address", "district", "postal_code"]
        params:
          type: "address_generator"
          column_templates: ["${house_number} ${street_name}", "${city}", "${zip_code}"]
      - columns: ["phone"]
        params:
          type: "formatted_string_generator"
          pattern: "\\+44[0-9]{4} [0-9]{6}"

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED

Let’s see the result:

Table 5. address table after masking with improved configuration
address_id address district city_id postal_code phone

15183026

215 Feest Stream

Benitamouth

1693424342

NB2A 9AA

+449537 136610

695573888

172 Frances Run

Port Nedland

2076144198

ZX9R 4LW

+444078 471011

716836103

245 Hartmann Creek

West Kaylene

1505222168

UQ2W 7JW

+447869 036453

1546062460

281 Barton Underpass

Sethton

854724718

SL8H 5HN

+442518 225195

Table 6. customer table after masking with improved configuration
customer_id store_id first_name last_name email address_id

2013831174

2013831174

Casie

Dooley

dooley2605@effertzquitzonandreinger.com

15183026

1843733460

2013831174

Rickie

Trantow

trantow7645@kertzmannhaley.com

695573888

589262816

2013831174

Lynn

Kub

kub7789@keeblerinc.com

716836103

376640671

1843733460

Raymonde

Crona

crona2450@padberggroup.com

1546062460

More transformation parameters can be found here.

On the transformations page, you can find transformation compatibility in modes column. All "generators" can be used in both MASKING and GENERATION modes, but there are some purely "masking" transformations, such as noising, that can only be used in MASKING mode.

Let’s combine all examples above:

default_config:
  mode: MASKING

tables:
  - table_name_with_schema: "public.actor"
    mode: KEEP
  - table_name_with_schema: "public.customer"
    transformations:
      - columns: ["first_name", "last_name", "email"]
        params:
          type: "person_generator"
          column_templates: ["${first_name}", "${last_name}", "${email}"]
  - table_name_with_schema: "public.address"
    transformations:
      - columns: ["address", "district", "postal_code"]
        params:
          type: "address_generator"
          column_templates: ["${house_number} ${street_name}", "${city}", "${zip_code}"]
      - columns: ["phone"]
        params:
          type: "formatted_string_generator"
          pattern: "\\+44[0-9]{4} [0-9]{6}"

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED