Getting Started

Generate data for an empty database

To get familiar with TDK and its features without access to production-scale database servers, you can use our demo project based on Docker Compose. It comes with two preconfigured PostgreSQL instances, the Pagila database schema, and TDK.

To run this demo, you need Git and Docker with Docker Compose installed on your machine.

Run the following commands:

git clone https://github.com/synthesized-io/tdk-demo
cd tdk-demo/postgres
export CONFIG_FILE=config_generation_from_scratch.tdk.yaml
docker-compose run tdk_admin

In this and the following examples, Windows users should type set CONFIG_FILE=… instead of export CONFIG_FILE=…

This will spin up two PostgreSQL instances and pgAdmin in containers and run the TDK transformation defined in the config_generation_from_scratch.tdk.yaml file in the tdk-demo/postgres folder. The docker-compose run command will take some time to complete the data generation process. Once it finishes, both database instances will be available to connect to and browse.
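The export/set difference between shells can be sidestepped by launching the demo from a small script. The sketch below is one way to do that in Python. It is not part of the demo project, and the helper names build_env, build_args and run_tdk are hypothetical. It assumes docker-compose is on the PATH and reuses the tdk_admin service name from the commands above.

```python
import os
import subprocess


def build_env(config_file, base_env=None):
    """Return a copy of the environment with CONFIG_FILE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CONFIG_FILE"] = config_file
    return env


def build_args(compose_files=()):
    """Build the docker-compose command line, adding one -f flag per file."""
    args = ["docker-compose"]
    for compose_file in compose_files:
        args += ["-f", compose_file]
    return args + ["run", "tdk_admin"]


def run_tdk(config_file, compose_files=()):
    """Run the demo; raises CalledProcessError if docker-compose fails."""
    return subprocess.run(
        build_args(compose_files), env=build_env(config_file), check=True
    )
```

Calling run_tdk("config_generation_from_scratch.tdk.yaml") from the tdk-demo/postgres folder should then behave like the export and docker-compose run pair, without any shell-specific syntax.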

After that, we can connect to the output database as follows:

navigate to http://localhost:8888
navigate to output_db and provide the password postgres when asked

and examine the schema and generated data.

For example, the film table will look like the following:

film_id	title	description
0	70Y8476yOn1a0	20xmSDz_HOO_Tt_bL_zR4IS7
1	jFhpoAwf_	vtD
2		7jVZ_vDQM_po03zU
3	QIjfVjETHkhYEyN1D8kEOeyeGMG9q	YGZXCcbhSYVgpBCh4n
4	jrabi	5MwhCeSmIFhEMDV
5	1YPSQL2h6XKAOnJDtOjpZ74rmyf_
6	dlivZ	eLD8HnqqNXtgj0CPowmdzs
7	Cz0h1Ygv9Kz7XeyKLFIQIYIw7evbo	PgeYtd65f47kM3IgRECGznzxnQllpD
8	e4EY	Zgl06VHFGrGTI9ZuloVI

The actor table content will look like this:

actor_id	first_name	last_name	last_update
0	bNjI7RkrVIVwe9pNcuhWka	oLHWX0	1986-11-19 19:45:48.781+00
1	3ez9	B3EIF	1982-10-24 09:13:26.649+00
2	A7YIAtb7RxCZ	tWuLHIE04ROLtRnVdg5NGrTh	1977-09-27 13:40:08.226+00
3	DDYtdEgfOouqk	atfWawX	1970-03-24 11:42:20.939+00
4	S0c	dgeZ1uRMa7FmweQvCW_j	2022-07-10 23:13:40.561+00
5	GCQW5U6SBGymjaoZ4Zp6D	HhD4uZnGVv	2017-05-21 13:32:35.787+00
6	wFtiY3GdXwLcOvPyCMo_L	ycN5fiwHZkK6Z6956LMmco0	2013-04-13 13:46:33.934+00
7	xa_d4IDVAF_fRSEOl0iEVNfjmCTJTt7w	B2	2009-09-21 00:49:25.392+00

As we can see, the first_name and last_name fields contain random strings that don’t look like people's names. We can improve the configuration by making it use person_generator. To do this, add the following to the tables section of the config_generation_from_scratch.tdk.yaml file in the tdk-demo/postgres folder:

  - table_name_with_schema: "public.actor"
    transformations:
      - columns:
          - "first_name"
          - "last_name"
        params:
          type: "person_generator"
          column_templates:
            - "${first_name}"
            - "${last_name}"
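To build intuition for the column_templates entries, the sketch below fills ${...} placeholders from one generated person, one template per column. It only illustrates placeholder substitution using Python's string.Template, which happens to share the ${name} syntax. It is an assumption about the general idea, not TDK's implementation, and render_columns is a hypothetical helper.

```python
from string import Template

# Hypothetical generated person; the keys mirror the placeholders in the config.
person = {"first_name": "Brian", "last_name": "Cronin"}


def render_columns(column_templates, person):
    """Fill each column template with values from one generated person."""
    return [Template(template).substitute(person) for template in column_templates]


# One rendered value per listed column, in the same order as the columns.
row = render_columns(["${first_name}", "${last_name}"], person)
print(row)  # ['Brian', 'Cronin']
```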

Run the same docker-compose run command again and re-query the data from the actor table. You will see more realistic names for actors:

actor_id	first_name	last_name	last_update
0	Brian	Cronin	1986-11-19 19:45:48.781+00
1	Kurtis	Lebsack	1982-10-24 09:13:26.649+00
2	Lenard	Pfeffer	1977-09-27 13:40:08.226+00
3	Aretha	Paucek	1970-03-24 11:42:20.939+00
4	Vania	Stark	2022-07-10 23:13:40.561+00
5	Giovanni	Schinner	2017-05-21 13:32:35.787+00
6	Hans	Willms	2013-04-13 13:46:33.934+00

Congratulations on completing your first data transformations using Synthesized TDK! You can now proceed with experiments using various configurations and databases.

Mask data

Data masking is a technique used to hide sensitive or confidential information in a database by replacing it with fictitious but realistic data. This is done to protect the privacy of individuals and organizations whose data is stored in the database.
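As a rough illustration of the idea, the sketch below replaces a real value with a fictitious one chosen from a fixed list. The mapping is deterministic, so the same input always yields the same replacement and repeated values stay consistent across rows. This is a simplified stand-in and not how TDK masks data. The list FAKE_FIRST_NAMES and the function mask_value are hypothetical.

```python
import hashlib

# Hypothetical pool of realistic replacement values.
FAKE_FIRST_NAMES = ["Brian", "Kurtis", "Lenard", "Aretha", "Vania"]


def mask_value(value, replacements):
    """Deterministically map a real value to one of the fictitious replacements."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an index into the replacement pool.
    index = int.from_bytes(digest[:8], "big") % len(replacements)
    return replacements[index]


masked = [mask_value(name, FAKE_FIRST_NAMES) for name in ["Penelope", "Nick", "Penelope"]]
# The two "Penelope" entries receive the same replacement.
print(masked[0] == masked[2])  # True
```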

You can use the following commands to mask the existing data in an example Pagila database:

docker-compose down
export CONFIG_FILE=config_masking.tdk.yaml
docker-compose -f docker-compose.yaml -f docker-compose-input-db.yaml run tdk_admin

The input database server is available in pgAdmin as input_db; the password is still postgres. Compare the contents of the input and output database tables to see how masking works. You can modify the config_masking.tdk.yaml configuration to fine-tune the masking.

You can find out more about masking in the Masking tutorial.

Generate data

Sometimes we need to inflate the database with additional records. This may be necessary for various scenarios, such as load testing, development, debugging, etc., when the available amount of data is insufficient.
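To make the idea of inflating a table concrete, the sketch below grows an in-memory table by a given ratio. New rows get fresh primary keys, and every other column is sampled from values already present in that column. This is a deliberately naive illustration and not TDK's generation algorithm. The function inflate and the sample rows are hypothetical.

```python
import random


def inflate(rows, key, ratio, seed=0):
    """Return rows plus synthetic rows so the table grows by the given ratio.

    Assumes rows is non-empty and that key holds integer primary keys.
    """
    rng = random.Random(seed)
    columns = [column for column in rows[0] if column != key]
    next_id = max(row[key] for row in rows) + 1
    extra = int(len(rows) * (ratio - 1))
    new_rows = []
    for offset in range(extra):
        row = {key: next_id + offset}
        for column in columns:
            # Sample each column independently from the existing values.
            row[column] = rng.choice(rows)[column]
        new_rows.append(row)
    return rows + new_rows


actors = [
    {"actor_id": 0, "first_name": "Brian", "last_name": "Cronin"},
    {"actor_id": 1, "first_name": "Kurtis", "last_name": "Lebsack"},
    {"actor_id": 2, "first_name": "Lenard", "last_name": "Pfeffer"},
]
doubled = inflate(actors, key="actor_id", ratio=2)
print(len(doubled))  # 6
```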

The following example doubles the number of records in the Pagila database:

docker-compose down
export CONFIG_FILE=config_generation.tdk.yaml
docker-compose -f docker-compose.yaml -f docker-compose-input-db.yaml run tdk_admin

You can compare the input and output databases by browsing input_db and output_db in pgAdmin at http://localhost:8888.

You can find out more about data generation in the Generation tutorial.

Subset data

If the available database is too large, we may want to reduce its size by taking a subset of records in order to speed up development and testing.
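One thing a subset has to preserve is referential integrity: a child row should survive only if the parent row it points to survives. The sketch below shows this for two in-memory tables. It is a simplified illustration and not TDK's subsetting algorithm. The function subset_with_references and the sample rows are hypothetical.

```python
def subset_with_references(parents, children, parent_key, foreign_key, keep_fraction):
    """Keep a leading fraction of parent rows and only the children that reference them."""
    keep_count = max(1, int(len(parents) * keep_fraction))
    kept_parents = parents[:keep_count]
    kept_ids = {parent[parent_key] for parent in kept_parents}
    # Drop child rows whose foreign key points at a removed parent.
    kept_children = [child for child in children if child[foreign_key] in kept_ids]
    return kept_parents, kept_children


films = [{"film_id": 1}, {"film_id": 2}, {"film_id": 3}, {"film_id": 4}]
film_actor = [
    {"actor_id": 10, "film_id": 1},
    {"actor_id": 11, "film_id": 3},
    {"actor_id": 12, "film_id": 2},
    {"actor_id": 10, "film_id": 4},
]
kept_films, kept_links = subset_with_references(
    films, film_actor, parent_key="film_id", foreign_key="film_id", keep_fraction=0.5
)
print(kept_films)  # [{'film_id': 1}, {'film_id': 2}]
print(kept_links)  # [{'actor_id': 10, 'film_id': 1}, {'actor_id': 12, 'film_id': 2}]
```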

The following example demonstrates how to subset the Pagila database:

docker-compose down
export CONFIG_FILE=config_subsetting.tdk.yaml
docker-compose -f docker-compose.yaml -f docker-compose-input-db.yaml run tdk_admin

As usual, the input database is available as input_db and the output database as output_db in pgAdmin at http://localhost:8888.

You can find out more about subsetting in the Subsetting tutorial.