Generation
Software prerequisites
To run the following examples on your machine, you should first:
-
Install TDK CLI
-
Download tdk-tutorial-databases repository
git clone https://github.com/synthesized-io/tdk-tutorial-databases
-
Start PostgreSQL instances and pgAdmin using the following commands:
cd tdk-tutorial-databases/ docker-compose run databases
pgAdmin will be available at http://localhost:8888 with the password postgres
.
In order to run TDK with the configurations provided in this tutorial, save them to config.yaml
file and execute the following command:
tdk \
--input-url jdbc:postgresql://localhost:6000/postgres --input-username postgres --input-password postgres \
--output-url jdbc:postgresql://localhost:6001/postgres --output-username postgres --output-password postgres \
--config-file config_masking.tdk.yaml
Instead of masking the original data, we can generate new data. This approach might be useful if we need to produce more data than what is available in the source database, for example, for load testing. Let’s proceed to generate some data:
default_config:
mode: GENERATION
target_ratio: 2
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
We can check results using this SQL:
select 'actor' as "table", count(1) as numrows from actor
union
select 'address', count(1) from address
union
select 'category', count(1) from category
union
select 'city', count(1) from city
union
select 'country', count(1) from country
union
select 'customer', count(1) from customer
union
select 'film', count(1) from film
union
select 'film_actor', count(1) from film_actor
union
select 'film_category', count(1) from film_category
union
select 'inventory', count(1) from inventory
union
select 'language', count(1) from language
union
select 'payment', count(1) from payment
union
select 'rental', count(1) from rental
union
select 'staff', count(1) from staff
union
select 'store', count(1) from store
The result on source database will look approximately like this:
tablename | numrows |
---|---|
actor |
200 |
address |
603 |
category |
16 |
city |
600 |
film |
1000 |
film_actor |
5462 |
payment |
16049 |
rental |
16044 |
After the generation we can see that number of records doubled in each of the table:
actor | 400 |
---|---|
address |
1206 |
category |
32 |
city |
1200 |
film |
2000 |
film_actor |
10771 |
payment |
32098 |
rental |
32088 |
Here we can check that all relations in the customer→address→city→country chain work well. The result of this query on the output database should be twice as big as the result on the input database:
select count(*) from customer
inner join address on customer.address_id = address.address_id
inner join city on address.city_id = city.city_id
inner join country on city.country_id = country.country_id
Again, we can override the default table settings in tables
.
In this example, we are setting a different target ratio for specific tables:
default_config:
mode: GENERATION
target_ratio: 2
tables:
- table_name_with_schema: "public.film"
mode: GENERATION
target_ratio: 3
transformations:
- columns: [ "title" ]
params:
type: "formatted_string_generator"
pattern: "[A-Z]{16}"
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
Some generators are "fake" ones like formatted_string_generator below, which produces values independently of the original data. But some use original data to fit a model and create values following underlying distribution.
|
Here is an example of using categorical_generator
with provided probabilities:
default_config:
mode: GENERATION
target_ratio: 2
tables:
- table_name_with_schema: "public.film"
mode: GENERATION
target_ratio: 3
transformations:
- columns: [ "rating" ]
params:
type: "categorical_generator"
categories:
value_source: PROVIDED
values:
"PG-13": 223
"NC-17": 21
"R": 195
"PG": 194
"G": 178
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: RELAXED
You omit categories
and probabilities
, and they will be learned from the original data.