tdk-quest

Synthesized TDK Quest 📽️

In this engaging task, you’ll assume the role of an online marketplace development team leader. You’ll navigate through the entire application lifecycle, encountering and successfully resolving various challenges.

Your First Working Day đź‘”

So, you’ve just taken the lead of the team, and during the first meeting, you’re given all necessary information:

The company plans to speed up delivery by 25% this year, and it’s your job to make this happen.

The team organized a brainstorming session and identified the obstacles that are slowing down delivery. The following points were left on the marker board after an intense discussion:

After the discussion, you decide to meet with the Ops/DBA team to discuss how to make the dev/test databases look more like production ones, and to identify any obstacles in the process. Here’s what you discovered:

In a recent team meeting, a decision was made to optimize the bug fixing and performance enhancement processes. The key step is creating a test database populated with real production data. This means that data from the production database should be periodically replicated in test/dev databases in the automation mode. Sensitive information, such as customer personal details, finance information and so on, would be anonymized. You have been assigned to complete this task using Synthesized.

Challenge 0: Setup your workspace

What you will need

Run the Synthesized TDK Demo Environment

This project relies on the Synthesized TDK. You will need to run a Synthesized demo environment by following the instructions in the Synthesized User Guide.

You will need to download and unzip the latest version of governor-compose.zip and unzip it. Navigate to the unzipped directory and run:

    docker compose pull
    docker compose up

The UI will be available at (http://localhost:80). Use the predefined credentials to log in: test@synthesized.io / Qq12345_.

Connect to the Database

The Synthesized TDK Quest uses sample databases bundled within the demo environment. When you launch the Synthesized docker container using the default configuration, it starts a Docker service called transformed_db, which runs a PostgreSQL server. This server hosts two databases: input_source and output_source.

The input_source database is your production database for this task. Treat it like a production databases; you don’t need to make any changes here.

The output_source database is your non-production database. This is the database that you will be populating using the Synthesized TDK.

The default Synthesized configuration includes two data sources, input and output, that connect to these data sources.

To complete this task, you will need to see the contents of the two databases. Using your preferred database client, connect to the two databases using the details below. Please note that the databases are running inside the same docker container as Synthesized. This means that the connection details have different hostnames & ports when connecting locally vs when connecting through Synthesized.

  Input Output
Direct JDBC URL jdbc:postgresql://localhost:5433/input_source jdbc:postgresql://localhost:5433/output_source
Synthesized TDK JDBC URL jdbc:postgresql://transformed_db:5432/input_source jdbc:postgresql://transformed_db:5432/output_source
username apiuser apiuser
password apipassword apipassword

Once you have connected to the databases, take a look at the structure of the different tables. You should see that input contains data about actors, films, customers and staff. On the other hand, the output database is empty.

Setup your first Workflow

To populate the non-production database, you’re going to use Synthesized’s TDK.

If you’re not yet familiar with Synthesized TDK, you should learn about it by visiting the “What is TDK?” page in the official documentation, then reading about Masking.

From the home page create a new workflow and name it whatever you like. Choose to create an empty workflow. You will be prompted to select data sources, which should be automatically filled in as input and output. These are predefined data sources that connect to the demo database.

When you have created the workflow, you will see it has added some default setup. Replace the workflow with the following config:

   default_config:
      mode: MASKING

   schema_creation_mode: DROP_AND_CREATE
   safety_mode: "RELAXED"
   global_seed: 42

Note - global_seed: 42 is used here to ensure that the random data you generate will match the tests. Make sure you include that line for the duration of this task otherwise the automatic validation will fail.

Paste the example workflow config into the editor window and save. Once you have saved, run the workflow against the output data source. This will mask all of the data in the input database and use it to populate the

Validating Challenge 0

The Synthesized TDK Quest certification includes a set of commands to validate your test results. These can be found by downloading the git repository.

    git clone <https://github.com/synthesized-io/tdk-quest.git>
    cd tdk-quest

The challenge validators work by connecting to your output database and checking the contents. It will connect to the database using the credentials stored in database_configuration.yaml. If you are running the Synthesized TDK using the default docker config, you won’t need to change the database_configuration.yaml. If you are using a different database, fill in the details in that file.

You can now validate that you have set everything up correctly by running the first validation command:

    docker compose run check_env_ready

If your environment is prepared correctly, you should see: Valid setup

If you have any difficulties, check the following troubleshooting sections:

Could not connect to data source "output_db" This means that the validator couldn't access the output database. Either the database isn't running locally, or the `database_configuration.yaml` file does not have the correct credentials. Check to see if your database client is able to access the database. If the database client can't access the database either, make sure the Synthesized Governor docker compose image is running. If the database client can access the database but the tests can't, update `database_configuration.yaml`.
Metrics were not computed This is the error that is reported with the validator can access the database, but can't access the individual tables. Typically it means that you have not run the Synthesized workflow yet. Run the workflow, and use your database client to check that the output database has been populated.
Some of the tests are failing If some of the setup tests are failing, it means there's an issue in the workflow. Make sure to copy the workflow from the `Setup your first Workflow` section exactly, and remember to save! If you are missing the `mode: MASKING` setting, the system won't be generating the right masked data. If you are missing the `global_seed: 42` setting, the system could be masking data correctly, but won't be generating the values we are looking for.

When the tests pass, congratulations! You have successfully set up the team’s new non-production database.

Challenge 1: Masking the right data

With the new non-production database, the team can start testing properly. It’s great that it doesn’t include any personal data, but the team start commenting that some data doesn’t look the way they expect it to.

Luckily, with Synthesized you are in control of how data should be processed. After taking a look at the synthesized output, you set yourself a few requirements:

Requirement 1.1: The actor and film data in the system is publicly available, and should not be masked.

Requirement 1.2: While location data is sensitive, the city and country data in the system is not confidential and should remain as is.

Requirement 1.3: The address and staff data in the system are sensitive. We should use GENERATE mode to make the data more realistic.

Hints Synthesized has three [transformation modes](https://docs.synthesized.io/tdk/latest/user_guide/130_reference/configuration_reference#UserTransformationMode): `MASKING`, `GENERATION`, and `KEEP`.

Validating Challenge 1

To solve this challenge, you need to edit the Synthesized workflow so that the synthesized data meets all of the requirements. Make your edits, save, run the workflow, then check the database.

When you are sure that the data in the database has left the actor, film, city, and country data as it was, check your work by running:

    docker compose run check_challenge_1

Note that this is check_challenge_1, rather than check_setup.

Update the workflow configuration file and rerun the transformation and tests until you receive the All is good message in your terminal.

Challenge 2: Keeping the engineering team happy during Black Friday

The sales team plans to announce up to 80% discounts during Black Friday, which is a month away. This strategy is anticipated to significantly boost our customer base and sales during this period. Consequently, we will need to expand our staff by 20% to handle the surge in orders.

It’s crucial to ensure that all your systems can successfully handle the anticipated high load, preventing any loss of profits. As the head of the development team, it falls on you to confirm that the marketplace can withstand this increased load and to make any necessary optimizations to accommodate the expected traffic.

Use the workflow configuration file you created in the first challenge as a starting point, and make changes to meet the following requirements:

Requirement 2.1: Generate five times as many records in the customer table

Requirement 2.2: Double the number of payment transactions by creating new records in the payment table

Requirement 2.3: Increase the staff count by 20% by generating new records in the staff table

Hints You can set how much data to synthesize by using `target_ratio` (or `target_rows`). You can produce any number of rows in `GENERATE` mode. In `MASKING` mode or `KEEP` mode you can't generate more rows than the input data.

Validating Challenge 2

Once you have changed your workflow configuration, run the workflow and examine the results in the output database.

When you are happy, check your work by running:

    docker compose run check_challenge_2

Update the workflow configuration file and rerun the transformation and tests until you receive the All is good message in your terminal.

Challenge 3: Producing data to test our features

The team is hard at work building new features. They are focused on two key projects:

To ensure these features are working properly, we want to test these use cases against realistic data. However, the current production data doesn’t have enough information for us to run these tests. Thankfully, Synthesized allows you to generate data that meets the requirements for the two projects.

To generate realistic data here you will need to use the categorical generator, format preserving hashing generator, and the person generator.

Challenge 3a - Recording staff activity

Until now, when staff members have left the company, they were deleted from the database. The company has decided to change the system so that instead of deleting staff, they will be marked as inactive instead. They keep track of this by setting the public.staff.active column to false. Currently, all of our production data is for active staff members, but we need test data that includes inactive staff.

Requirement 3a: Update the system to generate 80% active staff, and 20% inactive staff.

Hints The [categorical generator](https://docs.synthesized.io/tdk/latest/user_guide/130_reference/transformations#CategoricalGeneratorParams) can be used to generate values that automatically match the existing data, or you can use it to provide possible values and it will generate data that matches.

Challenge 3b - Predicting demand

The team’s machine learning engineers have been hard at work building models that can predict which geographical regions will have the greatest demand.

Addresses are key personally identifiable information. Up until now we have been using completely synthesized data for the public.address table. To properly test our new demand prediction system we need data that matches our real users, while maintaining their privacy.

To get the most realistic data, we need a few things:

Requirement 3b.1: The address should be masked and maintain the same structure as the original address

Hints If you select `MASKING` mode, [format preserving hashing](https://docs.synthesized.io/tdk/latest/user_guide/130_reference/transformations#FormatPreservingHashingParams) is selected automatically for certain columns.

Requirement 3b.2: The synthesized postal code should be a number that has the same length as the original postal code

Hints The [format preserving hashing generator](https://docs.synthesized.io/tdk/latest/user_guide/130_reference/transformations#FormatPreservingHashingParams) can be used to maintain the structure of the input data.

Requirement 3b.3: District isn’t personally identifiable information, so let’s keep the values from the source data

Hints Transformation modes (`MASKING`, `GENERATION`, & `KEEP`) can be set at the column level, the table level and the workflow level. The most specific mode will be used, so a column level setting will override table level settings.

Requirement 3b.4: Generate new but realistic phone numbers. These should be formatted as national phone numbers rather than international numbers.

Hints The [person generator](https://docs.synthesized.io/tdk/latest/user_guide/130_reference/transformations#PersonGeneratorParams) can be used to generate all sorts of information that could be associated with a person.

Validating Challenge 3

Once you have changed your workflow configuration, run the workflow and examine the results in the output database.

When you are happy, check your work by running:

    docker compose run check_challenge_3

Update the workflow configuration file and rerun the transformation and tests until you receive the All is good message in your terminal.

Summing up the results

Congratulations! If you’ve reached this point, it implies that:

We hope that Synthesized will be beneficial in addressing your needs for high-quality, realistic, and secure data for testing and development. We’re ready to answer all your questions and provide necessary assistance. Please feel free to email us at support@synthesized.io

Additionally, we recommend joining our social media, where you can find more useful information related to test, fake, synthetic, obfuscated, anonymized, and generated data, among other related topics:

Thank you for your attention and support. Let’s stay in touch.

Synthesized TDK Team