Linking CSV Files in a Workflow

Generate multiple CSV files in a single workflow and link columns across them with virtual foreign keys.

Overview

A single workflow can generate data for more than one CSV file at a time. When the input directory contains multiple table folders, each folder is treated as a separate logical table within the same public schema. You can then declare foreign-key relationships in the configuration so that values generated in one file reference values generated in another.

This is useful when you need referentially consistent data across files — for example, a parent file of profiles and a child file of transactions where each transaction must point to an existing profile.

Directory Structure

Organise the input bucket (or local directory) so that each linked file lives in its own table folder under a common parent directory:

my-bucket/my-input-dir/
├── PartyProfile/
│   └── profiles.csv
└── Equity/
    └── transactions.csv

Each folder name becomes the table name. After the workflow runs, the output directory mirrors the same structure, with a single data.csv per folder:

my-bucket/my-output-dir/
├── PartyProfile/
│   └── data.csv
└── Equity/
    └── data.csv

Configuration

The configuration defines both tables under tables, then declares the relationship between them under metadata.tables.foreign_keys. The schema is always public for CSV workflows.

In the example below:

  • PartyProfile is the parent table. Its Profile ID column is generated as a 10-digit sequence and kept (mode: KEEP) so child rows have stable values to reference.

  • Equity is the child table. It generates 300 rows of transaction data, and its Profile ID column is linked back to the parent through a foreign key.

default_config:
  mode: GENERATION
  target_row_number: 100
  use_working_directory: false
  locale: "en-GB"

tables:
  - table_name_with_schema: "public.PartyProfile"
    mode: KEEP
    transformations:
      - columns: ["Profile ID"]
        params:
          type: string_sequence_generator
          length: 10
          start_from: "1"
          alphabets:
            - type: digits

  - table_name_with_schema: "public.Equity"
    target_row_number: 300
    transformations:
      - columns: ["FI name", "FI val"]
        params:
          type: loop_generator
          repeatable: true
          source:
            value_source: "MULTIPLE_PROVIDED"
            column_types:
              "FI name": STRING
              "FI val": NUMERIC
            values:
              - {"FI name": "Chase",       "FI val": "213"}
              - {"FI name": "Wells Fargo", "FI val": "3434"}

      - columns: ["Amount"]
        params:
          type: continuous_generator
          numeric_type: FLOAT
          min: 1200.34
          max: 123654.98
          mean: 2344.00
          std: 3423.00
          round: 2

      - columns: ["First Name"]
        params:
          type: person_generator
          column_templates: ["${first_name}"]

      - columns: ["Last name"]
        params:
          type: person_generator
          column_templates: ["${last_name}"]

metadata:
  tables:
    - table_name_with_schema: "public.Equity"
      foreign_keys:
        fk_user_order:
          referred_schema: "public"
          referred_table: "PartyProfile"
          columns:
            - column: "Profile ID"
              referred_column: "Profile ID"

cycle_resolution_strategy: "FAIL"
schema_creation_mode: "DROP_AND_CREATE"
table_truncation_mode: TRUNCATE
safety_mode: "RELAXED"
use_text_column_heuristics: true

The metadata.tables.foreign_keys block declares a virtual foreign key between the two files. Synthesized:

  1. Generates the parent table (PartyProfile) first, producing the pool of Profile ID values.

  2. Generates the child table (Equity), drawing each Profile ID value from the parent pool rather than inventing a new one.

The result is that every Profile ID in Equity/data.csv exists in PartyProfile/data.csv, so the two files can be joined safely downstream.

Since CSV files do not carry foreign-key metadata of their own, the relationship must be declared in the configuration. This is the same mechanism used for virtual foreign keys on database workflows.