How Subsetting Works

Learn how to create smaller, representative subsets of production databases while maintaining referential integrity.

Overview

Sometimes you only need a small subset of your production database - perhaps 10%, 25%, or 50% of the data for development, testing, or demo purposes. The Synthesized Platform makes this easy with the target_ratio configuration option.

Basic Subsetting Configuration

The simplest way to subset a database is to use the target_ratio option with KEEP mode:

default_config:
  mode: KEEP
  target_ratio: 0.5

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

This configuration:

  • Preserves 50% of records from each table

  • Randomly selects rows while respecting foreign key constraints

  • Maintains referential integrity - all foreign keys remain valid

  • Truncates output tables before writing data

  • Creates schema if it doesn’t exist in the output database

How It Works

When you run a subsetting workflow:

  1. Random Selection: The platform randomly selects the specified percentage of rows from each table

  2. Foreign Key Following: For selected rows, it automatically includes all related records in child tables

  3. Integrity Preservation: All foreign key relationships remain valid - no orphaned records

default_config:
  mode: KEEP
  target_ratio: 0.5

Result: Approximately 50% of your database, with all relationships intact.

Understanding Subsetting Results

Due to foreign key constraints, subsetting may produce different results for different tables. Some tables may have:

  • Fewer records than expected - Child tables depend on parent tables

  • Zero records - If parent tables have no qualifying records

  • More records than expected - Referenced parent records must be included

Table 1. Example: Subsetting Results with 0.5 Ratio
tablename numrows

actor

100

address

131

category

8

city

280

film

0

film_actor

0

payment

8025

rental

0

In this example, the film table ended up with 0 records, which caused film_actor and rental tables to also have 0 records due to foreign key constraints.

Table-Specific Subsetting

For more control, you can apply different ratios to specific tables. This is especially useful when you want to reduce only the largest tables:

default_config:
  mode: KEEP

tables:
  - table_name_with_schema: "public.rental"
    target_ratio: 0.1
  - table_name_with_schema: "public.payment"
    target_ratio: 0.1

This configuration:

  • Reduces rental table to 10% of original size

  • Reduces payment table to 10% of original size

  • Keeps other tables unchanged (no target_ratio specified)

Table 2. Example: Results with Table-Specific Ratios
tablename numrows

actor

200

address

603

category

16

city

600

film

1000

film_actor

5462

payment

1605

rental

1605

By targeting only specific tables, you get better control over the final database size while maintaining referential integrity.

Combining Subsetting with Filtering

You can combine subsetting with data filtering for even more precise control. See Data Filtering for details on how to:

  • Filter by specific column values

  • Filter by date ranges

  • Use complex WHERE conditions

  • Apply filters to multiple tables

Common Use Cases

Development Environment (50% of production)

default_config:
  mode: KEEP
  target_ratio: 0.5

Small Test Environment (10% of production)

default_config:
  mode: KEEP
  target_ratio: 0.1

Reduce Only Large Transaction Tables

default_config:
  mode: KEEP

tables:
  - table_name_with_schema: "public.orders"
    target_ratio: 0.1
  - table_name_with_schema: "public.order_items"
    target_ratio: 0.1
  - table_name_with_schema: "public.logs"
    target_ratio: 0.05

Performance Considerations

Subsetting performance depends on:

  • Database size - Larger databases take longer

  • Number of foreign keys - More relationships require more traversal

  • Target ratio - Smaller ratios are generally faster

For optimal performance:

  1. Index foreign key columns in your source database

  2. Use table-specific ratios instead of global ratios when possible

  3. Consider filtering in addition to subsetting for very specific needs

What’s Next