How Subsetting Works
Learn how to create smaller, representative subsets of production databases while maintaining referential integrity.
Overview
Sometimes you only need a small subset of your production database - perhaps 10%, 25%, or 50% of the data for development, testing, or demo purposes. The Synthesized Platform makes this easy with the target_ratio configuration option.
Basic Subsetting Configuration
The simplest way to subset a database is to use the target_ratio option with KEEP mode:
default_config:
mode: KEEP
target_ratio: 0.5
table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
This configuration:
-
Preserves 50% of records from each table
-
Randomly selects rows while respecting foreign key constraints
-
Maintains referential integrity - all foreign keys remain valid
-
Truncates output tables before writing data
-
Creates schema if it doesn’t exist in the output database
How It Works
When you run a subsetting workflow:
-
Random Selection: The platform randomly selects the specified percentage of rows from each table
-
Foreign Key Following: For selected rows, it automatically includes all related records in child tables
-
Integrity Preservation: All foreign key relationships remain valid - no orphaned records
default_config:
mode: KEEP
target_ratio: 0.5
Result: Approximately 50% of your database, with all relationships intact.
Understanding Subsetting Results
Due to foreign key constraints, subsetting may produce different results for different tables. Some tables may have:
-
Fewer records than expected - Child tables depend on parent tables
-
Zero records - If parent tables have no qualifying records
-
More records than expected - Referenced parent records must be included
| tablename | numrows |
|---|---|
actor |
100 |
address |
131 |
category |
8 |
city |
280 |
film |
0 |
film_actor |
0 |
payment |
8025 |
rental |
0 |
In this example, the film table ended up with 0 records, which caused film_actor and rental tables to also have 0 records due to foreign key constraints.
Table-Specific Subsetting
For more control, you can apply different ratios to specific tables. This is especially useful when you want to reduce only the largest tables:
default_config:
mode: KEEP
tables:
- table_name_with_schema: "public.rental"
target_ratio: 0.1
- table_name_with_schema: "public.payment"
target_ratio: 0.1
This configuration:
-
Reduces rental table to 10% of original size
-
Reduces payment table to 10% of original size
-
Keeps other tables unchanged (no target_ratio specified)
| tablename | numrows |
|---|---|
actor |
200 |
address |
603 |
category |
16 |
city |
600 |
film |
1000 |
film_actor |
5462 |
payment |
1605 |
rental |
1605 |
By targeting only specific tables, you get better control over the final database size while maintaining referential integrity.
Combining Subsetting with Filtering
You can combine subsetting with data filtering for even more precise control. See Data Filtering for details on how to:
-
Filter by specific column values
-
Filter by date ranges
-
Use complex WHERE conditions
-
Apply filters to multiple tables
Performance Considerations
Subsetting performance depends on:
-
Database size - Larger databases take longer
-
Number of foreign keys - More relationships require more traversal
-
Target ratio - Smaller ratios are generally faster
For optimal performance:
-
Index foreign key columns in your source database
-
Use table-specific ratios instead of global ratios when possible
-
Consider filtering in addition to subsetting for very specific needs
What’s Next
-
Subsetting Guide - Step-by-step tutorial
-
Data Filtering - Advanced filtering techniques
-
Configuration Reference - Complete configuration options