Data Lineage & Tracking
Track data transformations, understand data provenance, and audit how data flows through platform workflows.
Overview
Data lineage tracks the history and transformations applied to data. The platform provides several mechanisms to understand how source data becomes transformed data, which transformers were applied, and how records relate across databases.
Lineage Tracking Features
1. Workflow Execution History
The platform records every workflow execution:
- Execution ID: Unique identifier for each run
- Timestamp: When the workflow executed
- Configuration: Snapshot of workflow config used
- Results: Row counts, errors, duration
- User: Who triggered the execution
Access via:
* Web UI: Workflows → History tab
* REST API: /api/workflows/{id}/executions
* Database: workflow_execution table (metadata DB)
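As a minimal sketch, the REST endpoint above could be read with Python's standard library. The base URL, the bearer-token authentication, and the assumption that the response body is JSON are all illustrative and not confirmed by this page; check the Public API reference for the actual contract.

```python
import json
import urllib.request

# Hypothetical base URL; replace with your platform host.
BASE_URL = "https://platform.example.com"


def executions_url(workflow_id: str, base_url: str = BASE_URL) -> str:
    """Build the execution-history URL for one workflow."""
    return f"{base_url.rstrip('/')}/api/workflows/{workflow_id}/executions"


def fetch_executions(workflow_id: str, token: str):
    """Fetch execution history. Assumes bearer auth and a JSON response body."""
    request = urllib.request.Request(
        executions_url(workflow_id),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_executions("my-workflow", token)` would return whatever JSON the endpoint serves; the sketch deliberately makes no assumption about field names inside that payload.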
2. Transformation Logs
Detailed logs of the transformation process capture:
- Tables processed and order
- Transformers applied per column
- Rows read, transformed, written
- Errors and warnings
- Performance metrics
Example log:
INFO: Processing table public.customer
INFO: Applying Email transformer to column 'email'
INFO: Applying Phone transformer to column 'phone'
INFO: Processed 10,000 rows in 12 seconds (833 rows/sec)
INFO: Table public.customer complete
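For lineage questions such as "which transformer touched this column?", logs in this shape can be summarized with a few regular expressions. The sketch below assumes the log format matches the example above exactly; real logs may carry timestamps or other prefixes, and the function name is hypothetical.

```python
import re

# Patterns match the example log format shown above; real logs may differ.
TABLE_RE = re.compile(r"INFO: Processing table (\S+)")
TRANSFORMER_RE = re.compile(r"INFO: Applying (\w+) transformer to column '([^']+)'")
ROWS_RE = re.compile(r"INFO: Processed ([\d,]+) rows in (\d+) seconds")


def parse_transformation_log(lines):
    """Return {table: {"transformers": {column: name}, "rows": int, "seconds": int}}."""
    summary = {}
    current = None
    for line in lines:
        if m := TABLE_RE.search(line):
            current = m.group(1)
            summary[current] = {"transformers": {}, "rows": None, "seconds": None}
        elif current and (m := TRANSFORMER_RE.search(line)):
            summary[current]["transformers"][m.group(2)] = m.group(1)
        elif current and (m := ROWS_RE.search(line)):
            summary[current]["rows"] = int(m.group(1).replace(",", ""))
            summary[current]["seconds"] = int(m.group(2))
    return summary


log = """\
INFO: Processing table public.customer
INFO: Applying Email transformer to column 'email'
INFO: Applying Phone transformer to column 'phone'
INFO: Processed 10,000 rows in 12 seconds (833 rows/sec)
INFO: Table public.customer complete
""".splitlines()

result = parse_transformation_log(log)
```

Here `result["public.customer"]["transformers"]` maps each column to the transformer that was applied to it.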
3. Configuration Versioning
Track configuration changes over time:
- Store workflow configs in version control (Git)
- Tag executions with Git commit hash
- Compare configurations across runs
- Roll back to previous configs
Best Practice:
# Store in Git
git add workflow.yaml
git commit -m "Mask SSN column in customer table"
git tag v1.2.3
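Comparing configurations across runs can also be done outside Git, for example on the configuration snapshots stored with each execution. A minimal sketch using Python's `difflib`; the labels and sample configs are hypothetical.

```python
import difflib


def diff_configs(old_text: str, new_text: str, old_label: str, new_label: str) -> str:
    """Return a unified diff between two configuration snapshots."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(diff)


# Hypothetical snapshots from two runs.
old = "default_config:\n  seed: 42\n"
new = "default_config:\n  seed: 43\n"
print(diff_configs(old, new, "run-1/workflow.yaml", "run-2/workflow.yaml"))
```

An empty string means the two snapshots are identical, which is a quick way to confirm that two runs used the same configuration.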
4. Data Provenance
Understand data origins:
Masking Mode:
* Source row ID → Destination row ID mapping
* Typically 1:1 mapping
* Primary keys usually preserved
Generation Mode:
* Seed value determines reproducibility
* Same seed = same generated data
* Random seed = different data each run
Subsetting Mode:
* Track which source rows were included
* Record filter conditions applied
* Document FK traversal paths
Tracking Mechanisms
Deterministic Transformations
Ensure reproducibility with seeds:
default_config:
  seed: 42  # Global seed for all transformers
Result:
* Same input + same seed = same output
* Reproducible across runs
* Useful for testing and validation
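The "same input + same seed = same output" property can be illustrated with a keyed hash. This is only a sketch of the idea: the function below is hypothetical and is not the platform's Email transformer, whose algorithm is not described here.

```python
import hashlib
import hmac


def mask_email(email: str, seed: int) -> str:
    """Deterministically map an email to a masked address using a keyed hash.

    Illustrative only: this is not the platform's Email transformer.
    """
    key = str(seed).encode("utf-8")
    digest = hmac.new(key, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


first = mask_email("alice@corp.com", seed=42)
second = mask_email("alice@corp.com", seed=42)
other_seed = mask_email("alice@corp.com", seed=43)
```

`first` and `second` are identical because both the input and the seed match, while `other_seed` differs. The same reasoning explains why a rerun with an unchanged seed can reproduce a value under investigation.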
Audit Trails
Track who did what:
- User authentication (LDAP, SSO)
- Workflow execution triggers
- Configuration changes
- Data access patterns
See: RBAC
Use Cases
1. Compliance Reporting
Demonstrate compliance with regulations:
GDPR Example:
* Show all PII columns masked
* Prove irreversibility of masking
* Document retention policies
* Audit data access
Report includes:
* Workflow configuration (what was masked)
* Execution history (when it ran)
* Transformer details (how it was masked)
* Access logs (who accessed the data)
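One way to support the "all PII columns masked" claim is to compare a PII inventory against the workflow configuration. The sketch below assumes the configuration has already been parsed into a list of entries with a `columns` key, as in the `transformations` example on this page; the PII inventory and the function name are hypothetical.

```python
def unmasked_pii_columns(pii_columns, transformations):
    """Return PII columns that no transformation entry covers, in sorted order."""
    covered = set()
    for entry in transformations:
        covered.update(entry.get("columns", []))
    return sorted(set(pii_columns) - covered)


# Hypothetical inputs: a PII inventory and a parsed `transformations` list.
pii = ["email", "phone", "ssn"]
transformations = [
    {"columns": ["email"], "type": "Email"},
    {"columns": ["phone"], "type": "Phone"},
]
missing = unmasked_pii_columns(pii, transformations)
```

A non-empty result (here, the `ssn` column) flags columns that still need a transformer before the report can state full coverage.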
2. Debugging
Investigate data issues:
Scenario: "Customer 123 has wrong email in dev database"
Investigation:
1. Check workflow execution logs
2. Find when customer 123 was processed
3. Review Email transformer config
4. Verify masking logic
5. Reproduce with same seed if needed
3. Change Impact Analysis
Understand configuration changes:
Before changing transformer:
* Review current configuration
* Check execution history
* Document expected changes
* Test on small dataset
After changing transformer:
* Compare output with previous run
* Validate data quality
* Check referential integrity
* Review performance impact
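The referential-integrity check can be sketched as a set comparison between child foreign keys and parent primary keys. The sample values and the function name are hypothetical; in practice the key lists would come from queries against the destination database.

```python
def orphaned_keys(child_fk_values, parent_pk_values):
    """Return foreign-key values in the child table with no matching parent key."""
    parents = set(parent_pk_values)
    return sorted(
        {fk for fk in child_fk_values if fk is not None and fk not in parents}
    )


# Hypothetical key values read from the destination database.
customer_ids = [1, 2, 3]
order_customer_ids = [1, 1, 3, 4, None]  # None models a NULL foreign key
orphans = orphaned_keys(order_customer_ids, customer_ids)
```

Any value in `orphans` points at a child row whose parent is missing after the run, which is worth investigating before the new configuration is adopted.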
Lineage Visualization
Workflow Dependency Graph
Visualize table dependencies:
customer
├── orders
│   ├── order_items
│   └── shipments
├── addresses
└── payment_methods
Available in: Web UI (Workflow → Visualization)
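A dependency graph like this one also implies a safe processing order: parents before the tables that reference them. The sketch below derives such an order with Python's `graphlib`; it mirrors the example tree and is not a description of how the platform itself orders tables.

```python
from graphlib import TopologicalSorter

# Child table -> parent tables it references, mirroring the tree above.
dependencies = {
    "customer": set(),
    "orders": {"customer"},
    "order_items": {"orders"},
    "shipments": {"orders"},
    "addresses": {"customer"},
    "payment_methods": {"customer"},
}

# static_order() yields parents before the tables that depend on them.
order = list(TopologicalSorter(dependencies).static_order())
```

`customer` always comes first, and `orders` always precedes `order_items` and `shipments`; the relative order of sibling tables is not guaranteed.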
Best Practices
1. Use Git for Configurations
git init
git add workflow.yaml inventory.yaml
git commit -m "Initial workflow configuration"
Benefits:
* Version history
* Diff configurations
* Rollback capabilities
* Collaboration
2. Document Transformers
Add comments to configurations:
transformations:
  # Mask email addresses for GDPR compliance
  # Uses deterministic masking (same input → same output)
  - columns: ["email"]
    type: Email
    params:
      unique: true
      seed: 12345
Integration with Data Catalogs
Integrate platform lineage with external tools:
Metadata Export
Export to common formats:
- JSON: Structured metadata
- CSV: Tabular format
- GraphML: Graph format for visualization
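To give a feel for the JSON and CSV shapes, the sketch below serializes a few lineage records with Python's standard library. The record fields (`table`, `column`, `transformer`) are hypothetical and do not describe the platform's actual export schema.

```python
import csv
import io
import json

# Hypothetical lineage records; field names are illustrative.
records = [
    {"table": "public.customer", "column": "email", "transformer": "Email"},
    {"table": "public.customer", "column": "phone", "transformer": "Phone"},
]


def to_json(rows) -> str:
    """Serialize lineage records as structured JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows) -> str:
    """Serialize lineage records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["table", "column", "transformer"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

JSON keeps nesting available for richer metadata, while CSV is easier to load into spreadsheets or a catalog's bulk-import tool.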
What’s Next
- Best Practices - Configuration management
- Security & Compliance - Audit and compliance
- Public API - API for lineage