# Development Data Filtering
Guide for limiting data volume in development environments using timestamp-based filtering to improve build performance while maintaining data integrity.
When working with large datasets spanning years or decades of data, development builds can become slow and expensive. The `limit_by_timestamp` macro provides a clean, maintainable way to filter data by timestamp in development environments while keeping full datasets in production.
## The Problem

Large source datasets create challenges in development:

- Slow builds: Processing decades of data for every test run
- High costs: Expensive warehouse compute for development iterations
- Poor developer experience: Waiting minutes for builds that could take seconds
- Unnecessary data: Most development work only needs recent data
## The Solution: The `limit_by_timestamp` Macro

The `limit_by_timestamp` macro provides environment-aware timestamp filtering that:

- ✅ Filters by actual dates (e.g., 2024-01-01), not relative time windows
- ✅ Is configurable per environment (dev, prod, or custom targets)
- ✅ Can be overridden per model when needed
- ✅ Is cross-database compatible (Snowflake, BigQuery, Postgres, etc.)
- ✅ Cascades automatically through joins and downstream models
- ✅ Maintains data integrity and relationships
## Configuration

### Basic Setup

Add timestamp limits to your `dbt_project.yml`:

```yaml
vars:
  # Development data limiting
  dev_timestamp_limit: '2024-01-01'   # Only data from 2024 onwards in dev
  prod_timestamp_limit: '2019-01-01'  # Optional: limit prod data if needed
```
### Environment Targets

The macro automatically detects your environment:

- `dev` or `development` targets → uses `dev_timestamp_limit`
- `prod` or `production` targets → uses `prod_timestamp_limit`
- Other targets → no filtering (the macro returns `true`)

If a limit variable is not set for an environment, no filtering occurs.
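For reference, here is a minimal sketch of how a macro with this behavior could be implemented. It is an illustration consistent with the behavior documented above, not the actual nexus source, which may differ:

```sql
{% macro limit_by_timestamp(timestamp_column='occurred_at', dev_limit=none, prod_limit=none) %}
    {# Resolve the limit for the current target; per-model kwargs win over project vars #}
    {% if target.name in ['dev', 'development'] %}
        {% set limit = dev_limit if dev_limit is not none else var('dev_timestamp_limit', none) %}
    {% elif target.name in ['prod', 'production'] %}
        {% set limit = prod_limit if prod_limit is not none else var('prod_timestamp_limit', none) %}
    {% else %}
        {% set limit = none %}  {# other targets: no filtering #}
    {% endif %}

    {% if limit is not none %}
        {{ timestamp_column }} >= '{{ limit }}'
    {% else %}
        true
    {% endif %}
{% endmacro %}
```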
## Usage Patterns

### 1. Filter at the Event Level (Recommended)

Apply the filter in intermediate-layer event models, where you have clean, normalized timestamps:

```sql
-- intermediate/source_order_events.sql
{{ config(tags=['identity-resolution', 'events'], materialized='table') }}

with orders as (
    select * from {{ ref('source_orders') }}
    where {{ nexus.limit_by_timestamp('order_date') }} -- 👈 Filter here
),

order_events as (
    select
        {{ nexus.create_nexus_id('event', ['order_id', 'order_date']) }} as event_id,
        order_date as occurred_at,
        'order' as event_type,
        'order_placed' as event_name,
        'source' as source,
        -- ... other fields
    from orders
)

select * from order_events
```
### 2. Multiple Event Types from Same Source

Filter once at the source CTE:

```sql
-- intermediate/source_customer_events.sql
with customers as (
    select * from {{ ref('source_customers') }}
    where {{ nexus.limit_by_timestamp('created_at') }} -- 👈 Filter once
),

signup_events as (
    select
        created_at as occurred_at,
        'signup' as event_name,
        -- ... fields
    from customers
    where created_at is not null
),

profile_update_events as (
    select
        updated_at as occurred_at,
        'profile_updated' as event_name,
        -- ... fields
    from customers
    where updated_at is not null
)

select * from signup_events
union all
select * from profile_update_events
```
### 3. Custom Column Names

Override the default `occurred_at` column:

```sql
where {{ nexus.limit_by_timestamp('enrollment_date') }}
where {{ nexus.limit_by_timestamp('first_seen_at') }}
```
### 4. Per-Model Overrides

Override the global limit for specific models:

```sql
-- This model needs only the last year in dev
where {{ nexus.limit_by_timestamp('occurred_at', dev_limit='2024-01-01') }}

-- This model needs more history in dev but less in prod
where {{ nexus.limit_by_timestamp('occurred_at', dev_limit='2022-01-01', prod_limit='2020-01-01') }}
```
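Assuming the per-model kwargs take precedence over the project-level vars, the first example compiles to the same SQL on a dev target regardless of what `dev_timestamp_limit` is set to:

```sql
where occurred_at >= '2024-01-01'
```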
## Automatic Cascade Effect

The beauty of filtering at the event level is that it cascades automatically through your entire pipeline:

```text
✅ Events (filtered)
      ↓
✅ Person Identifiers (inherits filter via ref)
      ↓
✅ Person Traits (inherits filter via ref)
      ↓
✅ Group Identifiers (inherits filter via ref)
      ↓
✅ Group Traits (inherits filter via ref)
      ↓
✅ Unioned Models (inherits filter via union)
      ↓
✅ Identity Resolution (processes filtered data)
```
### Example Cascade

```sql
-- intermediate/source_order_events.sql
with orders as (
    select * from {{ ref('source_orders') }}
    where {{ nexus.limit_by_timestamp('order_date') }} -- ✅ Filter applied
)
-- ... event logic
```

Note: with the entity-centric architecture, identifiers and traits are extracted in the source union models (source_entity_identifiers.sql, source_entity_traits.sql) rather than in separate intermediate models per event type.

```sql
-- source_events.sql (union layer)
-- Each input model is already filtered, so the union inherits the limit.
{{ dbt_utils.union_relations([
    ref('source_order_events'),
    ref('source_payment_events'),
    ref('source_support_events')
]) }}
```
## Where to Apply Filters
### DO Apply Filters In
- Intermediate event models - Where clean timestamps exist
- Source CTEs - At the top of event models for single-point filtering
- Event-generating queries - Before creating event records
### DON'T Apply Filters In
- Base layer - Timestamps may not be cleaned/standardized yet
- Normalized layer - Keep full normalization/deduplication logic intact
- Identifier/Trait models - They inherit from filtered events automatically
- Union models - They inherit from filtered intermediate models
- Identity resolution models - They process already-filtered data
## Architecture Pattern

Following the recommended four-layer architecture:

```text
Layer 1: Base (raw)            → ❌ No filter (raw data)
Layer 2: Normalized            → ❌ No filter (full normalization)
Layer 3: Intermediate (events) → ✅ FILTER HERE
Layer 4: Unioned               → ❌ No filter (inherits)
```
## Performance Benefits
Real-world results from filtering at the intermediate layer:
- Before filtering: 7+ minutes to build source models
- After filtering: 20 seconds to build source models
- Speedup: ~21x faster builds in development
- Data integrity: Maintained across all relationships and joins
## Join Behavior

When filtering events that are joined with other tables:

```sql
with events as (
    select * from {{ ref('source_events') }}
    where {{ nexus.limit_by_timestamp('event_date') }} -- Small filtered dataset
),

participants as (
    select * from {{ ref('source_participants') }}
    -- Full table, but...
),

events_with_participants as (
    select
        e.*,
        p.participant_name,
        p.participant_email
    from events e             -- ← Filtered (small)
    left join participants p  -- ← Full table (large)
        on e.participant_id = p.id
)

select * from events_with_participants
```

Result: only the filtered events are returned, with participant details attached wherever a match exists. The database optimizer handles the join against the full participants table efficiently.
Optional performance optimization: for very large joined tables, you can pre-filter them too:

```sql
participants as (
    select * from {{ ref('source_participants') }}
    where {{ nexus.limit_by_timestamp('created_at') }}
),
```

This is optional and may improve join performance on extremely large datasets, but note that it also excludes participants created before the limit, even if they appear in recent events, so only pre-filter when that trade-off is acceptable.
## Cross-Database Compatibility

The macro generates a simple comparison that is portable across warehouses:

```sql
-- Generated SQL (works across all databases)
occurred_at >= '2024-01-01'
```

This comparison works identically in:

- Snowflake
- BigQuery
- Postgres
- Redshift
- Databricks
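If your timestamp column arrives as a string, you may need an explicit cast before comparing. A sketch using dbt's cross-database type helper (`occurred_at_raw` is an illustrative column name):

```sql
where cast(occurred_at_raw as {{ dbt.type_timestamp() }}) >= '2024-01-01'
```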
## Testing Your Filters

### Verify Row Counts

```sql
-- Check event counts before/after
select
    min(occurred_at) as earliest_event,
    max(occurred_at) as latest_event,
    count(*) as total_events
from {{ ref('source_events') }}
```

### Verify Cascade

```sql
-- Ensure identifiers match filtered events
select count(distinct event_id) as unique_event_ids
from {{ ref('source_entity_identifiers') }}
where entity_type = 'person';

-- Should match the event count (assuming a 1:1 relationship)
select count(*) as total_events
from {{ ref('source_events') }};
```
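You can also codify the expectation as a singular test. A sketch, with an illustrative file name; because the macro returns `true` when no limit is configured, the test is a no-op outside filtered environments:

```sql
-- tests/assert_no_events_before_limit.sql
-- Fails if any event predates the configured timestamp limit
select *
from {{ ref('source_events') }}
where not ({{ nexus.limit_by_timestamp('occurred_at') }})
```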
## Troubleshooting

### Filter Not Working

Check your target name:

```bash
dbt run --target dev  # Should use dev_timestamp_limit
```

The macro only applies filtering to targets named dev, development, prod, or production.
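If you are unsure which target a run is using, you can log it from any model during compilation; `target.name` and `log` are standard dbt Jinja:

```sql
-- Temporary debug line; remove once confirmed
{{ log('Compiling with target: ' ~ target.name, info=true) }}
```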
### Different Results in Dev vs Prod

This is expected behavior! Dev has less data by design. To test with full data locally:

```bash
# Option 1: Use the prod target
dbt run --target prod

# Option 2: Temporarily remove the limit
dbt run --vars '{"dev_timestamp_limit": null}'

# Option 3: Override with an older date
dbt run --vars '{"dev_timestamp_limit": "2019-01-01"}'
```
### Still Too Much Data

Adjust your limit to a more recent date:

```yaml
vars:
  dev_timestamp_limit: '2024-06-01'  # Only data from June 2024 onwards
```
## Best Practices

- Start conservative: begin with more data (2+ years), then reduce if needed
- One filter point: apply at the event level and let it cascade
- Document limits: comment in `dbt_project.yml` why you chose specific dates (see the example below)
- Test in prod: periodically run with prod data to catch edge cases
- Monitor performance: track build times to verify improvements
- Consider data relationships: ensure the filtered timeframe includes complete business cycles
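A documented limit might look like this (dates and reasons are illustrative):

```yaml
vars:
  # Dev keeps one full calendar year so seasonal logic still sees complete cycles
  dev_timestamp_limit: '2024-01-01'
  # Prod limit trims legacy data that predates the current schema
  prod_timestamp_limit: '2019-01-01'
```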
## Related Documentation
- Recommended Source Model Structure - Four-layer architecture
- Source Tests - Testing strategies for source models
- Create Source Models - Step-by-step source creation guide