CTEs — the building blocks of dbt models
Common Table Expressions (CTEs) are how almost every dbt model is structured. Learn the conventions, why they matter, and the import → logic → final pattern.
Learning Objectives
By the end of this lesson, you will be able to:
- Write a multi-CTE dbt model in the standard
import → logic → finalshape - Explain why CTEs are preferred over subqueries in dbt projects
- Recognize the conventional naming patterns (
source_data,final, etc.) - Know what BigQuery actually does with CTEs at execution time
- Spot when a CTE should become its own model instead
What a CTE is
A Common Table Expression is a named SQL subquery you define at the top of a statement and reference by name below. Syntax:
with active_members as (
select *
from {{ ref('family_members') }}
where role in ('parent', 'child')
)
select count(*) from active_members
That's a one-CTE model. The with active_members as (...) block
defines a named query; the select at the end uses it like a table.
CTEs chain — you can define many in one with, each able to reference
prior ones:
with
active_members as (
select * from {{ ref('family_members') }}
where role in ('parent', 'child')
),
parents as (
select * from active_members where role = 'parent'
),
parent_emails as (
select email from parents
)
select * from parent_emails
Three CTEs, each building on the previous. One final select. One
model.
Why dbt models lean on CTEs so heavily
Almost every nontrivial dbt model you'll read is structured as a chain of CTEs. Three reasons:
1. Readability
A 200-line model written with subqueries is a nightmare to read. The same logic split into 5 named CTEs reads top-to-bottom like a recipe.
2. Composability
You can build the model in stages, run each CTE in isolation while
debugging (replace the final select temporarily to select * from some_intermediate_cte), and recombine without restructuring.
3. The pattern matches dbt's philosophy
dbt models are about describing data transformations clearly. CTEs let you name each step. You give intent to every piece of the transformation, which becomes documentation by default.
The standard dbt model shape
The dbt community has converged on a conventional layout. Most models
you'll read in dbt-nexus and elsewhere follow this shape:
{{ config(materialized='table') }}
-- 1. Imports — one CTE per source/ref, no transformations yet
with
family_members as (
select * from {{ ref('family_members') }}
),
gmail_events as (
select * from {{ ref('gmail_events') }}
),
-- 2. Logic — one CTE per meaningful step, transformations live here
parent_emails as (
select email
from family_members
where role = 'parent'
),
emails_to_parents as (
select g.*
from gmail_events g
join parent_emails p on lower(g.recipient_email) = lower(p.email)
),
-- 3. Final — exactly one select; no logic, just pick the columns you
-- want to return
final as (
select
event_id,
occurred_at,
recipient_email,
subject
from emails_to_parents
)
select * from final
Three sections, in order:
| Section | Purpose | Convention |
|---|---|---|
| Imports | Pull every ref() / source() into a named CTE |
One CTE per ref, named after the ref |
| Logic | Each meaningful transformation gets a named CTE | Verb-y or noun-y names that describe the step |
| Final | One last CTE called final |
Just picks columns and order; no joins or filters |
The model ends with select * from final. That last line is almost
always the same — the model's interesting work is above it.
Naming conventions
- Import CTEs use the same name as the model they reference
(
family_members,gmail_events). Sometimes shortened (messagesforstg_gmail_messages). - Logic CTEs describe what the step produces, not what it does
(
parent_emails,emails_to_parents,most_recent_message_per_user). - Final CTE is literally named
final. Universal convention.
There's no compiler magic to any of these — they're just conventions that make every model readable at a glance.
CTEs vs subqueries
The same model written with subqueries instead of CTEs:
{{ config(materialized='table') }}
select
event_id,
occurred_at,
recipient_email,
subject
from (
select g.*
from {{ ref('gmail_events') }} g
join (
select email
from {{ ref('family_members') }}
where role = 'parent'
) p on lower(g.recipient_email) = lower(p.email)
)
Same logic. Much harder to follow:
- Reading bottom-up to figure out what
pis - Nested parens you have to mentally bracket
- No names to ask "what does this CTE produce?"
Worse: when you want to debug, you can't easily run a subquery in
isolation. With CTEs you change select * from final to
select * from parent_emails and run.
The rule: use CTEs, not subqueries. Almost every dbt-nexus model follows this rule.
What BigQuery actually does at execution time
A practical note: CTEs in BigQuery (and most modern warehouses) are
inlined at execution time. They're not materialized as temporary
tables. Each select that references a CTE re-executes the CTE's
logic.
What this means in practice:
- If you reference the same CTE three times downstream, it runs three times
- For tiny CTEs (a few rows, a simple filter) this is fine
- For expensive CTEs (giant joins, window functions over millions of
rows) referenced many times, you may want to break them out into a
separate model with
materialized='table'
dbt does have an ephemeral materialization (see
2.4 Materializations) — but at execution it works
the same way: dbt inlines the model as a CTE in every downstream
model. Same trade-off, just managed by dbt.
When a CTE should become its own model
A rough rule: if a CTE is referenced by more than one downstream model, it deserves to be its own dbt model.
| Signal | Probably keep as a CTE | Probably extract to a model |
|---|---|---|
| Used only inside this one model | ✅ | — |
| Tiny, simple transform | ✅ | — |
| Re-used across 2+ models | — | ✅ |
| Expensive to compute, referenced multiple times | — | ✅ (as table) |
| Has its own meaningful business identity | — | ✅ |
The win when you extract: the work happens once, downstream models share the result, and your DAG accurately reflects the dependency.
A complete Doe family example
Let's build a model that counts how many calendar events each parent attended in the last 30 days.
models/marts/parent_calendar_activity.sql:
{{ config(materialized='table') }}
with
-- Imports
family_members as (
select * from {{ ref('family_members') }}
),
calendar_events as (
select * from {{ ref('google_calendar_events') }}
),
calendar_attendees as (
select * from {{ ref('google_calendar_entity_identifiers') }}
),
-- Logic
parents as (
select member, lower(email) as email
from family_members
where role = 'parent'
),
recent_attendance as (
select
a.event_id,
a.occurred_at,
lower(a.identifier_value) as attendee_email
from calendar_attendees a
where a.identifier_type = 'email'
and a.role = 'attendee'
and a.occurred_at >= timestamp_sub(current_timestamp(), interval 30 day)
),
parent_attendance as (
select
p.member,
r.event_id,
r.occurred_at
from recent_attendance r
join parents p on p.email = r.attendee_email
),
-- Final
final as (
select
member,
count(distinct event_id) as events_attended_30d,
max(occurred_at) as last_attended_at
from parent_attendance
group by member
)
select * from final
Six CTEs total — three imports, two logic steps, one final. Every step
has a name that tells you what it produces. Anyone reading this can
follow the chain top-to-bottom without ever wondering "wait, what's
p?"
Hands-On Exercise
-
Take this nested-subquery model:
select email_count, member from ( select count(*) as email_count, m.member from {{ source('gmail', 'messages') }} g join ( select member, email from {{ ref('family_members') }} where role = 'child' ) m on lower(g.sender_email) = lower(m.email) group by m.member ) order by email_count descRefactor it into the
import → logic → finalCTE shape. Save it asmodels/marts/kid_email_send_counts.sqland run it. -
Build the
parent_calendar_activitymodel from the worked example above. Run it and inspect the output. -
(Stretch) Pick the longest model you've written so far and refactor it to use the convention. Read both versions back-to-back — which is easier to follow?
Summary
| Concept | Key takeaway |
|---|---|
| CTE | A named subquery in a with clause; reference it like a table below |
| Why use them | Readability, composability, debuggability |
| Standard shape | import → logic → final — three sections, one final select * from final |
| Naming | Import CTEs match their refs; logic CTEs describe what they produce |
| vs subqueries | Always prefer CTEs in dbt models |
| Execution | BigQuery inlines CTEs; reused CTEs re-execute on every reference |
| When to extract | A CTE used by 2+ models, or expensive + re-used, should be its own model |
Next Lesson
You can structure a model now. The next thing is how dbt persists it to BigQuery — that's 2.4 Materializations.