Nexus Data

Learning Objectives

By the end of this lesson, you will be able to:

Write a multi-CTE dbt model in the standard import → logic → final shape
Explain why CTEs are preferred over subqueries in dbt projects
Recognize the conventional naming patterns (source_data, final, etc.)
Know what BigQuery actually does with CTEs at execution time
Spot when a CTE should become its own model instead

What a CTE is

A Common Table Expression is a named SQL subquery you define at the top of a statement and reference by name below. Syntax:

with active_members as (
    select *
    from {{ ref('family_members') }}
    where role in ('parent', 'child')
)

select count(*) from active_members

That's a one-CTE model. The with active_members as (...) block defines a named query; the select at the end uses it like a table.

CTEs chain — you can define many in one with, each able to reference prior ones:

with
active_members as (
    select * from {{ ref('family_members') }}
    where role in ('parent', 'child')
),

parents as (
    select * from active_members where role = 'parent'
),

parent_emails as (
    select email from parents
)

select * from parent_emails

Three CTEs, each building on the previous. One final select. One model.

Why dbt models lean on CTEs so heavily

Almost every nontrivial dbt model you'll read is structured as a chain of CTEs. Three reasons:

1. Readability

A 200-line model written with subqueries is a nightmare to read. The same logic split into 5 named CTEs reads top-to-bottom like a recipe.

2. Composability

You can build the model in stages, run each CTE in isolation while debugging (replace the final select temporarily to select * from some_intermediate_cte), and recombine without restructuring.

3. The pattern matches dbt's philosophy

dbt models are about describing data transformations clearly. CTEs let you name each step. You give intent to every piece of the transformation, which becomes documentation by default.

The standard dbt model shape

The dbt community has converged on a conventional layout. Most models you'll read in dbt-nexus and elsewhere follow this shape:

{{ config(materialized='table') }}

-- 1. Imports — one CTE per source/ref, no transformations yet
with
family_members as (
    select * from {{ ref('family_members') }}
),

gmail_events as (
    select * from {{ ref('gmail_events') }}
),

-- 2. Logic — one CTE per meaningful step, transformations live here
parent_emails as (
    select email
    from family_members
    where role = 'parent'
),

emails_to_parents as (
    select g.*
    from gmail_events g
    join parent_emails p on lower(g.recipient_email) = lower(p.email)
),

-- 3. Final — exactly one select; no logic, just pick the columns you
-- want to return
final as (
    select
        event_id,
        occurred_at,
        recipient_email,
        subject
    from emails_to_parents
)

select * from final

Three sections, in order:

Section	Purpose	Convention
Imports	Pull every `ref()` / `source()` into a named CTE	One CTE per ref, named after the ref
Logic	Each meaningful transformation gets a named CTE	Verb-y or noun-y names that describe the step
Final	One last CTE called `final`	Just picks columns and order; no joins or filters

The model ends with select * from final. That last line is almost always the same — the model's interesting work is above it.

Naming conventions

Import CTEs use the same name as the model they reference (family_members, gmail_events). Sometimes shortened (messages for stg_gmail_messages).
Logic CTEs describe what the step produces, not what it does (parent_emails, emails_to_parents, most_recent_message_per_user).
Final CTE is literally named final. Universal convention.

There's no compiler magic to any of these — they're just conventions that make every model readable at a glance.

CTEs vs subqueries

The same model written with subqueries instead of CTEs:

{{ config(materialized='table') }}

select
    event_id,
    occurred_at,
    recipient_email,
    subject
from (
    select g.*
    from {{ ref('gmail_events') }} g
    join (
        select email
        from {{ ref('family_members') }}
        where role = 'parent'
    ) p on lower(g.recipient_email) = lower(p.email)
)

Same logic. Much harder to follow:

Reading bottom-up to figure out what p is
Nested parens you have to mentally bracket
No names to ask "what does this CTE produce?"

Worse: when you want to debug, you can't easily run a subquery in isolation. With CTEs you change select * from final to select * from parent_emails and run.

The rule: use CTEs, not subqueries. Almost every dbt-nexus model follows this rule.

What BigQuery actually does at execution time

A practical note: CTEs in BigQuery (and most modern warehouses) are inlined at execution time. They're not materialized as temporary tables. Each select that references a CTE re-executes the CTE's logic.

What this means in practice:

If you reference the same CTE three times downstream, it runs three times
For tiny CTEs (a few rows, a simple filter) this is fine
For expensive CTEs (giant joins, window functions over millions of rows) referenced many times, you may want to break them out into a separate model with materialized='table'

dbt does have an ephemeral materialization (see 2.4 Materializations) — but at execution it works the same way: dbt inlines the model as a CTE in every downstream model. Same trade-off, just managed by dbt.

When a CTE should become its own model

A rough rule: if a CTE is referenced by more than one downstream model, it deserves to be its own dbt model.

Signal	Probably keep as a CTE	Probably extract to a model
Used only inside this one model	✅	—
Tiny, simple transform	✅	—
Re-used across 2+ models	—	✅
Expensive to compute, referenced multiple times	—	✅ (as `table`)
Has its own meaningful business identity	—	✅

The win when you extract: the work happens once, downstream models share the result, and your DAG accurately reflects the dependency.

A complete Doe family example

Let's build a model that counts how many calendar events each parent attended in the last 30 days.

models/marts/parent_calendar_activity.sql:

{{ config(materialized='table') }}

with
-- Imports
family_members as (
    select * from {{ ref('family_members') }}
),

calendar_events as (
    select * from {{ ref('google_calendar_events') }}
),

calendar_attendees as (
    select * from {{ ref('google_calendar_entity_identifiers') }}
),

-- Logic
parents as (
    select member, lower(email) as email
    from family_members
    where role = 'parent'
),

recent_attendance as (
    select
        a.event_id,
        a.occurred_at,
        lower(a.identifier_value) as attendee_email
    from calendar_attendees a
    where a.identifier_type = 'email'
      and a.role = 'attendee'
      and a.occurred_at >= timestamp_sub(current_timestamp(), interval 30 day)
),

parent_attendance as (
    select
        p.member,
        r.event_id,
        r.occurred_at
    from recent_attendance r
    join parents p on p.email = r.attendee_email
),

-- Final
final as (
    select
        member,
        count(distinct event_id) as events_attended_30d,
        max(occurred_at)          as last_attended_at
    from parent_attendance
    group by member
)

select * from final

Six CTEs total — three imports, two logic steps, one final. Every step has a name that tells you what it produces. Anyone reading this can follow the chain top-to-bottom without ever wondering "wait, what's p?"

Hands-On Exercise

Take this nested-subquery model:

select email_count, member
from (
    select count(*) as email_count, m.member
    from {{ source('gmail', 'messages') }} g
    join (
        select member, email
        from {{ ref('family_members') }}
        where role = 'child'
    ) m on lower(g.sender_email) = lower(m.email)
    group by m.member
)
order by email_count desc

Refactor it into the import → logic → final CTE shape. Save it as models/marts/kid_email_send_counts.sql and run it.

Build the parent_calendar_activity model from the worked example above. Run it and inspect the output.
(Stretch) Pick the longest model you've written so far and refactor it to use the convention. Read both versions back-to-back — which is easier to follow?

Summary

Concept	Key takeaway
CTE	A named subquery in a `with` clause; reference it like a table below
Why use them	Readability, composability, debuggability
Standard shape	`import → logic → final` — three sections, one final `select * from final`
Naming	Import CTEs match their refs; logic CTEs describe what they produce
vs subqueries	Always prefer CTEs in dbt models
Execution	BigQuery inlines CTEs; reused CTEs re-execute on every reference
When to extract	A CTE used by 2+ models, or expensive + re-used, should be its own model

Next Lesson

You can structure a model now. The next thing is how dbt persists it to BigQuery — that's 2.4 Materializations.