Creating models, refs, and lineage

A model is a .sql file. ref() and source() build the dependency graph. dbt uses that graph (the DAG) to run models in the right order.

Learning Objectives

By the end of this lesson, you will be able to:

  • Write a dbt model as a single select statement in a .sql file
  • Use ref() to depend on another model
  • Use source() to depend on a raw table outside dbt
  • Read and reason about a dbt DAG (directed acyclic graph)
  • Use selectors (+, -) to run subsets of the graph

This lesson is the conceptual heart of dbt. Everything else in this module (materializations, Jinja, macros, packages) builds on the mental model you build here.


A model is a select statement

In dbt, a model is a single .sql file in models/. It contains one select statement that describes the rows the model should contain. Nothing else — no create table, no insert, no DDL.

For example, models/staging/active_family_members.sql:

select
    member,
    email,
    role
from {{ ref('family_members') }}
where role in ('parent', 'child')

When you run dbt build, dbt wraps this select in the right DDL for your warehouse and materialization (create or replace table ... as select ... for BigQuery + table), executes it, and stores the result. You write the what; dbt handles the how.

That's the single biggest mental shift coming from raw SQL scripts: you stop writing DDL. dbt manages object creation, dependencies, schema names, and rebuilds. You describe the data you want.


ref() — the function that builds the DAG

When one model needs data from another, never hard-code the table name. Use ref():

-- ✅ Right
select * from {{ ref('family_members') }}

-- ❌ Wrong
select * from doe_family_dev.family_members

ref() does two things:

  1. At compile time, dbt replaces {{ ref('family_members') }} with the fully qualified table name for the current target (dev vs prod, your schema vs a teammate's).
  2. At parse time, dbt records that this model depends on family_members. That dependency is what builds the DAG.

If you run dbt build and family_members is out of date, dbt will rebuild it first before rebuilding the model that refs it. You never have to think about ordering.


source() — for raw tables outside dbt

dbt doesn't build the raw tables landed by os-nexus (or Fivetran, or your own scripts). Those exist before dbt runs. You reference them with source() after declaring them in a YAML file.

Declare in models/sources/gmail/_sources.yml:

version: 2

sources:
  - name: gmail
    schema: gmail
    tables:
      - name: messages
      - name: recipients

Reference from a model with source('gmail', 'messages'):

-- models/staging/gmail/stg_gmail_messages.sql
select
    message_id,
    sender_email,
    sent_at,
    subject
from {{ source('gmail', 'messages') }}

source() plays the same role as ref(): it resolves to a real table name and records the dependency for the DAG. The difference is that dbt doesn't try to build sources — they're inputs to the pipeline.

The rule of thumb: everything inside models/ is a ref(); everything outside dbt is a source().


The DAG

The collection of every ref() and source() call across your project forms a directed acyclic graph — a tree of dependencies that flows in one direction (acyclic = no loops).

   gmail.messages ─────→  stg_gmail_messages ─────→  gmail_events ─────┐
                                                                       │
   gmail.recipients ───→  stg_gmail_recipients ───→  gmail_entity_id  ─┤
                                                                       ├──→  nexus_events
   google_calendar.events ──→  google_calendar_events ─────────────────┤
                                                                       │
   notion.tasks ────→  stg_notion_tasks ───→  notion_events ───────────┘

dbt walks this graph every time you run. For each model:

  1. Resolve its ref()s and source()s
  2. Sort the whole graph topologically
  3. Build models in dependency order
  4. Parallelize where the graph allows (independent branches)

You don't have to specify run order anywhere — it falls out of the ref() calls.


Viewing the DAG

In dbt Cloud

The Lineage tab in the IDE renders the DAG visually. Click a model to highlight its upstream and downstream dependencies.

Locally with dbt Power User

The Lineage panel (added by the extension) shows the same view. Click a model in the panel to navigate to its file. Click the upstream/downstream icons to focus subgraphs.

From the CLI

dbt docs generate
dbt docs serve

Opens a browser-based docs site with full lineage browsing.

The DAG is more than a pretty picture. It's how you reason about:

  • Impact analysis — change a model, see what downstream models break
  • Debugging — a stale upstream is the most common cause of downstream errors
  • Performance — long, narrow chains are slow; wide, parallel branches run fast

Selecting subsets of the DAG

You almost never want to run every model. Use the -s (select) flag to target subsets:

# Run one model
dbt build -s family_members

# Run that model and everything downstream
dbt build -s family_members+

# Run that model and everything upstream (sources, parents, grandparents)
dbt build -s +family_members

# Both — full dependency cone around it
dbt build -s +family_members+

# Everything tagged 'nexus'
dbt build -s tag:nexus

# Multiple selectors (union)
dbt build -s tag:nexus tag:marts

# Intersection
dbt build -s tag:nexus,tag:gmail

The + is read as "and everything attached to this side". This is the single most useful tool in dbt for iterating quickly during development.


A complete two-model example

Create models/staging/family_emails.sql:

select
    member,
    email
from {{ ref('family_members') }}
where email is not null

Create models/marts/parent_emails.sql:

select email
from {{ ref('family_emails') }}
where member in (
    select member from {{ ref('family_members') }} where role = 'parent'
)

The DAG:

   family_members (seed)
        ├──→ family_emails
        │         │
        │         └──→ parent_emails
        │                    ▲
        │                    │
        └────────────────────┘

Build everything downstream of family_members:

dbt build -s family_members+

dbt builds family_members (if it's a seed it just loads CSV), then family_emails, then parent_emails. You wrote zero ordering logic.


Hands-On Exercise

  1. Create a model models/staging/active_family_members.sql that selects only role in ('parent', 'child') from {{ ref('family_members') }}.

  2. Create a model models/marts/family_email_domains.sql that selects member, split(email, '@')[offset(1)] as domain from {{ ref('active_family_members') }}.

  3. Build only the new models:

    dbt build -s active_family_members+
    
  4. Open the lineage view (dbt Cloud or Power User) and confirm the shape: family_members → active_family_members → family_email_domains.

  5. Make a tiny edit to active_family_members.sql. Then run:

    dbt build -s state:modified+
    

    dbt rebuilds the changed model and everything downstream of it. That's impact-analysis-as-a-command.


Summary

Concept Key takeaway
Model A .sql file with one select statement; becomes a table or view
ref() Reference another model — resolves to a real table name + records a dep
source() Reference a raw table outside dbt; declared in a YAML file
DAG The graph dbt builds from every ref() / source() call
Run ordering Falls out of the DAG automatically — you don't specify it
Selectors -s model+ / +model / tag:foo / state:modified+ for targeted runs

Next Lesson

ref() and the DAG explain what gets built and in what order. The next thing to learn is how to structure the SQL inside a model — and in dbt that almost always means CTEs. Head to 2.3 CTEs — the building blocks of dbt models.