Seeding the Doe family data
Use dbt seeds to load the Doe family roster from a CSV — a shortcut to get real data in the warehouse before we set up ingestion in Module 3.
Learning Objectives
By the end of this lesson, you will be able to:
- Explain what dbt seeds are and when to use them
- Load a CSV into BigQuery with dbt seed
- Reference a seed from a model with {{ ref('seed_name') }}
- Understand why we're starting with seeds instead of ingested source data
Why start with seeds?
You're about to spend the next several lessons learning the core dbt
concepts — ref(), the DAG, materializations, Jinja, macros. Each of
those lessons builds and runs real models in your warehouse.
But there's a chicken-and-egg problem: you don't have any real source data yet. Ingestion (Gmail, Calendar, Notion via os-nexus) is the prerequisite for Module 3, not this one: it requires standing up the Nexus app, configuring Nango, and running initial syncs. We'll get to it, but not before you understand what dbt does.
Seeds are the shortcut. dbt lets you check small CSV files into your project and load them into the warehouse with a single command. They give you real, queryable data to practice against — without any ingestion infrastructure. Once you have ingested data later, seeds keep their place as the canonical home for small, hand-maintained reference tables.
The seed you'll create here — a five-row roster of Doe family members —
gets used in every example and practicum throughout the rest of this
module, and is the foundation of the is_family_member computed trait
in Module 3.
What seeds are good for
Seeds are CSVs in your dbt project. dbt seed loads each one into the
warehouse as a table.
Use seeds for:
- Reference data — country codes, role labels, status mappings
- Hand-curated lookups — your contacts' mailing addresses, family member roster, manually labelled training data
- Small fixtures — sample data for development before ingestion is wired up (← this lesson)
Don't use seeds for:
- Large datasets — checking a 100 MB CSV into Git is a bad idea
- Frequently changing data — every change is a commit
- Anything that should come from an API — that's what ingestion is for
The Doe family roster is a perfect fit: it changes maybe once a year when someone moves out, and five rows in Git is exactly the right size.
Step 1: Create the project structure
If you don't already have the standard layout from earlier lessons:
mkdir -p models/sources models/staging models/marts seeds
Your project root should now contain seeds/ alongside models/.
Step 2: Add the family_members.csv seed
Create seeds/family_members.csv:
member,email,role
Jane,jane@doefamily.example,parent
John,john@doefamily.example,parent
Jack,jack@doefamily.example,child
Joe,joe@doefamily.example,child
Julie,julie@doefamily.example,child
A few conventions worth noting:
- Lowercase, snake_case column headers — same naming style as your models, easier to write SQL against
- One row per family member, with a meaningful business key (member in this case)
- role lets us distinguish parents from kids later in queries
Step 3: Load the seed
From your project root:
dbt seed
That loads every CSV in seeds/ into your dev dataset (doe_family_dev).
You should see one line per seed in the output, similar to:
1 of 1 OK loaded seed file doe_family_dev.family_members ... [INSERT 5 in 0.42s]
To load a specific seed:
dbt seed -s family_members
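To rebuild the table from scratch (for example, after changing the CSV's columns):
dbt seed -s family_members --full-refresh
Without --full-refresh, dbt does not merge changes incrementally: it
reloads all of the seed's rows into the existing table, which is enough
when you have only edited row values. Adding --full-refresh drops and
recreates the table, which you need when the CSV's columns or their
types have changed.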
Step 4: Verify in BigQuery
Open the BigQuery console, navigate to doe-family-dwh.doe_family_dev,
and you should see a family_members table with five rows.
Or query it directly:
select * from `doe-family-dwh`.`doe_family_dev`.`family_members`
order by role, member;
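Beyond eyeballing the rows, a single aggregate query can confirm that the load matches the CSV. This is a minimal sketch that assumes the same project and dataset names used above; the output column aliases are illustrative, and the expected values come straight from the five-row CSV in Step 2.

```sql
-- Sanity check for the loaded seed (BigQuery SQL).
-- Expected values are derived from seeds/family_members.csv.
select
    count(*)                 as row_count,         -- 5 rows in the CSV
    count(distinct member)   as distinct_members,  -- 5 if member is a unique key
    countif(role = 'parent') as parents,           -- Jane and John
    countif(role = 'child')  as children           -- Jack, Joe, and Julie
from `doe-family-dwh`.`doe_family_dev`.`family_members`;
```

If row_count and distinct_members differ, the member column contains duplicates and is not safe to use as a join key.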
Step 5: Reference the seed from a model
Seeds work with {{ ref(...) }} just like models do. Create
models/marts/parent_emails.sql:
select email
from {{ ref('family_members') }}
where role = 'parent'
Build it:
dbt build -s parent_emails
Confirm in BigQuery: the model returns two rows, the email addresses for Jane and John.
This is the entire reason seeds are powerful: as far as dbt is
concerned, a seed is just a model that happens to be a CSV. It
participates in the DAG, gets ref()'d the same way, and downstream
models depend on it. We'll lean on this exact pattern for the rest of
the module.
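To see the same pattern with a slightly different shape, here is a sketch of a second model that aggregates the seed. The file name family_role_counts.sql and the member_count alias are hypothetical and not part of the course project; only the seed name family_members comes from this lesson.

```sql
-- models/marts/family_role_counts.sql (hypothetical model name)
-- ref() resolves to the seed's table in the target dataset and
-- records the seed as an upstream dependency of this model.
select
    role,
    count(*) as member_count
from {{ ref('family_members') }}
group by role
```

Built against the five-row roster, this should return one row per role: 2 for parent and 3 for child. Because the dependency is recorded in the DAG, selecting the model together with its ancestors (dbt build -s +family_role_counts) would also load the seed first.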
Configuring seed columns (optional but useful)
By default dbt infers column types from the CSV. Sometimes the
inference is wrong (e.g., a column of all '01', '02' gets read as
integers, losing the leading zeros). You can pin column types in
dbt_project.yml:
seeds:
  doe_family:
    family_members:
      +column_types:
        member: string
        email: string
        role: string
For five rows of obvious strings this is overkill, but it's the move you'll want for any seed where type inference might misbehave.
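One way to check which types a seed actually ended up with is to query BigQuery's INFORMATION_SCHEMA. The sketch below assumes the project and dataset names used earlier in this lesson.

```sql
-- List the column types of the loaded seed table (BigQuery SQL).
select
    column_name,
    data_type
from `doe-family-dwh`.`doe_family_dev`.INFORMATION_SCHEMA.COLUMNS
where table_name = 'family_members'
order by ordinal_position;
```

For this seed, all three columns should report STRING. If you change +column_types for a seed that is already loaded, rerun it with --full-refresh so the table is recreated with the new types.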
Hands-On Exercise
- Add the family_members.csv seed shown above.
- Run dbt seed and confirm it loaded.
- Query the resulting family_members table in BigQuery.
- Build the parent_emails model and confirm it returns two rows.
- Add a second seed seeds/family_pets.csv with at least three rows (your call on the columns: Fluffy the cat, Rex the dog, etc.). Run dbt seed -s family_pets and verify the table.
- Create a tiny model models/marts/who_has_pets.sql that joins family_members to family_pets on a column you defined (e.g., owner matching member). Build it.
Summary
| Concept | Key takeaway |
|---|---|
| Seeds | CSV files in seeds/ that dbt loads as tables |
| When to use | Reference data, hand-curated lookups, small dev fixtures |
| Why start here | Real, queryable data without needing ingestion infrastructure |
| Loading | dbt seed (all) or dbt seed -s name (one) |
| Reloading | Rerun dbt seed after editing rows; add --full-refresh when columns or types change |
| Reference syntax | {{ ref('seed_name') }} — same as a model |
| Column types | Configure in dbt_project.yml if type inference is wrong |
| Doe family seed | family_members.csv — used in every example for the rest of Module 2 |
Next Lesson
You have data in the warehouse and you can reference it from a model. Time to dig into how dbt actually thinks about models, dependencies, and the build graph. Head to 2.2 Creating models, refs, and lineage.