Seeding the Doe family data

Use dbt seeds to load the Doe family roster from a CSV — a shortcut to get real data in the warehouse before we set up ingestion in Module 3.

Learning Objectives

By the end of this lesson, you will be able to:

  • Explain what dbt seeds are and when to use them
  • Load a CSV into BigQuery with dbt seed
  • Reference a seed from a model with {{ ref('seed_name') }}
  • Understand why we're starting with seeds instead of ingested source data

Why start with seeds?

You're about to spend the next several lessons learning the core dbt concepts — ref(), the DAG, materializations, Jinja, macros. Each of those lessons builds and runs real models in your warehouse.

But there's a chicken-and-egg problem: you don't have any real source data yet. Ingestion (Gmail, Calendar, Notion via os-nexus) is a prerequisite for Module 3, not for this module: it requires standing up the Nexus app, configuring Nango, and running initial syncs. We'll get to it, but not before you understand what dbt does.

Seeds are the shortcut. dbt lets you check small CSV files into your project and load them into the warehouse with a single command. They give you real, queryable data to practice against — without any ingestion infrastructure. Once you have ingested data later, seeds keep their place as the canonical home for small, hand-maintained reference tables.

The seed you'll create here — a five-row roster of Doe family members — gets used in every example and practicum throughout the rest of this module, and is the foundation of the is_family_member computed trait in Module 3.


What seeds are good for

Seeds are CSVs in your dbt project. dbt seed loads each one into the warehouse as a table.

Use seeds for:

  • Reference data — country codes, role labels, status mappings
  • Hand-curated lookups — your contacts' mailing addresses, family member roster, manually labelled training data
  • Small fixtures — sample data for development before ingestion is wired up (← this lesson)

Don't use seeds for:

  • Large datasets — checking a 100 MB CSV into Git is a bad idea
  • Frequently changing data — every change is a commit
  • Anything that should come from an API — that's what ingestion is for

The Doe family roster is a perfect fit: it changes maybe once a year when someone moves out, and five rows in Git is exactly the right size.


Step 1: Create the project structure

If you don't already have the standard layout from earlier lessons:

mkdir -p models/sources models/staging models/marts seeds

Your project root should now contain seeds/ alongside models/.


Step 2: Add the family_members.csv seed

Create seeds/family_members.csv:

member,email,role
Jane,jane@doefamily.example,parent
John,john@doefamily.example,parent
Jack,jack@doefamily.example,child
Joe,joe@doefamily.example,child
Julie,julie@doefamily.example,child

A few conventions worth noting:

  • Lowercase, snake_case column headers — same naming style as your models, easier to write SQL against
  • One row per family member, with a meaningful business key (member in this case)
  • role lets us distinguish parents from kids later in queries
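
These conventions are easy to check mechanically before loading. The sketch below is a hypothetical pre-flight script, not something dbt provides: it writes the roster from this step, then checks the header names and the uniqueness of the member key with standard Unix tools. It assumes a simple CSV with no quoted fields or embedded commas.

```shell
# Hypothetical pre-flight check for a seed CSV; not a dbt feature.
# Assumes plain CSV with no quoted fields or embedded commas.
mkdir -p seeds
csv=seeds/family_members.csv

# Write the roster exactly as shown in this step.
cat > "$csv" <<'EOF'
member,email,role
Jane,jane@doefamily.example,parent
John,john@doefamily.example,parent
Jack,jack@doefamily.example,child
Joe,joe@doefamily.example,child
Julie,julie@doefamily.example,child
EOF

# Collect header names that are not lowercase snake_case.
bad_headers=$(head -n 1 "$csv" | tr ',' '\n' | grep -Ev '^[a-z][a-z0-9_]*$' || true)

# Collect values that appear more than once in the key column (column 1).
dup_keys=$(tail -n +2 "$csv" | cut -d, -f1 | sort | uniq -d)

if [ -n "$bad_headers" ] || [ -n "$dup_keys" ]; then
  echo "seed check failed: bad headers [$bad_headers], duplicate keys [$dup_keys]" >&2
else
  echo "seed check passed: $(( $(wc -l < "$csv") - 1 )) data rows"
fi
```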

Step 3: Load the seed

From your project root:

dbt seed

That loads every CSV in seeds/ into your dev dataset (doe_family_dev). You should see one line per seed in the output:

1 of 1 OK loaded seed file doe_family_dev.family_members ... [INSERT 5 in 0.42s]

To load a specific seed:

dbt seed -s family_members

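To reload after editing rows in the CSV, run dbt seed again. If you changed the columns (added, removed, or renamed one), add --full-refresh:

dbt seed -s family_members --full-refresh

Without --full-refresh, dbt truncates the existing table and reinserts every row from the CSV; it does not merge incrementally. With --full-refresh, dbt drops and recreates the table, which is what lets a changed column layout take effect.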


Step 4: Verify in BigQuery

Open the BigQuery console, navigate to doe-family-dwh.doe_family_dev, and you should see a family_members table with five rows.

Or query it directly:

select * from `doe-family-dwh`.`doe_family_dev`.`family_members`
order by role, member;
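
If you prefer the terminal, the same check can be run with the bq command-line tool from the Google Cloud SDK. This is a sketch that assumes bq is installed and authenticated against the doe-family-dwh project; --use_legacy_sql=false selects standard SQL, which the backtick-quoted table name requires.

```shell
# Sketch: run the verification query from the terminal.
# Assumes the bq CLI (Google Cloud SDK) is installed and authenticated.
query='select * from `doe-family-dwh`.`doe_family_dev`.`family_members` order by role, member'

if command -v bq >/dev/null 2>&1; then
  # --use_legacy_sql=false switches bq to standard SQL.
  bq query --use_legacy_sql=false "$query"
else
  echo "bq not found; run the query in the BigQuery console instead"
fi
```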

Step 5: Reference the seed from a model

Seeds work with {{ ref(...) }} just like models do. Create models/marts/parent_emails.sql:

select email
from {{ ref('family_members') }}
where role = 'parent'

Build it:

dbt build -s parent_emails

Confirm in BigQuery: the parent_emails table should contain two rows, the email addresses for Jane and John.

This is what makes seeds useful: as far as downstream SQL is concerned, a seed behaves like a model whose rows happen to come from a CSV. It participates in the DAG, gets ref()'d the same way, and downstream models depend on it. We'll lean on this exact pattern for the rest of the module.
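
One way to see the seed in the DAG is to ask dbt for everything upstream of parent_emails. The sketch below uses dbt ls with the + graph operator, which selects a node plus all of its ancestors. Assuming the project parses cleanly, the listing should include both the family_members seed and the parent_emails model.

```shell
# Sketch: list parent_emails together with everything it depends on.
# A leading "+" in a selector means "this node and all of its ancestors".
selector='+parent_emails'

if command -v dbt >/dev/null 2>&1; then
  dbt ls -s "$selector"
else
  echo "dbt not found; activate the environment where dbt is installed"
fi
```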


Configuring seed columns (optional but useful)

By default dbt infers column types from the CSV. Sometimes the inference is wrong (e.g., a column of all '01', '02' gets read as integers, losing the leading zeros). You can pin column types in dbt_project.yml:

seeds:
  doe_family:
    family_members:
      +column_types:
        member: string
        email: string
        role: string

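For five rows of obvious strings this is overkill, but it's the move you'll want for any seed where type inference might misbehave. Note that the key directly under seeds: (doe_family here) must match the name of your project in dbt_project.yml.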
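
To see why leading zeros matter, the sketch below imitates the two outcomes with standard Unix tools rather than dbt itself. The zip_codes.csv file and its zip_code column are hypothetical, chosen only because the values start with zeros: awk's numeric coercion stands in for an integer-typed column, and cut stands in for a string-typed one.

```shell
# Illustration only: mimics integer vs. string typing outside dbt.
# zip_codes.csv and its columns are hypothetical.
scratch=$(mktemp -d)
cat > "$scratch/zip_codes.csv" <<'EOF'
member,zip_code
Jane,02134
John,00501
EOF

# Adding 0 forces awk to treat the field as a number, dropping leading zeros.
as_int=$(tail -n +2 "$scratch/zip_codes.csv" | awk -F, '{ print $2 + 0 }' | paste -sd, -)

# cut copies the field as text, so the zeros survive.
as_str=$(tail -n +2 "$scratch/zip_codes.csv" | cut -d, -f2 | paste -sd, -)

echo "read as integer: $as_int"
echo "read as string:  $as_str"
rm -r "$scratch"
```

Pinning zip_code to string under +column_types would be the dbt-side way to get the second result.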


Hands-On Exercise

  1. Add the family_members.csv seed shown above.

  2. Run dbt seed and confirm it loaded.

  3. Query the resulting family_members table in BigQuery.

  4. Build the parent_emails model and confirm it returns two rows.

  5. Add a second seed seeds/family_pets.csv with at least three rows (your call on the columns — Fluffy the cat, Rex the dog, etc.). Run dbt seed -s family_pets and verify the table.

  6. Create a tiny model models/marts/who_has_pets.sql that joins family_members to family_pets on a column you defined (e.g., owner matching member). Build it.


Summary

Concept            Key takeaway
Seeds              CSV files in seeds/ that dbt loads as tables
When to use        Reference data, hand-curated lookups, small dev fixtures
Why start here     Real, queryable data without needing ingestion infrastructure
Loading            dbt seed (all) or dbt seed -s name (one)
Reloading          Rerun dbt seed; add --full-refresh when the CSV's columns change
Reference syntax   {{ ref('seed_name') }}, same as a model
Column types       Configure in dbt_project.yml if type inference is wrong
Doe family seed    family_members.csv, used in every example for the rest of Module 2

Next Lesson

You have data in the warehouse and you can reference it from a model. Time to dig into how dbt actually thinks about models, dependencies, and the build graph. Head to 2.2 Creating models, refs, and lineage.