Prerequisite — Ingesting source data with os-nexus

The standard ingestion path for nexus projects — os-nexus uses Nango under the hood to land raw Gmail, Calendar, and Notion data into BigQuery so the rest of Module 3 has real data to read.

Module 2 used seed data to keep things moving without needing real source data. Module 3 is where actual source data starts to matter, so this lesson covers the ingestion side before we get into nexus itself.

Learning Objectives

By the end of this lesson, you will be able to:

  • Describe what os-nexus is and what role it plays in the pipeline
  • Connect Gmail, Google Calendar, and Notion via Nango OAuth
  • Configure os-nexus to write into your BigQuery project (it creates per-source datasets automatically)
  • Recognize when to use os-nexus vs. an alternative ingestion path

Where ingestion fits

Recall the high-level flow:

   Source APIs  ─→  os-nexus  ─→  raw tables in BigQuery  ─→  dbt + dbt-nexus  ─→  modeled outputs
   (Gmail,       (this lesson)    (gmail, google_calendar, notion     (Modules 1 & 2)
    Calendar,                      datasets, created automatically on
    Notion)                        first sync)

os-nexus is the ingestion half — the part that goes out to the APIs, pulls data on a schedule, and writes it into BigQuery. dbt-nexus (Module 3) is the transformation half — the part that turns the raw landed data into resolved entities and a unified event log.

They're separate tools. The course's job is teaching you dbt-nexus. Ingestion has to happen somehow — os-nexus is the standard way nexus clients do it, and that's what we'll set up here.


What is os-nexus?

os-nexus ("operating system nexus") is the ingestion pipeline that ships with the Nexus platform. It:

  • Authenticates to source APIs (Gmail, Google Calendar, Notion, etc.) via Nango, an OAuth proxy
  • Pulls data on a configurable sync schedule (full refresh, incremental, or webhook-driven)
  • Writes raw API records into BigQuery, one dataset per source, with the dataset named after the source (gmail, google_calendar, notion)
  • Standardizes how records land — every table has the same set of underscore-prefixed metadata columns (_raw_record, _ingested_at, _connection_id, etc.) plus the source-specific tables underneath

Under the hood it's a Convex-based ETL service that lives in the app/ directory of the main Nexus monorepo.

For a deeper reference on os-nexus and the template dbt sources that consume it, see Template Sources in the dbt-nexus docs.


Step 1: Get access to a Nexus app instance

You have two options:

  1. Self-host — clone the Nexus monorepo, deploy Convex, deploy Nango, point os-nexus at your BigQuery project. This is the path most internal teams take.

  2. Use a managed Nexus instance — if your organization already runs a Nexus app, ask whoever administers it to add your BigQuery project as a connected warehouse.

For the Doe family scenario (a single household), the simplest path is the managed instance. For a real production deployment, self-hosting gives you full control.


Step 2: Connect your BigQuery project as a warehouse

In the Nexus app, the warehouse config lives at app/convex/config/warehouses.ts. A new warehouse entry looks like:

export const warehouses = [
  {
    id: 'doe-family-bq',
    type: 'bigquery',
    projectId: 'doe-family-dwh',
    serviceAccountKey: process.env.DOE_FAMILY_BQ_KEY,
  },
]

The serviceAccountKey references the JSON key you created in lesson 0.1 (typically loaded from an environment variable rather than committed). Each source's data will land in its own BigQuery dataset named after the source (gmail, google_calendar, notion) — see clients/slide-rule-tech/models/sources/notion/schema.yml for a real example of how the dbt project then references it.


Step 3: Connect each source via Nango

For each source (Gmail, Google Calendar, Notion):

  1. In the Nexus app UI, open the Connections view
  2. Click Add connection → select the integration (Gmail / Google Calendar / Notion)
  3. Nango walks you through OAuth in your browser
  4. The destination dataset is created automatically and named after the source (gmail for the Gmail connection, google_calendar for Calendar, notion for Notion) — you don't pick or pre-create it
  5. Choose the sync schedule (e.g., "every 6 hours" for the Doe family; a real production deployment might do hourly)

You'll repeat this three times — once per source.


Step 4: Trigger an initial sync

Once each connection is configured, trigger a manual sync to backfill data. From the connection view, click Sync now, or via the API:

POST https://<your-nexus-app>/nango/sync
Content-Type: application/json

{ "connectionId": "doe-family-gmail", "triggerSync": true }

The first sync may take a while depending on how much historical data you're pulling. Subsequent syncs are incremental.


Step 5: Verify the raw tables landed

Open the BigQuery console and inspect each dataset:

-- What landed in the gmail dataset
select table_name, row_count
from `doe-family-dwh.gmail.__TABLES__`
order by row_count desc;

You should see tables like messages, recipients, labels populated with rows. Similar tables show up in google_calendar and notion.

If you don't see datasets or tables yet, check the Nexus app's sync logs — the most common issue is the service account lacking the BigQuery Data Owner role (needed for first-run dataset creation) or BigQuery Data Editor (needed to write rows).


Schema you can expect

Every os-nexus table follows a similar shape — the source's own payload, plus a fixed set of underscore-prefixed metadata columns:

Column Type Notes
_raw_record STRING / JSON Full source API response, as JSON
_ingested_at TIMESTAMP When the row was written by the sync
_connection_id STRING Which Nango provider connection produced it
_stream_id STRING Stream identifier (often the source object ID)
_sync_id STRING ID of the sync run that produced this row

(See clients/slide-rule-tech/models/sources/notion/schema.yml for a real example of how these are documented.)

You'll typically write small staging models in your dbt project that extract typed fields out of _raw_record (JSON path extraction). We don't dwell on staging in this course; lesson 3.4 jumps straight to the nexus-shaped models that consume those flattened tables.


When to use something other than os-nexus

os-nexus is the standard choice, but it's not the only one. Other paths people take:

Ingestion option When it's a good fit
os-nexus You want the canonical nexus pipeline; you have or can run Convex/Nango
Fivetran You want a fully managed SaaS, willing to pay per-row
Airbyte You want open-source + connector breadth, willing to operate it
Custom Cloud Functions You only need one or two sources, want full control
Cloud Composer / Airflow You already operate Airflow and want to fit nexus into it

All of them produce the same end result for the course: raw tables in your BigQuery datasets. From dbt's point of view, the source is just "some table that exists" — it doesn't care how it got there.


Hands-On Exercise

Pick one of these depending on your situation:

  1. You have a Nexus app instance — connect Gmail, Calendar, and Notion through it, trigger initial syncs, and verify the automatically-created gmail, google_calendar, and notion datasets appear in BigQuery with rows in them.

  2. You don't have a Nexus instance and don't want to set one up — use Fivetran's 14-day free trial, or write a one-off Python script to load a small sample of your real data into datasets named gmail, google_calendar, and notion. The rest of the course works regardless of how the data got there.

  3. You're using made-up sample data — manually create the gmail, google_calendar, and notion datasets and populate a few rows in each that match the schema lesson 3.4 expects (messages.message_id, messages.sender_email, etc.). This is the lowest-investment way to follow the rest of the course.


Summary

Concept Key takeaway
os-nexus The standard ingestion pipeline for nexus projects; lives in app/
Nango OAuth proxy os-nexus uses to authenticate to source APIs
What it produces Raw API records in BigQuery, one row per source object
Standard table shape _raw_record (JSON) + _ingested_at, _connection_id, _stream_id, _sync_id
Where it fits Between source APIs and dbt — produces the tables Module 3 reads from
Alternatives Fivetran, Airbyte, custom Cloud Functions — all produce the same end state

Next Lesson

With raw data landing in your BigQuery datasets, you can start building the nexus side of the project. Head to 3.2 Introduction to dbt-nexus.