Metric Definitions Should Be Maintained in dbt
Your data dictionary (or semantic layer) should live in dbt — defined once, compiled to metadata on every build, and never maintained in a separate tool.
Browse a live demo of the data dictionaries we build for clients →
The data team spends hundreds of hours (and thousands of dollars) turning your raw data into useful tables in the data warehouse so business users and AI tools can understand it.
But then, you ask a simple question like "what was our revenue last month?" and you're met with blank stares.
Even worse, your daily Looker Studio dashboard reports $100k while your "governance" Power BI report shows $80k. Meanwhile, your new AI chatbot reports $500k.
This is the classic "metric definition" problem. Your tables can be perfectly accurate, recording every invoice and payment, but you haven't defined what "revenue" is.
That's the problem data dictionaries, or semantic layers, are supposed to solve. You add standard, plain-language business definitions like "revenue = the amount column on Stripe invoices" and back them with SQL so any dashboard, analyst, or LLM can use them.
But where should those business definitions live? A new tool? An Excel spreadsheet? Confluence? Notion?
No. Semantic definitions should live in dbt.
- Separate tools will never stay up to date.
- Metric definitions are intimately tied to table transformations. You need to build both at the same time. The person building the table already has the excruciating details needed to define its metrics.
- AI tools don't need a constrained query API like those provided by semantic layer tools. They need rich definitions that tell them what metrics mean and how to compute them. That should live right next to the tables that will be queried. Not piped through another tool.
Here's why, and how we do this with dbt-nexus.
What Semantic Layer Tools Actually Do
Semantic layer tools like Cube Cloud, Snowflake Cortex, and LookML handle five things:
- Define which tables exist and how they join
- Declare dimensions with types
- Declare metrics with aggregation rules
- Translate business questions into deterministic SQL
- Expose APIs, pre-aggregations, documentation
Here's the thing: responsibilities 1, 2, and 3 are transformation-layer knowledge. The engineer building the model already knows this — they just wrote it. They know the join keys. They know the column types. They know the metric formula because they built the table it references.
Only 4 and 5 are genuinely query-layer concerns — translating a question into SQL, caching the results, enforcing access controls.
The traditional semantic layer asks the engineer to declare transformation-layer knowledge a second time in a query-layer tool. That's the root of the problem.
Constraining Outputs vs. Informing Outputs
Traditional semantic layers exist to constrain. They provide a query API that only allows valid queries. This was essential when the primary consumers were BI tools — drag-and-drop interfaces that needed guardrails to prevent analysts from writing nonsensical SQL.
LLMs change this equation.
An LLM can write arbitrary SQL given good context. It doesn't need a constrained
API — it needs a rich catalog: what metrics exist, how they're computed, which
filters apply, what the dimension values look like. Given the right context, an
LLM can compute MEDIAN(daily_conversion_rate) — a nested aggregation that many
semantic layers can't even express — because the metric definition told it
everything it needs to know.
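A minimal sketch of that nested aggregation, using a toy events table in SQLite (the schema and column names here are assumptions for illustration, not the actual nexus tables): the inner aggregation runs in SQL, one conversion rate per day, and the outer median runs over the per-day results.

```python
import sqlite3
import statistics

# Toy stand-in for an events table. Columns are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (occurred_at TEXT, person_id TEXT, converted INTEGER)")
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01", "p1", 1), ("2024-01-01", "p2", 0),
        ("2024-01-02", "p1", 1), ("2024-01-02", "p2", 1),
        ("2024-01-03", "p1", 0), ("2024-01-03", "p2", 0), ("2024-01-03", "p3", 1),
    ],
)

# Inner aggregation in SQL: one conversion rate per day.
daily_rates = [
    row[0]
    for row in db.execute("""
        SELECT CAST(SUM(converted) AS REAL) / COUNT(DISTINCT person_id)
        FROM events
        GROUP BY occurred_at
        ORDER BY occurred_at
    """)
]

# Outer aggregation: the median over those daily rates.
print(statistics.median(daily_rates))  # 0.5
```

Nothing here requires a query API; it only requires knowing what the metric means and which table holds the events.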
The shift is from constraining outputs to informing outputs.
This changes what the semantic layer's primary job is. It's no longer about query enforcement. It's about discoverability and documentation — a catalog that helps LLMs, analysts, and tools understand what's available and how to use it.
And that knowledge already lives in the transformation layer. It lives in the model that defines the join, the intermediate model that encodes the business logic, the YAML that names the metric. The question is whether you extract it automatically or re-type it by hand.
The Transformation Layer Is the Context Layer
The industry is converging on this realization from a different angle. Andreessen Horowitz recently published "Your Data Agents Need Context", arguing that data agents are "essentially useless without the right context" and that a new "context layer" is needed — a maintained corpus of business definitions, data source mappings, and tribal knowledge that agents can query in real time.
Their diagnosis is spot on. They describe the exact failure mode: an organization builds a data agent, asks it "what was revenue growth last quarter?", and the agent can't answer because it doesn't know how revenue is defined, which tables are canonical, or what "last quarter" means in the company's fiscal calendar. The team checks the semantic layer and finds it was last updated by someone who left a year ago. Sound familiar?
a16z frames the solution as a "modern context layer" — a superset of the semantic layer that includes canonical entities, identity resolution, business definitions, and operational instructions. They describe a multi-step process: automated context construction, human refinement, agent connection, and self-updating flows.
We agree with the framing but differ on where it should live. The a16z piece asks "where will this context layer live? Can it live in multiple places? Will it be its own standalone product?" Our answer: it should live in the transformation layer, where the knowledge originates.
Every property they describe — business definitions, canonical entities, identity resolution, data source mappings — is transformation-layer knowledge. The engineer building the dbt model is the one who knows how revenue is defined, which table is the source of truth, and how entities resolve across sources. When that knowledge is captured as YAML configuration and compiled into metadata tables on every build, the transformation layer becomes the context layer. It's automated, it's always current, and it doesn't require a separate product to maintain.
The context layer isn't a new thing to build. It's something you already have — if you capture the knowledge at the point where the engineer has maximum context.
Define Once, Compile to Metadata
Here's how dbt-nexus handles this.
Metrics: YAML in, metadata table out
`nexus_event_measurements`: 4 metrics (2 measure, 1 count, 1 derived)

| Metric | Description | Format | Tags |
|---|---|---|---|
| Revenue (measure `revenue`) | Platform revenue on paid bookings. | Currency (USD) | finance, kpi |
| Total Bookings (count `total_bookings`) | Total number of confirmed bookings. | Integer | operations |
| Avg Revenue / Booking (derived `average_revenue_per_booking`) | Average revenue earned per booking. Derived from revenue divided by total bookings. | Currency (USD) | finance |
| Refunds (measure `refund_amount`) | Total refund amounts issued. | Currency (USD) | finance |
Metrics are defined in dbt_project.yml, right alongside the models they
reference:
```yaml
vars:
  nexus:
    metrics:
      nexus_event_measurements:
        - name: revenue
          label: Revenue
          type: measure
          tables: [nexus_event_measurements, nexus_event_dimensions]
          metric_sql: nexus_event_measurements.revenue
          filter:
            - nexus_event_dimensions.is_revenue_earned = true
          format: currency
          unit: USD
          description: >
            Platform revenue on paid bookings.
          example_questions:
            - What was our revenue last month?
            - What is average revenue per booking?
```
The engineer defining this metric has the model open. They know the column name. They know the filter. They know the join path. They're not re-declaring knowledge — they're annotating it at the source.
On every dbt build, a Jinja model
called nexus_metrics_metadata iterates
over these YAML definitions and compiles them into a queryable table:
```sql
SELECT * FROM nexus_metrics_metadata
```
| model | metric_name | metric_type | metric_sql | filter | description |
|---|---|---|---|---|---|
| nexus_event_measurements | revenue | measure | nexus_event_measurements.revenue | nexus_event_dimensions.is_revenue_earned… | Platform revenue on paid bookings. |
| nexus_events | conversion_rate | derived | converting_persons / NULLIF(total_persons, 0) | — | Fraction of persons who converted. |
This table is the semantic layer catalog. It's rebuilt on every dbt build. It
can never drift from the underlying models because it's generated from the same
configuration that defines them.
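The real compile step is a Jinja model inside dbt, but the shape of the transformation is easy to picture. This Python sketch (with simplified, assumed field names) shows the same flattening: nested YAML config in, one flat metadata row per metric out.

```python
# Simplified stand-in for the nexus_metrics_metadata compile step.
# The config mirrors what would live under vars.nexus.metrics.
metrics_config = {
    "nexus_event_measurements": [
        {
            "name": "revenue",
            "type": "measure",
            "metric_sql": "nexus_event_measurements.revenue",
            "filter": ["nexus_event_dimensions.is_revenue_earned = true"],
            "description": "Platform revenue on paid bookings.",
        },
    ],
}

def compile_metadata_rows(config):
    """Flatten nested metric definitions into one row per metric."""
    rows = []
    for model, metrics in config.items():
        for m in metrics:
            rows.append({
                "model": model,
                "metric_name": m["name"],
                "metric_type": m["type"],
                "metric_sql": m["metric_sql"],
                "filter": " AND ".join(m.get("filter", [])),
                "description": m.get("description", ""),
            })
    return rows

rows = compile_metadata_rows(metrics_config)
print(rows[0]["model"], rows[0]["metric_name"])
```

Because the rows are generated from the same config that defines the models, there is nothing to keep in sync by hand.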
Dimensions: discovered, not declared
| Dimension | Description | Examples | Sources | Type | Distinct |
|---|---|---|---|---|---|
| Business Unit (`business_unit`) | Operating division of the company. | dtc, partner, wholesale | stripe, salesforce | string | 3 |
| Channel (`channel`) | Marketing acquisition channel. | organic, paid_search, referral | ga4, facebook_ads | string | 8 |
| Is Revenue Earned (`is_revenue_earned`) | Whether the revenue has been recognized per accrual rules. | true, false | stripe | boolean | 2 |
| Booking Type (`booking_type`) | Classification of the booking tier. | standard, premium, vip | stripe, internal_crm | string | 3 |
| Region (`region`) | Geographic region derived from billing address. | us_east, us_west, europe | stripe, salesforce | string | 5 |
Traditional semantic layers require you to declare every dimension. dbt-nexus discovers them automatically.
Every dimension_name that appears in the warehouse's EAV dimension table shows
up in nexus_event_dimensions_metadata — with warehouse-derived statistics, the
top three example values by frequency, and whether the dimension is boolean or
string-typed. No declaration step required.
You can enrich dimensions with YAML — adding descriptions, aliases, tags, and example questions — but the dimension exists in the catalog the moment data flows through it.
```sql
SELECT
  dimension_name,
  dimension_type,
  description,
  example_value_1,
  example_value_2,
  example_value_3,
  distinct_values
FROM nexus_event_dimensions_metadata
WHERE source IS NOT NULL
ORDER BY dimension_name
```
An LLM querying this table sees that business_unit has example values 'dtc',
'partner', 'wholesale' and knows exactly what filter values to use — without
the engineer having to declare them in a separate config.
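A sketch of how this kind of discovery can work, using a toy EAV table in SQLite (the table and column names are assumptions for illustration, not the exact nexus schema): distinct counts, top example values, and a boolean-vs-string type are all derivable from the data itself.

```python
import sqlite3
from collections import Counter

# Toy EAV-style dimensions table. Names are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE event_dimensions "
    "(event_id TEXT, dimension_name TEXT, dimension_value TEXT, source TEXT)"
)
db.executemany("INSERT INTO event_dimensions VALUES (?, ?, ?, ?)", [
    ("e1", "business_unit", "dtc", "stripe"),
    ("e2", "business_unit", "dtc", "stripe"),
    ("e3", "business_unit", "partner", "salesforce"),
    ("e4", "business_unit", "wholesale", "salesforce"),
    ("e5", "is_revenue_earned", "true", "stripe"),
    ("e6", "is_revenue_earned", "false", "stripe"),
])

def discover_dimensions(conn):
    """Derive per-dimension stats: distinct count, example values, inferred type."""
    counts_by_dim = {}
    for name, value in conn.execute(
        "SELECT dimension_name, dimension_value FROM event_dimensions"
    ):
        counts_by_dim.setdefault(name, Counter())[value] += 1
    catalog = {}
    for name, counts in counts_by_dim.items():
        values = set(counts)
        catalog[name] = {
            "distinct_values": len(values),
            "example_values": [v for v, _ in counts.most_common(3)],
            "dimension_type": "boolean" if values <= {"true", "false"} else "string",
        }
    return catalog

catalog = discover_dimensions(db)
print(catalog["business_unit"]["example_values"])  # most frequent first
```

The moment a new dimension_name flows through the table, it appears in the catalog with statistics attached; declaration is optional enrichment, not a prerequisite.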
Join paths: fixed by schema
In the nexus compressed schema, every facet (measurements, dimensions,
identifiers, traits) joins to events on event_id. There's one join path. It
doesn't need to be re-declared in a semantic layer config because the schema
is the join definition.
What You Get for Free
**Revenue** (measure `revenue` on `nexus_event_measurements`)

Overview

- Description: Platform revenue on paid bookings.
- Example questions: What was our revenue last month? / What is average revenue per booking? / Revenue by business unit?
- Also known as: total_revenue, gross_revenue
- Format: Currency (USD)
- Polarity: higher is better
- Precision: 0 decimal places

Technical

- Table dependencies: nexus_event_measurements, nexus_event_dimensions
- Filters: nexus_event_dimensions.is_revenue_earned = true
- SQL: nexus_event_measurements.revenue

Recent trend

- 7-day total: $31,500
- Change: 12.3%

When definitions live in the transformation layer, the metadata tables become the single catalog that downstream tools consume. Here's what falls out naturally:
Time-series previews. The metadata table knows which model a metric lives
on. The model has occurred_at. That's enough to generate sparklines and trend
charts — no query API needed.
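For instance, a trend query can be templated straight from a metadata row. This sketch assumes illustrative field names and a fixed join to the dimensions table on event_id; it is not the actual dbt-nexus template.

```python
# A compiled metadata row for a metric (illustrative field names).
metric = {
    "model": "nexus_event_measurements",
    "metric_sql": "nexus_event_measurements.revenue",
    "filter": "nexus_event_dimensions.is_revenue_earned = true",
}

def trend_query(metric, days=7):
    """Template a daily trend query from a metric metadata row."""
    where = f"WHERE {metric['filter']}" if metric.get("filter") else ""
    return (
        f"SELECT DATE(occurred_at) AS day, SUM({metric['metric_sql']}) AS value\n"
        f"FROM {metric['model']}\n"
        f"JOIN nexus_event_dimensions USING (event_id)\n"
        f"{where}\n"
        f"GROUP BY 1 ORDER BY 1 DESC LIMIT {days}"
    )

print(trend_query(metric))
```

Everything the query needs, the metric's SQL expression, its filter, and its home model, is already in the catalog row.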
Example values. The dimension metadata table derives the top three values by frequency per dimension per source, directly from the warehouse. An LLM or analyst sees concrete examples without querying the underlying data.
PR-level visibility. When a table transformation changes, the metric definitions that depend on it are in the same repo — often the same file. A single GitHub PR shows the table change and the metric update side by side. Reviewers see the full picture. Nothing silently drifts in a separate tool that nobody remembers to update.
Bidirectional traceability. Artifacts (reports, dashboards) reference metric names. The metric definition links to the models it depends on. Click a metric in a report and you land on its definition. Click the definition and you see its SQL, filters, and table dependencies.
Schema generation. The metric definitions are structured and precise enough
to deterministically generate Cube schemas,
MetricFlow YAML,
LookML views, or
any other tool-native format. A measure type maps to a tool's native measure.
A count maps to count-distinct. A derived maps to a calculated metric. The
semantic layer tool becomes a deployment target, not a source of truth.
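A sketch of that mapping for a Cube-style target. The output field names here approximate Cube's schema and are assumptions for illustration, not its exact spec; the point is that the mapping is a deterministic function of the catalog row.

```python
# Map catalog metric types to a Cube-style measure type (assumed mapping).
TYPE_MAP = {"measure": "sum", "count": "count_distinct", "derived": "number"}

def to_cube_measure(metric):
    """Generate a Cube-style measure dict from a catalog metric row."""
    measure = {
        "name": metric["metric_name"],
        "type": TYPE_MAP[metric["metric_type"]],
        "sql": metric["metric_sql"],
        "description": metric.get("description", ""),
    }
    if metric.get("filter"):
        measure["filters"] = [{"sql": metric["filter"]}]
    return measure

revenue = {
    "metric_name": "revenue",
    "metric_type": "measure",
    "metric_sql": "nexus_event_measurements.revenue",
    "filter": "nexus_event_dimensions.is_revenue_earned = true",
    "description": "Platform revenue on paid bookings.",
}
print(to_cube_measure(revenue)["type"])  # sum
```

Swapping the target tool means swapping this generator, not re-authoring the definitions.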
What About Query Translation and Caching?
This isn't an argument against semantic layer tools. It's an argument about where definitions live.
Because dbt-nexus metric definitions are structured and precise — with explicit SQL expressions, table dependencies, filters, and types — they contain everything needed to auto-generate a semantic layer schema for whatever tool you pick. Switching from Cube to MetricFlow, or from LookML to a custom query engine, is a code generation task, not a re-modeling task. The transformation layer stays the source of truth. The semantic layer API becomes an execution tool — interchangeable, replaceable, and never the place where definitions are authored.
Deterministic SQL generation. For production dashboards that need guaranteed-correct queries, a semantic layer query API is valuable. But the schemas that power it can be generated from the same YAML definitions that populate the metadata table. One definition, two outputs.
Pre-aggregation and caching. Materialized rollups are a performance optimization — a deployment concern, not a modeling concern. Tools like Cube handle this well. The rollup config can be generated alongside the schema.
Access control. Row-level and column-level security are enforcement mechanisms. The policies can be generated from the same metadata that describes what the data is. The transformation layer says "this is revenue"; the deployment layer says "only these roles can see it."
The position is simple: use semantic layer tools for execution. Use the transformation layer for definitions.
The Practical Test
When you add a new metric, how many files do you edit?
If the answer is one YAML block in your dbt project — and the metadata table, the browsable catalog, the LLM context, and (optionally) the Cube schema all update on the next build — then definitions live in the right place.
If the answer is one dbt model, one Cube schema file, one LookML view, and a dashboard config, you have a duplication problem that will eventually cause drift. Not because anyone is careless, but because maintaining the same knowledge in multiple places is inherently fragile.
Conclusion
The semantic layer is not a tool. It's a set of responsibilities.
Some belong in the transformation layer: metric definitions, dimension discovery, join paths, qualitative metadata. These are things the engineer knows at the moment they write the model. Capturing them there — in YAML, in the same PR, compiled to metadata on every build — eliminates the drift problem entirely.
Others belong in the query layer: SQL generation, caching, access control, pre-aggregation. These are execution concerns that semantic layer tools handle well. But they should consume definitions, not own them.
When you keep definitions in the transformation layer and generate everything else, you get a semantic layer that's always in sync, discoverable by default, and tool-agnostic.
The best semantic layer is one you never have to maintain.
For the full technical reference on how dbt-nexus metric definitions work, see the Metrics documentation.