Metric Definitions Should Be Maintained in dbt
Your data dictionary (or semantic layer) should live in dbt — defined once, compiled to metadata on every build, and never maintained in a separate tool.
Browse a live demo of the data dictionaries we build for clients →
The data team spends hundreds of hours (and thousands of dollars) turning your raw data into useful tables in the data warehouse so business users and AI tools can understand it.
But then, you ask a simple question like "what was our revenue last month?" and you're met with blank stares.
Even worse, your daily Looker Studio dashboard reports $100k while your "governance" Power BI report shows $80k. Meanwhile, your new AI chatbot reports $500k.
This is the classic "metric definition" problem. Your tables can be perfectly accurate, recording every invoice and payment, but you haven't defined what "revenue" is.
That's the problem data dictionaries, or semantic layers, are supposed to solve. You add standard, plain-language business definitions like "revenue = the amount column on Stripe invoices" and back them with SQL so any dashboard, analyst, or LLM can use them.
But where should those business definitions live? A new tool? An Excel spreadsheet? Confluence? Notion?
No. Semantic definitions should live in dbt.
- Separate tools will never stay up to date.
- Metric definitions are intimately tied to table transformations. You need to build both at the same time. The person building the table already has the excruciating details needed to define its metrics.
- AI tools don't need a constrained query API like those provided by semantic layer tools. They need rich definitions that tell them what metrics mean and how to compute them. That should live right next to the tables that will be queried. Not piped through another tool.
Here's why, and how we do this with dbt-nexus.
What Semantic Layer Tools Actually Do
Semantic layer tools like Cube Cloud, Snowflake Cortex, and LookML handle five things:
- Define which tables exist and how they join
- Declare dimensions with types
- Declare metrics with aggregation rules
- Translate business questions into deterministic SQL
- Expose APIs, pre-aggregations, documentation
Here's the thing: responsibilities 1, 2, and 3 are transformation-layer knowledge. The engineer building the model already knows this — they just wrote it. They know the join keys. They know the column types. They know the metric formula because they built the table it references.
Only 4 and 5 are genuinely query-layer concerns — translating a question into SQL, caching the results, enforcing access controls.
The traditional semantic layer asks the engineer to declare transformation-layer knowledge a second time in a query-layer tool. That's the root of the problem.
Constraining Outputs vs. Informing Outputs
Traditional semantic layers exist to constrain. They provide a query API that only allows valid queries. This was essential when the primary consumers were BI tools — drag-and-drop interfaces that needed guardrails to prevent analysts from writing nonsensical SQL.
LLMs change this equation.
An LLM can write arbitrary SQL given good context. It doesn't need a constrained
API — it needs a rich catalog: what metrics exist, how they're computed, which
filters apply, what the dimension values look like. Given the right context, an
LLM can compute MEDIAN(daily_conversion_rate) — a nested aggregation that many
semantic layers can't even express — because the metric definition told it
everything it needs to know.
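A minimal sketch of that nested aggregation, using a toy events table in SQLite (the schema and column names here are assumptions for illustration, not the actual nexus tables): the inner aggregation runs in SQL, one conversion rate per day, and the outer median runs over the per-day results.

```python
import sqlite3
import statistics

# Toy stand-in for an events table. Columns are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (occurred_at TEXT, person_id TEXT, converted INTEGER)")
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01", "p1", 1), ("2024-01-01", "p2", 0),
        ("2024-01-02", "p1", 1), ("2024-01-02", "p2", 1),
        ("2024-01-03", "p1", 0), ("2024-01-03", "p2", 0), ("2024-01-03", "p3", 1),
    ],
)

# Inner aggregation in SQL: one conversion rate per day.
daily_rates = [
    row[0]
    for row in db.execute("""
        SELECT CAST(SUM(converted) AS REAL) / COUNT(DISTINCT person_id)
        FROM events
        GROUP BY occurred_at
        ORDER BY occurred_at
    """)
]

# Outer aggregation: the median over those daily rates.
print(statistics.median(daily_rates))  # 0.5
```

Nothing here requires a query API; it only requires knowing what the metric means and which table holds the events.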
The shift is from constraining outputs to informing outputs.
This changes what the semantic layer's primary job is. It's no longer about query enforcement. It's about discoverability and documentation — a catalog that helps LLMs, analysts, and tools understand what's available and how to use it.
And that knowledge already lives in the transformation layer. It lives in the model that defines the join, the intermediate model that encodes the business logic, the YAML that names the metric. The question is whether you extract it automatically or re-type it by hand.
The Transformation Layer Is the Context Layer
The industry is converging on this realization from a different angle. Andreessen Horowitz recently published "Your Data Agents Need Context", arguing that data agents are "essentially useless without the right context" and that a new "context layer" is needed — a maintained corpus of business definitions, data source mappings, and tribal knowledge that agents can query in real time.
Their diagnosis is spot on. They describe the exact failure mode: an organization builds a data agent, asks it "what was revenue growth last quarter?", and the agent can't answer because it doesn't know how revenue is defined, which tables are canonical, or what "last quarter" means in the company's fiscal calendar. The team checks the semantic layer and finds it was last updated by someone who left a year ago. Sound familiar?
a16z frames the solution as a "modern context layer" — a superset of the semantic layer that includes canonical entities, identity resolution, business definitions, and operational instructions. They describe a multi-step process: automated context construction, human refinement, agent connection, and self-updating flows.
We agree with the framing but differ on where it should live. The a16z piece asks "where will this context layer live? Can it live in multiple places? Will it be its own standalone product?" Our answer: it should live in the transformation layer, where the knowledge originates.
Every property they describe — business definitions, canonical entities, identity resolution, data source mappings — is transformation-layer knowledge. The engineer building the dbt model is the one who knows how revenue is defined, which table is the source of truth, and how entities resolve across sources. When that knowledge is captured as YAML configuration and compiled into metadata tables on every build, the transformation layer becomes the context layer. It's automated, it's always current, and it doesn't require a separate product to maintain.
The context layer isn't a new thing to build. It's something you already have — if you capture the knowledge at the point where the engineer has maximum context.
Define Once, Compile to Metadata
Here's how dbt-nexus handles this.
Metrics: YAML in, metadata table out
`nexus_event_measurements`: 4 metrics (2 measure, 1 count, 1 derived)

| Metric | Description | Format | Tags |
|---|---|---|---|
| Revenue (measure `revenue`) | Platform revenue on paid bookings. | Currency (USD) | finance, kpi |
| Total Bookings (count `total_bookings`) | Total number of confirmed bookings. | Integer | operations |
| Avg Revenue / Booking (derived `average_revenue_per_booking`) | Average revenue earned per booking. Derived from revenue divided by total bookings. | Currency (USD) | finance |
| Refunds (measure `refund_amount`) | Total refund amounts issued. | Currency (USD) | finance |
Metrics are defined in dbt_project.yml, right alongside the models they
reference:
```yaml
vars:
  nexus:
    metrics:
      nexus_event_measurements:
        - name: revenue
          label: Revenue
          type: measure
          tables: [nexus_event_measurements, nexus_event_dimensions]
          metric_sql: nexus_event_measurements.revenue
          filter:
            - nexus_event_dimensions.is_revenue_earned = true
          format: currency
          unit: USD
          description: >
            Platform revenue on paid bookings.
          example_questions:
            - What was our revenue last month?
            - What is average revenue per booking?
```
The engineer defining this metric has the model open. They know the column name. They know the filter. They know the join path. They're not re-declaring knowledge — they're annotating it at the source.
On every dbt build, a Jinja model
called nexus_metrics_metadata iterates
over these YAML definitions and compiles them into a queryable table:
```sql
SELECT * FROM nexus_metrics_metadata
```
| model | metric_name | metric_type | metric_sql | filter | description |
|---|---|---|---|---|---|
| nexus_event_measurements | revenue | measure | nexus_event_measurements.revenue | nexus_event_dimensions.is_revenue_earned… | Platform revenue on paid bookings. |
| nexus_events | conversion_rate | derived | converting_persons / NULLIF(total_persons, 0) | — | Fraction of persons who converted. |
This table is the semantic layer catalog. It's rebuilt on every dbt build. It
can never drift from the underlying models because it's generated from the same
configuration that defines them.
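The real compile step is a Jinja model inside dbt, but the shape of the transformation is easy to picture. This Python sketch (with simplified, assumed field names) shows the same flattening: nested YAML config in, one flat metadata row per metric out.

```python
# Simplified stand-in for the nexus_metrics_metadata compile step.
# The config mirrors what would live under vars.nexus.metrics.
metrics_config = {
    "nexus_event_measurements": [
        {
            "name": "revenue",
            "type": "measure",
            "metric_sql": "nexus_event_measurements.revenue",
            "filter": ["nexus_event_dimensions.is_revenue_earned = true"],
            "description": "Platform revenue on paid bookings.",
        },
    ],
}

def compile_metadata_rows(config):
    """Flatten nested metric definitions into one row per metric."""
    rows = []
    for model, metrics in config.items():
        for m in metrics:
            rows.append({
                "model": model,
                "metric_name": m["name"],
                "metric_type": m["type"],
                "metric_sql": m["metric_sql"],
                "filter": " AND ".join(m.get("filter", [])),
                "description": m.get("description", ""),
            })
    return rows

rows = compile_metadata_rows(metrics_config)
print(rows[0]["model"], rows[0]["metric_name"])
```

Because the rows are generated from the same config that defines the models, there is nothing to keep in sync by hand.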
Dimensions: discovered, not declared
| Dimension | Description | Examples | Sources | Type | Distinct |
|---|---|---|---|---|---|
| Business Unit (`business_unit`) | Operating division of the company. | dtc, partner, wholesale | stripe, salesforce | string | 3 |
| Channel (`channel`) | Marketing acquisition channel. | organic, paid_search, referral | ga4, facebook_ads | string | 8 |
| Is Revenue Earned (`is_revenue_earned`) | Whether the revenue has been recognized per accrual rules. | true, false | stripe | boolean | 2 |
| Booking Type (`booking_type`) | Classification of the booking tier. | standard, premium, vip | stripe, internal_crm | string | 3 |
| Region (`region`) | Geographic region derived from billing address. | us_east, us_west, europe | stripe, salesforce | string | 5 |
Traditional semantic layers require you to declare every dimension. dbt-nexus discovers them automatically.
Every dimension_name that appears in the warehouse's EAV dimension table shows
up in nexus_event_dimensions_metadata — with warehouse-derived statistics, the
top three example values by frequency, and whether the dimension is boolean or
string-typed. No declaration step required.
You can enrich dimensions with YAML — adding descriptions, aliases, tags, and example questions — but the dimension exists in the catalog the moment data flows through it.
```sql
SELECT
  dimension_name,
  dimension_type,
  description,
  example_value_1,
  example_value_2,
  example_value_3,
  distinct_values
FROM nexus_event_dimensions_metadata
WHERE source IS NOT NULL
ORDER BY dimension_name
```
An LLM querying this table sees that business_unit has example values 'dtc',
'partner', 'wholesale' and knows exactly what filter values to use — without
the engineer having to declare them in a separate config.
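A sketch of how this kind of discovery can work, using a toy EAV table in SQLite (the table and column names are assumptions for illustration, not the exact nexus schema): distinct counts, top example values, and a boolean-vs-string type are all derivable from the data itself.

```python
import sqlite3
from collections import Counter

# Toy EAV-style dimensions table. Names are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE event_dimensions "
    "(event_id TEXT, dimension_name TEXT, dimension_value TEXT, source TEXT)"
)
db.executemany("INSERT INTO event_dimensions VALUES (?, ?, ?, ?)", [
    ("e1", "business_unit", "dtc", "stripe"),
    ("e2", "business_unit", "dtc", "stripe"),
    ("e3", "business_unit", "partner", "salesforce"),
    ("e4", "business_unit", "wholesale", "salesforce"),
    ("e5", "is_revenue_earned", "true", "stripe"),
    ("e6", "is_revenue_earned", "false", "stripe"),
])

def discover_dimensions(conn):
    """Derive per-dimension stats: distinct count, example values, inferred type."""
    counts_by_dim = {}
    for name, value in conn.execute(
        "SELECT dimension_name, dimension_value FROM event_dimensions"
    ):
        counts_by_dim.setdefault(name, Counter())[value] += 1
    catalog = {}
    for name, counts in counts_by_dim.items():
        values = set(counts)
        catalog[name] = {
            "distinct_values": len(values),
            "example_values": [v for v, _ in counts.most_common(3)],
            "dimension_type": "boolean" if values <= {"true", "false"} else "string",
        }
    return catalog

catalog = discover_dimensions(db)
print(catalog["business_unit"]["example_values"])  # most frequent first
```

The moment a new dimension_name flows through the table, it appears in the catalog with statistics attached; declaration is optional enrichment, not a prerequisite.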
Join paths: fixed by schema
In the nexus compressed schema, every facet (measurements, dimensions,
identifiers, traits) joins to events on event_id. There's one join path. It
doesn't need to be re-declared in a semantic layer config because the schema
is the join definition.
What You Get for Free
**Revenue** (measure `revenue` on `nexus_event_measurements`)

Overview

- Description: Platform revenue on paid bookings.
- Example questions: What was our revenue last month? / What is average revenue per booking? / Revenue by business unit?
- Also known as: total_revenue, gross_revenue
- Format: Currency (USD)
- Polarity: higher is better
- Precision: 0 decimal places

Technical

- Table dependencies: nexus_event_measurements, nexus_event_dimensions
- Filters: nexus_event_dimensions.is_revenue_earned = true
- SQL: nexus_event_measurements.revenue

Recent trend

- 7-day total: $31,500
- Change: 12.3%

When definitions live in the transformation layer, the metadata tables become the single catalog that downstream tools consume. Here's what falls out naturally:
Time-series previews. The metadata table knows which model a metric lives
on. The model has occurred_at. That's enough to generate sparklines and trend
charts — no query API needed.
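For instance, a trend query can be templated straight from a metadata row. This sketch assumes illustrative field names and a fixed join to the dimensions table on event_id; it is not the actual dbt-nexus template.

```python
# A compiled metadata row for a metric (illustrative field names).
metric = {
    "model": "nexus_event_measurements",
    "metric_sql": "nexus_event_measurements.revenue",
    "filter": "nexus_event_dimensions.is_revenue_earned = true",
}

def trend_query(metric, days=7):
    """Template a daily trend query from a metric metadata row."""
    where = f"WHERE {metric['filter']}" if metric.get("filter") else ""
    return (
        f"SELECT DATE(occurred_at) AS day, SUM({metric['metric_sql']}) AS value\n"
        f"FROM {metric['model']}\n"
        f"JOIN nexus_event_dimensions USING (event_id)\n"
        f"{where}\n"
        f"GROUP BY 1 ORDER BY 1 DESC LIMIT {days}"
    )

print(trend_query(metric))
```

Everything the query needs, the metric's SQL expression, its filter, and its home model, is already in the catalog row.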
Example values. The dimension metadata table derives the top three values by frequency per dimension per source, directly from the warehouse. An LLM or analyst sees concrete examples without querying the underlying data.
PR-level visibility. When a table transformation changes, the metric definitions that depend on it are in the same repo — often the same file. A single GitHub PR shows the table change and the metric update side by side. Reviewers see the full picture. Nothing silently drifts in a separate tool that nobody remembers to update.
Bidirectional traceability. Artifacts (reports, dashboards) reference metric names. The metric definition links to the models it depends on. Click a metric in a report and you land on its definition. Click the definition and you see its SQL, filters, and table dependencies.
Schema generation. The metric definitions are structured and precise enough
to deterministically generate Cube schemas,
MetricFlow YAML,
LookML views, or
any other tool-native format. A measure type maps to a tool's native measure.
A count maps to count-distinct. A derived maps to a calculated metric. The
semantic layer tool becomes a deployment target, not a source of truth.
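A sketch of that mapping for a Cube-style target. The output field names here approximate Cube's schema and are assumptions for illustration, not its exact spec; the point is that the mapping is a deterministic function of the catalog row.

```python
# Map catalog metric types to a Cube-style measure type (assumed mapping).
TYPE_MAP = {"measure": "sum", "count": "count_distinct", "derived": "number"}

def to_cube_measure(metric):
    """Generate a Cube-style measure dict from a catalog metric row."""
    measure = {
        "name": metric["metric_name"],
        "type": TYPE_MAP[metric["metric_type"]],
        "sql": metric["metric_sql"],
        "description": metric.get("description", ""),
    }
    if metric.get("filter"):
        measure["filters"] = [{"sql": metric["filter"]}]
    return measure

revenue = {
    "metric_name": "revenue",
    "metric_type": "measure",
    "metric_sql": "nexus_event_measurements.revenue",
    "filter": "nexus_event_dimensions.is_revenue_earned = true",
    "description": "Platform revenue on paid bookings.",
}
print(to_cube_measure(revenue)["type"])  # sum
```

Swapping the target tool means swapping this generator, not re-authoring the definitions.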
What About Query Translation and Caching?
This isn't an argument against semantic layer tools. It's an argument about where definitions live.
Because dbt-nexus metric definitions are structured and precise — with explicit SQL expressions, table dependencies, filters, and types — they contain everything needed to auto-generate a semantic layer schema for whatever tool you pick. Switching from Cube to MetricFlow, or from LookML to a custom query engine, is a code generation task, not a re-modeling task. The transformation layer stays the source of truth. The semantic layer API becomes an execution tool — interchangeable, replaceable, and never the place where definitions are authored.
Deterministic SQL generation. For production dashboards that need guaranteed-correct queries, a semantic layer query API is valuable. But the schemas that power it can be generated from the same YAML definitions that populate the metadata table. One definition, two outputs.
Pre-aggregation and caching. Materialized rollups are a performance optimization — a deployment concern, not a modeling concern. Tools like Cube handle this well. The rollup config can be generated alongside the schema.
Access control. Row-level and column-level security are enforcement mechanisms. The policies can be generated from the same metadata that describes what the data is. The transformation layer says "this is revenue"; the deployment layer says "only these roles can see it."
The position is simple: use semantic layer tools for execution. Use the transformation layer for definitions.
The Practical Test
When you add a new metric, how many files do you edit?
If the answer is one YAML block in your dbt project — and the metadata table, the browsable catalog, the LLM context, and (optionally) the Cube schema all update on the next build — then definitions live in the right place.
If the answer is one dbt model, one Cube schema file, one LookML view, and a dashboard config, you have a duplication problem that will eventually cause drift. Not because anyone is careless, but because maintaining the same knowledge in multiple places is inherently fragile.
Conclusion
The semantic layer is not a tool. It's a set of responsibilities.
Some belong in the transformation layer: metric definitions, dimension discovery, join paths, qualitative metadata. These are things the engineer knows at the moment they write the model. Capturing them there — in YAML, in the same PR, compiled to metadata on every build — eliminates the drift problem entirely.
Others belong in the query layer: SQL generation, caching, access control, pre-aggregation. These are execution concerns that semantic layer tools handle well. But they should consume definitions, not own them.
When you keep definitions in the transformation layer and generate everything else, you get a semantic layer that's always in sync, discoverable by default, and tool-agnostic.
The best semantic layer is one you never have to maintain.
For the full technical reference on how dbt-nexus metric definitions work, see the Metrics documentation.