Measurements & Semantic Layer Notes

Working notes from building the Go Koala booking measurements. These capture architectural insights about measurements, dimensions, semantic layers, and where business logic belongs, and should inform the measurements docs and potentially the nexus package design.


Measurements Are Semantic Labels, Not Metric Definitions

Measurements and metrics are distinct concepts that should not be conflated.

  • Measurement: A labeled quantitative observation on an event. "This booking's revenue value is $90.30." The measurement names the value and gives it a unit. It does not say anything about how to aggregate it or which events to aggregate over.
  • Metric: An aggregation rule over measurements. "Total Revenue = SUM(revenue) WHERE is_revenue_generating." The metric specifies the aggregation function, the measurement to aggregate, and the filter criteria.

Measurements live in the data (nexus). Metrics live in the query layer (semantic layer, LLM, analyst).
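The split can be sketched in a few lines. This is a minimal illustration using Python's sqlite3 with a simplified nexus_event_measurements shape (column names assumed for illustration, not the actual nexus schema):

```python
import sqlite3

# Measurements: labeled values stored in the data layer (nexus).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE nexus_event_measurements (
        event_id TEXT, measurement_name TEXT, value REAL, unit TEXT)
""")
con.executemany(
    "INSERT INTO nexus_event_measurements VALUES (?, ?, ?, ?)",
    [("evt_001", "revenue", 90.30, "usd"),
     ("evt_002", "revenue", 45.00, "usd")])

# Metric: an aggregation rule applied at query time -- never stored.
total_revenue = con.execute(
    "SELECT SUM(value) FROM nexus_event_measurements "
    "WHERE measurement_name = 'revenue'").fetchone()[0]
print(round(total_revenue, 2))  # 135.3
```

The measurement rows carry no aggregation semantics; the SUM and its filter live entirely in the query.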

Example: revenue on booking events

Every booking event (created, confirmed, paid, canceled, declined) carries a revenue measurement -- Koala's commission on that booking. The number is the same regardless of event type because it describes the booking's financial terms. What changes is the lifecycle context:

  • On booking created: expected/pipeline revenue
  • On booking paid: realized revenue
  • On booking canceled: lost revenue

The measurement layer does not decide which of these "counts" as revenue. That is the metric layer's job. A metric definition says: "Realized Revenue = SUM(revenue) WHERE is_revenue_generating." The measurement just labels the value.

This matters because baking metric logic into measurements (e.g., only putting revenue on paid events) conflates two concerns and limits what downstream consumers can do. An analyst who wants "expected revenue from bookings created this month" can't get it if revenue only exists on paid events.


Two Functions of Measurements

Measurements serve two distinct labeling functions:

1. Source-Specific Labeling

Name what the value literally is in the source's domain. These are comprehensible within a single source and preserve domain meaning.

Examples:

  • koala_commission -- Koala's commission cut on a booking
  • service_fee -- fee charged to the guest
  • owner_payout -- amount paid to the property owner
  • num_nights -- stay duration

2. Cross-Source Labeling (Aliasing)

Alias a source-specific value to a standard business concept that is consistent across sources. This is the core cross-source unification function.

Examples:

  • revenue -- "this is how we calculate revenue for this source's events"
    • Koala: alias of koala_commission (commission amount)
    • Stripe: alias of invoice total / 100
    • FreshBooks: alias of payment amount
    • Membership platform: alias of dues amount
  • tax -- Stripe: invoice tax / 100. FreshBooks: tax_amount_1.
  • processing_fee -- Koala: stripe_fee. Stripe: fee amount.

The source-specific name and the cross-source alias can coexist as separate measurement rows for the same value. The measurement model produces two rows:

event_id: evt_001, measurement_name: 'koala_commission', value: 90.30, unit: 'usd'
event_id: evt_001, measurement_name: 'revenue',          value: 90.30, unit: 'usd'

Same number, different semantic labels. koala_commission is what the value literally is. revenue is what business concept it maps to. An LLM querying "total revenue" uses the alias. An analyst investigating "Koala's commission structure" uses the source-specific name.

Not every measurement needs both. num_nights is just a source-specific measurement with no cross-source counterpart (unless multiple sources track stay duration). revenue is always a cross-source alias. Cross-source standard names (revenue, tax, expense, processing_fee) should be documented as nexus conventions.
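A sketch of how an extraction model might fan one source value out into both rows (the function and dict shape are hypothetical, not the actual nexus interface):

```python
def extract_measurements(event_id, koala_commission):
    """Fan one source value out into two measurement rows."""
    row = {"event_id": event_id, "value": koala_commission, "unit": "usd"}
    return [
        {**row, "measurement_name": "koala_commission"},  # what it literally is
        {**row, "measurement_name": "revenue"},           # cross-source alias
    ]

rows = extract_measurements("evt_001", 90.30)
# two rows, same value, different semantic labels
```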


Event Dimensions: The Missing Facet

Measurements solve cross-source labeling for quantitative values. But there is a parallel need for cross-source labeling of categorical/boolean event properties -- event dimensions.

The Problem

If an analyst asks "What was total revenue?", they need two pieces of information:

  1. Which measurement column? -- revenue (measurements solve this)
  2. Which events to aggregate over? -- ??? (nothing currently solves this cross-source)

Without a cross-source dimension, the analyst must know the per-source event names:

  • Koala: event_name = 'booking paid'
  • Stripe: event_name = 'invoice paid', or event_name = 'payment succeeded' but only for payments without invoices
  • FreshBooks: event_name = 'payment received'

That is source-specific knowledge -- the exact thing measurements exist to eliminate.

The Solution: nexus_event_dimensions

Event dimensions follow the same EAV facet pattern as measurements, identifiers, traits, and states:

Event: booking paid (evt_001)
  -> dimension: 'is_revenue_generating', value: true

Event: booking created (evt_002)
  -> (no is_revenue_generating dimension row)

Event: stripe payment succeeded (evt_003, no invoice)
  -> dimension: 'is_revenue_generating', value: true

Event: stripe payment succeeded (evt_004, has invoice)
  -> (no row -- covered by the invoice paid event)

Intermediate extraction models produce dimension rows. The nexus package pivots them into nexus_event_dimensions with one boolean column per dimension name, linked by event_id. Query:

SELECT SUM(m.revenue)
FROM nexus_event_measurements m
JOIN nexus_event_dimensions d ON m.event_id = d.event_id
WHERE d.is_revenue_generating

This works across all sources without any source-specific knowledge.
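The pivot and the query above can be run end to end. This sketch uses Python's sqlite3 with simplified schemas (a pre-pivoted nexus_event_measurements with a revenue column is assumed for brevity; the revenue figures are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nexus_event_measurements (event_id TEXT, revenue REAL);
    CREATE TABLE dimension_rows (event_id TEXT, dimension_name TEXT, value INTEGER);
    INSERT INTO nexus_event_measurements VALUES
        ('evt_001', 90.30), ('evt_002', 90.30), ('evt_003', 25.00);
    INSERT INTO dimension_rows VALUES
        ('evt_001', 'is_revenue_generating', 1),
        ('evt_003', 'is_revenue_generating', 1);
""")

# Pivot the EAV rows: one boolean column per dimension name, one row per event.
con.execute("""
    CREATE TABLE nexus_event_dimensions AS
    SELECT event_id,
           MAX(CASE WHEN dimension_name = 'is_revenue_generating'
                    THEN value END) AS is_revenue_generating
    FROM dimension_rows
    GROUP BY event_id
""")

# The cross-source query: no source-specific knowledge required.
total = con.execute("""
    SELECT SUM(m.revenue)
    FROM nexus_event_measurements m
    JOIN nexus_event_dimensions d ON m.event_id = d.event_id
    WHERE d.is_revenue_generating
""").fetchone()[0]
print(round(total, 2))  # 115.3
```

evt_002 (booking created) has no dimension row, so the join drops it; only the tagged events are summed.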

Why Booleans

Boolean dimensions (is_revenue_generating) are preferred over string tags or categories for universal concepts:

  • Self-documenting: WHERE is_revenue_generating reads like English. No string comparison, no quoting, no remembering valid values.
  • LLM-friendly: An LLM sees a boolean column in metadata and writes the correct query immediately. String categories require discovering the column AND the valid values -- more tool calls, more room for error.
  • Queryable: As simple as SQL gets. No ambiguity.

String-valued dimensions (categories) can coexist in the same EAV table for cases where mutually exclusive values within a group make sense (e.g., funnel_stage = 'conversion'). But for binary concepts like "is this event revenue-generating," booleans win.

Per-Event, Not Per-Source

Dimensions are per-event, not per-event-name or per-source. This is critical for edge cases:

  • Stripe payment succeeded events: some are revenue-generating (standalone payments), some are not (the payment is already covered by an invoice paid event). A simple list saying "payment succeeded = revenue generating" would be wrong. The per-event dimension handles the row-level granularity because the intermediate model has access to the actual data to make the determination.
  • Refunds: revenue-generating (negative revenue).
  • Test payments: never revenue-generating.

The source model engineer makes these calls while they have maximum context -- staring at the raw data, understanding the edge cases, writing the joins. The dimension is set once at build time and never leaks complexity downstream.
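A sketch of that row-level determination. The field names mirror Stripe's livemode/invoice_id shape, but the function itself is hypothetical:

```python
# Hypothetical per-row decision made inside the intermediate model,
# where the engineer has the actual data in hand.
def is_revenue_generating(payment):
    if not payment.get("livemode", True):  # test-mode payments never count
        return False
    if payment.get("invoice_id"):          # already covered by invoice paid
        return False
    return True  # standalone payments count (refunds too, as negative revenue)

payments = [
    {"id": "py_1", "livemode": True,  "invoice_id": None},
    {"id": "py_2", "livemode": True,  "invoice_id": "in_42"},
    {"id": "py_3", "livemode": False, "invoice_id": None},
]
tagged = [p["id"] for p in payments if is_revenue_generating(p)]
print(tagged)  # ['py_1']
```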

Which Concepts Deserve a Dimension?

Not every measurement needs a corresponding dimension. The test:

Does the business have a clear, unambiguous answer to "which events represent this concept being realized?"

  • revenue: Yes. A business knows exactly which events mean "money was earned." Multiple events per source are fine (Stripe invoice paid + standalone payments). -> is_revenue_generating
  • expense: Yes. A business knows which events represent costs incurred. -> is_expense_generating
  • num_nights: No. "How many nights?" is inherently contextual. There is no "nights were realized" event. The analyst SHOULD specify which event type. Forcing specificity produces better, more intentional queries.

The list of standard boolean dimensions should be very short -- truly universal concepts only.


Where Business Logic Belongs

The Principle

Semantic logic that requires data-level reasoning belongs in the transformation layer (dbt/nexus), not the semantic layer.

The transformation layer is where you have maximum context and minimum abstraction. The engineer building the source model knows:

  • "These payment events with invoice_id IS NOT NULL are already covered by invoice paid events -- don't double-tag them"
  • "Refunds are revenue-generating (negative revenue)"
  • "Test mode payments have livemode = false -- never tag those"
  • "This email field sometimes contains phone numbers"

Asking someone six months later to encode that same logic in a Cube YAML file -- far from the data, far from the source model, without the context of why certain payments are excluded -- is where mistakes happen. Or worse, the edge cases just don't get handled.

The Boundary

  Operation type                Where it belongs               Example
  Row-level data reasoning      dbt/nexus (transformation)     "Is this payment already covered by an invoice?"
  Cross-source labeling         dbt/nexus (transformation)     "This value is revenue"
  Cross-source dimensions       dbt/nexus (transformation)     "This event is_revenue_generating"
  Aggregation function          Inferrable / semantic layer    SUM, AVG, COUNT
  Post-aggregation arithmetic   Semantic layer                 conversion_rate = SUM(purchases) / SUM(users)
  Time intelligence             Semantic layer / query time    YoY, MTD, rolling averages

The moment you need to look at other rows, join tables, or apply conditional logic based on data, you are doing a transformation, not a definition. That belongs in dbt.

The Stripe Argument

This is the killer argument for doing semantic logic in dbt:

In Cube, determining which Stripe payments are revenue-generating would require something like:

# Hypothetical Cube config
measures:
  revenue:
    sql: >
      SUM(CASE
        WHEN event_name = 'invoice paid' THEN revenue
        WHEN event_name = 'payment succeeded'
          AND payment_id NOT IN (SELECT payment_id FROM stripe_invoices) THEN revenue
      END)

That is a subquery buried in a metric definition. It runs on every query. It is fragile, slow, hard to test, and hard for an LLM to generate correctly. In dbt, it is a WHERE clause in the intermediate dimensions model that runs once at build time, is tested, is version-controlled, and produces a clean boolean.
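The dbt-side alternative, sketched with sqlite3 and hypothetical model/table names: the exclusion logic is an ordinary WHERE clause that runs once at build time, and downstream queries see only the resulting rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stripe_payments (payment_id TEXT, invoice_id TEXT, amount REAL);
    INSERT INTO stripe_payments VALUES
        ('py_1', NULL, 50.0),    -- standalone payment: revenue-generating
        ('py_2', 'in_1', 75.0);  -- already covered by an invoice paid event
""")

# Build-time: the intermediate dimensions model is just a filtered SELECT.
con.execute("""
    CREATE TABLE int_stripe_dimension_rows AS
    SELECT payment_id AS event_id,
           'is_revenue_generating' AS dimension_name,
           1 AS value
    FROM stripe_payments
    WHERE invoice_id IS NULL
""")
rows = con.execute("SELECT event_id FROM int_stripe_dimension_rows").fetchall()
print(rows)  # [('py_1',)]
```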


Nexus as Semantic Layer

Nexus already provides most of what a traditional semantic layer defines:

  Semantic layer concern               Nexus provides it?   How
  What IS revenue across sources?      Yes                  revenue measurement (cross-source alias)
  Unit tagging                         Yes                  value_unit on measurements
  Entity linkage                       Yes                  Events -> Participants -> Entities
  Join paths                           Yes                  Fixed schema with documented joins
  Discoverability                      Yes                  Metadata tables per facet
  Cross-source dimensions              Proposed             nexus_event_dimensions
  Aggregation functions                Inferrable           Stevens' scales (ratio -> SUM, ordinal -> median)
  Derived metrics (post-aggregation)   No                   conversion_rate = purchases / users
  Time intelligence                    No                   YoY, MTD, rolling averages
  Query API                            No                   Cube, LLM, GenUI provide this

The only genuine semantic layer concerns nexus cannot absorb are:

  1. Derived metrics: Post-aggregation arithmetic like take_rate = SUM(revenue) / SUM(gmv). These don't exist as rows -- they are computed after GROUP BY. They require a formula definition that any query layer can read and apply.
  2. Time intelligence: YoY comparisons, MTD, rolling averages. Query-time window operations.
  3. Query API / infrastructure: Caching, access control, a query endpoint.
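The first of these, post-aggregation arithmetic, in miniature. The formula is applied to aggregates the database has already produced, never to stored rows (numbers invented for illustration):

```python
# Aggregates as they come back from a GROUP BY query.
aggregates = {"revenue": 500.0, "gmv": 10_000.0}

def take_rate(agg):
    # Derived metric: a formula the query layer applies after aggregation.
    return agg["revenue"] / agg["gmv"]

print(take_rate(aggregates))  # 0.05
```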

Everything else -- the hard part of "what does the data mean" -- is handled by nexus in the transformation layer, where it is version-controlled, testable, and close to the data.

This aligns with the commoditized complements strategy: nexus makes the data substrate so clean and self-documenting that the query layer becomes a lightweight, swappable commodity. Cube, an LLM writing raw SQL, GenUI, a dashboard tool -- they all work because nexus already did the semantic work. The query layer adds infrastructure, not knowledge.

Self-Documenting Data

The design principle underlying all of this: data should be self-documenting. An LLM (or analyst) should be able to understand the data by reading column names and metadata tables, without external documentation or tribal knowledge.

  • revenue as a column name tells you what the value is
  • is_revenue_generating as a column name tells you which events to filter
  • Metadata tables tell you what measurements/dimensions exist and from which sources
  • The schema shape (fixed columns + dynamic pivoted columns) tells you what is universal vs source-specific

When the schema is self-documenting, the query layer's job is mechanical translation (natural language -> SQL), not interpretation. That is what makes nexus work for LLMs: the semantic meaning is in the data, not in a config file the LLM cannot see.