Core Event Schema Boundary

Design principle for what belongs on nexus_events versus the facet tables (dimensions, measurements, identifiers, traits, attribution) and the source event tables.

Principle

Every column on nexus_events must be required. If a field cannot be guaranteed on every event from every source, it does not belong on the core event table — it belongs in a facet table (dimensions, measurements, identifiers, traits, attribution) or on the source event table.

There are no optional columns on nexus_events.
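As a rough illustration of the rule, the sketch below takes sample events from several sources and keeps only the fields that are present and non-null on every one of them. The function name `core_eligible_fields`, the sample rows, and the source names are hypothetical and chosen only for this example.

```python
def core_eligible_fields(events):
    """Return the fields that are present and non-null on every event."""
    if not events:
        return set()
    eligible = set(events[0])
    for event in events:
        # A field survives only if this event also carries a non-null value.
        eligible &= {name for name, value in event.items() if value is not None}
    return eligible


# Hypothetical sample: two sources, one of which has no source record ID.
sample_events = [
    {"event_id": "evt_1", "source": "billing", "source_record_id": "INV-1001"},
    {"event_id": "evt_2", "source": "ga4", "source_record_id": None},
    {"event_id": "evt_3", "source": "ga4"},
]

eligible = core_eligible_fields(sample_events)
```

Here `source_record_id` drops out because it is null on one event and missing on another, so under the rule it would belong in a facet table rather than on nexus_events.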

Why

The core event table is the nexus contract. Downstream consumers — output models, semantic layer generators, LLM agents — need to know exactly what they can rely on without checking for nulls or reading source-specific documentation. A column that is "usually there" is worse than a column that is always there or a column that lives in a well-defined facet table with metadata.

Optional columns create ambiguity:

  • Is it null because the source doesn't have this concept, or because of a bug?
  • Should the semantic layer expose it as a dimension? It's not in the metadata pipeline.
  • Does an LLM know to check for it? It's not in the facet catalog.

Required columns eliminate these questions. Facet tables with EAV metadata answer them systematically for everything else.

Current Required Fields

| Field | Type | Description |
| --- | --- | --- |
| event_id | STRING | Unique nexus event identifier |
| occurred_at | TIMESTAMP | Business timestamp |
| event_type | STRING | Event category |
| event_name | STRING | Specific event action |
| source | STRING | Source system name |
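One way to picture the contract is as a table where every column carries a NOT NULL constraint. The sketch below uses an in-memory SQLite database purely as a stand-in for the warehouse. The column names come from the table above, while the SQLite types, the helper `try_insert`, and the sample values are assumptions for illustration. In a warehouse, the same guarantee might instead be expressed as not-null tests.

```python
import sqlite3

# Stand-in DDL: TEXT replaces the warehouse's STRING and TIMESTAMP types.
DDL = """
CREATE TABLE nexus_events (
    event_id    TEXT PRIMARY KEY NOT NULL,
    occurred_at TEXT NOT NULL,
    event_type  TEXT NOT NULL,
    event_name  TEXT NOT NULL,
    source      TEXT NOT NULL
)
"""


def try_insert(conn, row):
    """Return True if the row satisfies the contract, False if it is rejected."""
    try:
        conn.execute("INSERT INTO nexus_events VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Hypothetical rows: the second one is missing its source.
complete_ok = try_insert(
    conn, ("evt_1", "2024-01-01T00:00:00Z", "order", "order_placed", "billing")
)
missing_source_ok = try_insert(
    conn, ("evt_2", "2024-01-01T00:05:00Z", "order", "order_placed", None)
)
```

The complete row is accepted and the row without a source is rejected, which is the behavior a consumer relies on when it skips null checks.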

Fields to Evaluate

The following fields are currently on or near the core event schema and need to be evaluated against the "must be required" rule:

| Field | Current Status | Evaluation |
| --- | --- | --- |
| event_description | Optional | Could be required as a soft requirement (warning-level test). Every event can produce a human-readable description. |
| significance | Optional | Candidate to move to measurements or dimensions. |
| _ingested_at | Optional | Operational metadata. Could be required with warning-level test. |
| _processed_at | Optional | Operational metadata. Could be required with warning-level test. |
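The difference between a hard requirement and a warning-level one can be sketched as a check that reports two lists instead of one. This is only an illustration: the grouping of fields into `HARD_REQUIRED` and `SOFT_REQUIRED` mirrors the tables above, but the function `check_event` and the sample event are hypothetical, and the fields under evaluation have not actually been promoted.

```python
# Fields from "Current Required Fields": a null here is an error.
HARD_REQUIRED = ("event_id", "occurred_at", "event_type", "event_name", "source")

# Candidates under evaluation: a null here would only raise a warning.
SOFT_REQUIRED = ("event_description", "_ingested_at", "_processed_at")


def check_event(event):
    """Return (errors, warnings) as lists of missing or null field names."""
    errors = [name for name in HARD_REQUIRED if event.get(name) is None]
    warnings = [name for name in SOFT_REQUIRED if event.get(name) is None]
    return errors, warnings


# Hypothetical event with all core fields but no processing timestamp.
event = {
    "event_id": "evt_1",
    "occurred_at": "2024-01-01T00:00:00Z",
    "event_type": "order",
    "event_name": "order_placed",
    "source": "billing",
    "event_description": "Order placed",
    "_ingested_at": "2024-01-01T00:01:00Z",
}

errors, warnings = check_event(event)
```

An event like this one would pass the contract while still surfacing the missing `_processed_at` for follow-up.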

Where Non-Core Data Belongs

| Data type | Home | Example |
| --- | --- | --- |
| Quantitative values | Measurements (EAV → pivot) | revenue, annual_premium_price |
| Cross-source categorical tags | Dimensions (EAV → pivot) | is_revenue_earned, source_record_id |
| Person/group identifiers | Identifiers (EAV) | email, phone, customer_id |
| Entity properties | Traits (EAV) | first_name, city, plan_type |
| Touchpoint click IDs | Attribution models | fbclid, gclid |
| Source-specific fields | Source event tables | contract_number, appointment_status |

source_record_id is a good example of why this boundary matters. It is the source system's primary business identifier for the event record (contract number, invoice ID, order ID). It is broadly useful but not universal: GA4 pageviews, synthetic events, and session boundaries have no meaningful source record ID. That makes it ineligible for the core event table and a natural fit for dimensions, where absence means null in the pivot without ambiguity.
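The "absence means null in the pivot" behavior can be sketched with a small EAV table and a conditional-aggregation pivot. SQLite stands in for the warehouse here. The table name `dimensions_eav`, its column names, and the sample rows are assumptions for illustration and not the actual nexus models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimensions_eav (
    event_id        TEXT NOT NULL,
    dimension_name  TEXT NOT NULL,
    dimension_value TEXT
);
-- evt_1 is an invoice event; evt_2 is a pageview with no source record ID.
INSERT INTO dimensions_eav VALUES
    ('evt_1', 'source_record_id',  'INV-1001'),
    ('evt_1', 'is_revenue_earned', 'true'),
    ('evt_2', 'is_revenue_earned', 'false');
""")

# Pivot: one row per event_id, one column per dimension name.
PIVOT_SQL = """
SELECT
    event_id,
    MAX(CASE WHEN dimension_name = 'source_record_id'
             THEN dimension_value END) AS source_record_id,
    MAX(CASE WHEN dimension_name = 'is_revenue_earned'
             THEN dimension_value END) AS is_revenue_earned
FROM dimensions_eav
GROUP BY event_id
ORDER BY event_id
"""

pivoted = conn.execute(PIVOT_SQL).fetchall()
```

In this sketch the pageview row comes back with a null `source_record_id` simply because no EAV row exists for it, so the null carries a single meaning: the source never emitted that dimension for the event.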

The Join Tradeoff

Cleaner separation of concerns means more joins. This is an intentional choice:

  • nexus_events alone answers "what happened?" — event counts, timelines, source breakdowns. That covers a large class of questions.
  • Adding "how much?" requires joining measurements.
  • Adding "which business concept?" requires joining dimensions.
  • Adding "who was involved?" requires joining participants and entities.

Each join adds a well-defined facet with its own metadata, tests, and semantic layer discoverability. The alternative — a wide events table with every possible column — trades discoverability and contract stability for fewer joins.

These Joins Are Cheap

The pivoted facet tables (nexus_event_dimensions, nexus_event_measurements) have one row per event_id, the same grain as nexus_events. Joining them is a 1:1 primary key join, which modern columnar engines (Snowflake, BigQuery, Databricks) typically execute as a hash join. There is no cardinality fan-out and no row explosion: the joined result keeps exactly one row per event. The cost of these joins at query time is effectively negligible for most queries.

A wide events table does save joins, but its real cost is lost contract stability, lost metadata discoverability, and the inability to add new facets without altering the core schema.
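The grain argument can be checked with row counts. The sketch below, again using SQLite as a stand-in, joins nexus_events to a pivoted measurements table and, for contrast, to an unpivoted EAV table. The EAV table name `measurements_eav`, the reduced column sets, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nexus_events (event_id TEXT PRIMARY KEY, event_name TEXT NOT NULL);
CREATE TABLE nexus_event_measurements (event_id TEXT PRIMARY KEY, revenue REAL);
CREATE TABLE measurements_eav (event_id TEXT, measurement_name TEXT, value REAL);

INSERT INTO nexus_events VALUES
    ('evt_1', 'order_placed'), ('evt_2', 'page_view'), ('evt_3', 'order_placed');
INSERT INTO nexus_event_measurements VALUES
    ('evt_1', 120.0), ('evt_3', 80.0);
INSERT INTO measurements_eav VALUES
    ('evt_1', 'revenue', 120.0),
    ('evt_1', 'annual_premium_price', 1440.0),
    ('evt_3', 'revenue', 80.0);
""")


def count(sql):
    """Run a COUNT(*) query and return the single integer result."""
    return conn.execute(sql).fetchone()[0]


events_count = count("SELECT COUNT(*) FROM nexus_events")

# Pivoted facet: at most one match per event_id, so the grain is preserved.
pivot_join_count = count("""
    SELECT COUNT(*) FROM nexus_events e
    LEFT JOIN nexus_event_measurements m USING (event_id)
""")

# Unpivoted EAV: evt_1 matches two rows, so the join fans out.
eav_join_count = count("""
    SELECT COUNT(*) FROM nexus_events e
    LEFT JOIN measurements_eav m USING (event_id)
""")
```

The pivoted join returns the same number of rows as nexus_events, while the EAV join returns one extra row for the event that has two measurements. That difference is why consumers join the pivoted tables rather than the raw EAV rows.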

Impact on Semantic Layer Generation

The facet pipeline (EAV → union → pivot → metadata) is the mechanism nexus uses for auto-discovery. A field on nexus_events is not self-describing — the generator must be taught about it explicitly. A field in a facet table is automatically cataloged in the metadata table and available for semantic layer generation without configuration.

This is the strongest argument for keeping nexus_events narrow: everything outside it flows through a pipeline that makes it discoverable.
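A minimal sketch of why facet fields are discoverable without configuration: if the metadata catalog is derived from the unioned EAV rows, a new field appears in the catalog as soon as any source emits it. The row layout, the function `build_metadata`, and the sample values below are assumptions for illustration and do not describe the actual nexus metadata table.

```python
from collections import defaultdict

# Hypothetical unioned EAV rows: (event_id, source, facet, name, value).
eav_rows = [
    ("evt_1", "billing", "measurement", "revenue", 120.0),
    ("evt_1", "billing", "dimension", "source_record_id", "INV-1001"),
    ("evt_2", "crm", "dimension", "is_revenue_earned", "false"),
    ("evt_3", "billing", "measurement", "revenue", 80.0),
]


def build_metadata(rows):
    """Derive a catalog keyed by (facet, name) from the EAV rows alone."""
    catalog = defaultdict(lambda: {"sources": set(), "row_count": 0})
    for _event_id, source, facet, name, _value in rows:
        entry = catalog[(facet, name)]
        entry["sources"].add(source)
        entry["row_count"] += 1
    return dict(catalog)


metadata = build_metadata(eav_rows)

# A source starts emitting a new measurement; no catalog config changes.
extended_rows = eav_rows + [
    ("evt_4", "crm", "measurement", "annual_premium_price", 1440.0)
]
extended_metadata = build_metadata(extended_rows)
```

In this sketch a semantic layer generator would only need to iterate over the catalog keys. A column added directly to nexus_events has no equivalent entry and would have to be registered by hand.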