BigQuery optimizations

Optional `partition_by`, `cluster_by`, and informational `PRIMARY KEY` / `FOREIGN KEY` constraints applied to the six core nexus_* models. Single feature toggle; on by default for BigQuery, off elsewhere.

A single toggle adds three classes of warehouse-side optimization to the canonical core models. All three are BigQuery-only, all three are on by default on BigQuery and a clean no-op on Snowflake, and the whole feature can be opted out of with one line of YAML if a tenant's needs differ.

The toggle

# In your dbt_project.yml
vars:
  nexus:
    warehouse_optimization:
      enabled: null   # default — auto: on for BigQuery, off elsewhere
      # enabled: false  # force off (BigQuery user opts out)
      # enabled: true   # force on (Snowflake user opts in — applies cluster_by only)

Auto mode (null or unset) is what every consuming project gets out of the box.

What gets applied per model

Model partition_by cluster_by PK FK
nexus_events occurred_at (month) event_name, source event_id
nexus_event_dimensions occurred_at (month) event_id event_id event_id → events
nexus_event_measurements occurred_at (month) event_id event_id event_id → events
nexus_entity_participants occurred_at (month) entity_type, event_id entity_participant_id event_id → events, entity_id → entities
nexus_entities (none — no temporal col) entity_type entity_id
nexus_relationships (none) relationship_type, entity_a_type, entity_b_type relationship_id entity_a_id → entities, entity_b_id → entities

Pre-existing keys on nexus_entity_states and nexus_entity_state_participants are left alone.

Why these specific keys?

Not invented — derived from a corpus grep across every production artifact in app/src/artifacts/. The dominant access pattern:

  • 173+ occurred_at >= and 158+ occurred_at < predicates → partition by occurred_at.
  • event_name = 'enrollment' 61×, event_name = 'search' 19×, source = 'segment' 12× → cluster nexus_events by event_name, source.
  • entity_type = 'person' 120×, 'group' 39× → cluster nexus_entities and nexus_entity_participants on entity_type.
  • relationship_type = 'advisor' 19×, 'member' 6× → cluster nexus_relationships on relationship_type first.
  • 110+ joins on event_id → cluster the dimension / measurement / participant tables on event_id for hash-join locality and potential PK-driven block prune.

If your tenant's workload skews differently, the toggle is a clean opt-out; per-tenant overrides via dbt post-hooks in your own project are the right next move.

Why monthly partitioning, not daily?

Real tenant event histories routinely span more than BigQuery's 4000-partition cap when partitioned daily. We saw this on a real client whose Gmail history goes back >11 years — daily CTAS fails with "Too many partitions produced by query."

Monthly granularity gives 4000 / 12 ≈ 333 years of headroom while still pruning a 90-day query to ~4 partitions ≈ 120 days scanned (versus the 90 daily would give). The difference is dwarfed by going from "scan everything" to "scan four months".

Informational PK/FK constraints

The BigQuery ALTER TABLE ... ADD PRIMARY KEY ... NOT ENFORCED and ADD FOREIGN KEY ... NOT ENFORCED constraints are hints to the query optimizer, not validations. They're free (no storage, no validation pass, idempotent re-application), and they let BigQuery do two important things:

  1. Join elimination. A query like SELECT e.event_id FROM nexus_events e LEFT JOIN nexus_event_dimensions d ON e.event_id = d.event_id scans only nexus_events. BigQuery sees the FK→PK chain, knows the join produces exactly one matched row per left row (so it can't change the result), and drops the join entirely.
  2. Better cardinality estimates → better join order. When BigQuery knows event_id is unique on one side, it picks the right build side and the right join algorithm.

The safety net

NOT ENFORCED means BigQuery trusts the constraint blindly. If a declared PK had duplicates, BigQuery's plans based on that "uniqueness" would produce wrong results, not just slow ones.

dbt-nexus makes the constraints safe by pairing every declared PK with an error-severity unique test on the same column. If the PK ever drifts, the dbt build halts before the constraint can be applied. The build halt is what makes "informational" trustworthy here.

FK referential integrity isn't formally tested, but it holds by build graph topology — every nexus_event_dimensions, nexus_event_measurements, and nexus_entity_participants row is derived from a nexus_events row, and nexus_relationships / nexus_entity_participants derive from nexus_entities. There's no way to produce an orphan FK in the canonical pipeline.

Snowflake behavior

The feature defaults off on Snowflake because Snowflake auto-clustering is billed background work — opt-in only.

If a Snowflake operator sets enabled: true explicitly, dbt-nexus applies cluster_by (Snowflake supports it) but skips partition_by (BigQuery-only concept; the macro returns none) and skips the informational constraints (Snowflake's PRIMARY KEY / FOREIGN KEY constraint semantics differ and aren't part of this feature).

Implementation notes

For maintainers / contributors:

  • The toggle is a single Jinja boolean macro (macros/config/nexus_warehouse_optimization_enabled.sql) consulted by both the partition and cluster helper macros.
  • Cluster and partition kwargs are generated by macros that return none when the toggle is off — partition_by=none and cluster_by=none are dbt's "no-op" forms, so the model behaves exactly as before when disabled.
  • The participants → entities FK is attached from nexus_entities' post_hook (as an external_foreign_keys entry), not from nexus_entity_participants. Reason: nexus_entities transitively depends on nexus_entity_participants via identity resolution, so the natural depends_on direction (participants knows about entities) would cycle the dbt compile DAG.
  • FK targets are constructed lexically from {{ this.database }}.{{ this.schema }}.<sibling_table>, not via ref(). Going through ref() would re-introduce that DAG cycle by adding a constraint-time dependency edge.
  • All core models live in the same dataset by package convention (nexus_{{ target.name }}), so the lexical sibling lookup is safe.