BigQuery optimizations
Optional `partition_by`, `cluster_by`, and informational `PRIMARY KEY` / `FOREIGN KEY` constraints applied to the six core nexus_* models. Single feature toggle; on by default for BigQuery, off elsewhere.
A single toggle adds three classes of warehouse-side optimization to the canonical core models. All three are BigQuery-only, all three are on by default on BigQuery and a clean no-op on Snowflake, and the whole feature can be opted out of with one line of YAML if a tenant's needs differ.
The toggle
# In your dbt_project.yml
vars:
nexus:
warehouse_optimization:
enabled: null # default — auto: on for BigQuery, off elsewhere
# enabled: false # force off (BigQuery user opts out)
# enabled: true # force on (Snowflake user opts in — applies cluster_by only)
Auto mode (null or unset) is what every consuming project gets out
of the box.
What gets applied per model
| Model | partition_by |
cluster_by |
PK | FK |
|---|---|---|---|---|
nexus_events |
occurred_at (month) |
event_name, source |
event_id |
— |
nexus_event_dimensions |
occurred_at (month) |
event_id |
event_id |
event_id → events |
nexus_event_measurements |
occurred_at (month) |
event_id |
event_id |
event_id → events |
nexus_entity_participants |
occurred_at (month) |
entity_type, event_id |
entity_participant_id |
event_id → events, entity_id → entities |
nexus_entities |
(none — no temporal col) | entity_type |
entity_id |
— |
nexus_relationships |
(none) | relationship_type, entity_a_type, entity_b_type |
relationship_id |
entity_a_id → entities, entity_b_id → entities |
Pre-existing keys on nexus_entity_states and
nexus_entity_state_participants are left alone.
Why these specific keys?
Not invented — derived from a corpus grep across every production
artifact in app/src/artifacts/. The dominant access pattern:
- 173+
occurred_at >=and 158+occurred_at <predicates → partition byoccurred_at. event_name = 'enrollment'61×,event_name = 'search'19×,source = 'segment'12× → clusternexus_eventsbyevent_name, source.entity_type = 'person'120×,'group'39× → clusternexus_entitiesandnexus_entity_participantsonentity_type.relationship_type = 'advisor'19×,'member'6× → clusternexus_relationshipsonrelationship_typefirst.- 110+ joins on
event_id→ cluster the dimension / measurement / participant tables onevent_idfor hash-join locality and potential PK-driven block prune.
If your tenant's workload skews differently, the toggle is a clean opt-out; per-tenant overrides via dbt post-hooks in your own project are the right next move.
Why monthly partitioning, not daily?
Real tenant event histories routinely span more than BigQuery's 4000-partition cap when partitioned daily. We saw this on a real client whose Gmail history goes back >11 years — daily CTAS fails with "Too many partitions produced by query."
Monthly granularity gives 4000 / 12 ≈ 333 years of headroom while
still pruning a 90-day query to ~4 partitions ≈ 120 days scanned
(versus the 90 daily would give). The difference is dwarfed by going
from "scan everything" to "scan four months".
Informational PK/FK constraints
The BigQuery ALTER TABLE ... ADD PRIMARY KEY ... NOT ENFORCED and
ADD FOREIGN KEY ... NOT ENFORCED constraints are hints to the
query optimizer, not validations. They're free (no storage, no
validation pass, idempotent re-application), and they let BigQuery
do two important things:
- Join elimination. A query like
SELECT e.event_id FROM nexus_events e LEFT JOIN nexus_event_dimensions d ON e.event_id = d.event_idscans onlynexus_events. BigQuery sees the FK→PK chain, knows the join produces exactly one matched row per left row (so it can't change the result), and drops the join entirely. - Better cardinality estimates → better join order. When BigQuery
knows
event_idis unique on one side, it picks the right build side and the right join algorithm.
The safety net
NOT ENFORCED means BigQuery trusts the constraint blindly. If a
declared PK had duplicates, BigQuery's plans based on that
"uniqueness" would produce wrong results, not just slow ones.
dbt-nexus makes the constraints safe by pairing every declared PK
with an error-severity unique test on the same column. If the
PK ever drifts, the dbt build halts before the constraint can be
applied. The build halt is what makes "informational" trustworthy
here.
FK referential integrity isn't formally tested, but it holds by
build graph topology — every nexus_event_dimensions,
nexus_event_measurements, and nexus_entity_participants row is
derived from a nexus_events row, and nexus_relationships /
nexus_entity_participants derive from nexus_entities. There's no
way to produce an orphan FK in the canonical pipeline.
Snowflake behavior
The feature defaults off on Snowflake because Snowflake auto-clustering is billed background work — opt-in only.
If a Snowflake operator sets enabled: true explicitly, dbt-nexus
applies cluster_by (Snowflake supports it) but skips partition_by
(BigQuery-only concept; the macro returns none) and skips the
informational constraints (Snowflake's PRIMARY KEY / FOREIGN KEY
constraint semantics differ and aren't part of this feature).
Implementation notes
For maintainers / contributors:
- The toggle is a single Jinja boolean macro
(
macros/config/nexus_warehouse_optimization_enabled.sql) consulted by both the partition and cluster helper macros. - Cluster and partition kwargs are generated by macros that return
nonewhen the toggle is off —partition_by=noneandcluster_by=noneare dbt's "no-op" forms, so the model behaves exactly as before when disabled. - The
participants → entitiesFK is attached fromnexus_entities'post_hook(as anexternal_foreign_keysentry), not fromnexus_entity_participants. Reason:nexus_entitiestransitively depends onnexus_entity_participantsvia identity resolution, so the naturaldepends_ondirection (participants knows about entities) would cycle the dbt compile DAG. - FK targets are constructed lexically from
{{ this.database }}.{{ this.schema }}.<sibling_table>, not viaref(). Going throughref()would re-introduce that DAG cycle by adding a constraint-time dependency edge. - All core models live in the same dataset by package convention
(
nexus_{{ target.name }}), so the lexical sibling lookup is safe.