Nexus Data

Entity resolution (ER) is the process of determining that two or more identifiers refer to the same real-world entity — and merging them into a single resolved entity_id. It's how Nexus knows that john@company.com in Gmail, john@company.com in Stripe, and phone number 555-1234 from Notion all belong to the same person.

Not all entities need ER. Subscriptions, contracts, and other objects that exist in a single source system skip this entirely — see Entity Types for when to use ER vs non-ER registration. This document covers the ER pipeline for entity types configured with entity_resolution: true.

Pipeline Overview

Source Identifiers → Union → Edge Creation → Connected-Components Resolution → Entity Table → Participants

Step	Model	What Happens
0	`*_entity_identifiers`	Source formats identifiers into standard schema
1	`nexus_entity_identifiers`	All sources unioned into one table
2	`nexus_entity_identifiers_edges`	Edges created between co-occurring identifiers
3	`nexus_resolved_{type}_identifiers`	Jinja-unrolled connected-components resolution
4	`nexus_entities`	Final entity table with pivoted traits
5	`nexus_entity_participants`	Events linked to resolved entities

Each ER entity type (e.g., person, group) is resolved independently for performance and debuggability.

Step 0: Source Identifier Formatting

Each source creates an *_entity_identifiers model that extracts identifiers from events into a standard schema:

entity_identifier_id  -- Unique ID (ent_idfr_ prefix)
event_id              -- Links to the originating event
edge_id               -- Groups identifiers that belong together
entity_type           -- 'person', 'group', etc.
identifier_type       -- e.g., 'email', 'domain', 'phone'
identifier_value      -- The actual value
role                  -- Entity's role in the event (optional)
occurred_at           -- Event timestamp
source                -- Source system name

The edge_id is the key concept: identifiers sharing the same edge_id are assumed to belong to the same entity. Typically edge_id = event_id when each event involves a single entity. For events with multiple participants, use a composite like event_id || identifier_value to avoid linking unrelated entities.

Example — Gmail person identifiers:

SELECT
    {{ nexus.create_nexus_id('entity_identifier', ['event_id', 'email', "'person'", "'sender'"]) }}
        as entity_identifier_id,
    event_id,
    event_id as edge_id,
    'person' as entity_type,
    'email' as identifier_type,
    sender_email as identifier_value,
    'sender' as role,
    occurred_at,
    'gmail' as source
FROM {{ ref('gmail_message_events') }}
WHERE sender_email IS NOT NULL

Sources union their intermediate identifier models into a single {source}_entity_identifiers model:

-- stripe_entity_identifiers.sql
{{ dbt_utils.union_relations([
    ref('stripe_person_identifiers'),
    ref('stripe_group_identifiers')
]) }}

Step 1: Identifier Collection

The process_entity_identifiers() macro unions all enabled source identifier models into nexus_entity_identifiers — a single table containing every identifier from every source, tagged with entity_type.

This is the input to the ER algorithm. No resolution has happened yet — the same person might appear as dozens of separate rows with different identifiers from different sources.

Step 2: Edge Creation

Model: nexus_entity_identifiers_edges

Edges represent connections between identifiers that should resolve to the same entity. Two identifiers are connected when they share the same edge_id.

The create_identifier_edges() macro:

Self-joins nexus_entity_identifiers on edge_id within each entity_type to find co-occurring identifier pairs
Deduplicates using a surrogate key on identifier types and values — the same two identifiers connected by 10,000 events produce exactly one edge
Optionally filters based on edge quality thresholds (see Edge Quality)

-- Conceptual logic (simplified)
SELECT DISTINCT
    a.identifier_type  as identifier_type_a,
    a.identifier_value as identifier_value_a,
    b.identifier_type  as identifier_type_b,
    b.identifier_value as identifier_value_b
FROM nexus_entity_identifiers a
JOIN nexus_entity_identifiers b
    ON a.edge_id = b.edge_id
    AND a.entity_type = b.entity_type
    AND (a.identifier_type != b.identifier_type
         OR a.identifier_value != b.identifier_value)

Why Deduplication Matters

Without deduplication, high-frequency entities create massive edge explosion. An entity with 26,000 events and 2 identifier types would generate 26,000 duplicate edges. Across all entities this becomes billions of redundant rows.

With surrogate-key deduplication, only unique identifier relationships are kept:

Metric	Before	After
Edges	1.8M duplicates	790 unique
Processing time	Hours	~4 seconds
Reduction	—	99.96%

Step 3: Connected-Components Resolution

Model: nexus_resolved_{type}_identifiers (e.g., nexus_resolved_person_identifiers, nexus_resolved_group_identifiers)

This is the core of entity resolution. The resolve_identifiers() macro implements a depth-bounded connected-components algorithm.

How It's Rendered: Jinja-Unrolled Traversal

Because max_recursion is a dbt var, it's known at Jinja-compile time. We exploit that to unroll the traversal: render one CTE per hop using a Jinja for loop, instead of using SQL's WITH RECURSIVE primitive. Each level uses UNION (or UNION DISTINCT on BigQuery) which dedups (component, reachable) pairs — cycles are absorbed naturally with no path tracking required.

This is a real architectural choice, not just style. WITH RECURSIVE can't spill to disk on Snowflake (causing OOMs at scale) and has structural restrictions on BigQuery (can't be embedded in surrounding CTEs or UNION ALL clauses). Plain CTEs avoid both problems and use the warehouse's standard hash-join machinery, which spills gracefully when memory is tight. See Why not WITH RECURSIVE? below for the full rationale.

Phase A: Component Discovery

Every identifier starts as its own component. Each compile-time-unrolled CTE extends reachability one more hop:

WITH depth_0 AS (
    -- Base: every identifier is its own component
    SELECT DISTINCT
        identifier_type as component_identifier_type,
        identifier_value as component_identifier_value,
        identifier_type,
        identifier_value
    FROM nexus_entity_identifiers
    WHERE entity_type = '{type}'
)
{% for level in range(1, max_recursion + 1) %}
, depth_{{ level }} AS (
    -- Hop {{ level }}: extend reachability one more edge
    SELECT * FROM depth_{{ level - 1 }}
    UNION
    SELECT
        d.component_identifier_type,
        d.component_identifier_value,
        e.identifier_type_b,
        e.identifier_value_b
    FROM depth_{{ level - 1 }} d
    JOIN nexus_entity_identifiers_edges e
      ON d.identifier_type  = e.identifier_type_a
     AND d.identifier_value = e.identifier_value_a
)
{% endfor %}

For max_recursion: 3, this renders 4 CTEs (depth_0 through depth_3). The final depth_N is the set of all (component_identifier, reachable_identifier) pairs within N hops.

Phase B: Entity ID Assignment

Each identifier is assigned to the lexicographically first identifier in its connected component. A deterministic entity_id is generated from that component root:

SELECT
    identifier_type,
    identifier_value,
    {{ nexus.create_nexus_id('entity', [...]) }} as entity_id
FROM depth_{max_recursion}

Example — before and after resolution:

Before:
  email: john@company.com    → (no entity_id yet)
  phone: 555-1234            → (no entity_id yet)
  email: john@newcompany.com → (no entity_id yet)

After (all connected via edges):
  email: john@company.com    → entity_id: ent_a1b2c3...
  phone: 555-1234            → entity_id: ent_a1b2c3...
  email: john@newcompany.com → entity_id: ent_a1b2c3...

Recursion Depth

Configure with nexus.max_recursion in dbt_project.yml (default: 5). For most datasets, 3 is sufficient and significantly faster. Each level adds one more self-join over the edges table; deeper depths have proportionally higher cost.

vars:
  nexus:
    max_recursion: 3

Why not `WITH RECURSIVE`?

Earlier versions of this macro used SQL's WITH RECURSIVE primitive with per-row path-string concatenation for cycle detection. Three problems:

Snowflake OOMs at scale. The recursive operator can't spill to disk gracefully. On graphs of ~10M+ identifiers, intermediate path-tracking state exceeds per-query memory and the query aborts with error 100298 (Recursive Join ran out of memory).
BigQuery structural restrictions. WITH RECURSIVE in BigQuery must be top-level of the WITH clause, can't reference upstream CTEs in its recursive body, and can't be UNION ALL'd with non-recursive CTEs. This forced architectural decisions like splitting person/group resolution into separate models.
Substring-collision bug in Snowflake cycle detection. The cycle check not contains(rc.path, type || ':' || value) did a substring match on a concatenated path string. Short identifier values (email=p, phone=512982) appeared as substrings inside longer path strings and incorrectly blocked legitimate traversals, stranding affected identifiers from their true connected components.

The Jinja-unrolled form fixes all three: regular hash joins spill cleanly, there are no structural restrictions on surrounding SQL, and UNION-based dedup is set-membership (no substring false positives).

The trade-off is rendered SQL length: at max_recursion=3 it's 4 CTE blocks instead of 1 recursive block. At higher depths the rendered SQL grows linearly. Compile time and warehouse SQL-length limits are not practical constraints at the scales we run.

Step 4: Entity Table

Model: nexus_entities

The finalize_entities() macro creates the final entity table by:

Pivoting resolved pre-resolution traits (name, email, domain, etc.) from EAV format into columns — trait columns are discovered dynamically at compile time
Pivoting computed traits (risk_tier, display_name, etc.) from nexus_computed_traits into additional columns — also discovered at compile time
Computing timestamps: _created_at, _updated_at, _last_merged_at, first_interaction_at, last_interaction_at
Combining ER entities with non-ER registered entities into a single table

The output has one row per entity with trait values (both pre-resolution and computed) as columns. See Entities for the full trait lifecycle.

Step 5: Event Participation

Model: nexus_entity_participants

The finalize_participants() macro links resolved entities back to events:

Joins nexus_entity_identifiers with resolved identifiers to map each event's identifiers to their resolved entity_id
Preserves the role from Step 0 (e.g., sender, recipient, contact)
Deduplicates — if the same entity participates in the same event via multiple identifiers, only one participation record is kept per role

This is the bridge table that connects nexus_events to nexus_entities.

Algorithm Properties

Correctness

Transitivity: If A connects to B and B connects to C, all three get the same entity_id
Determinism: Same input always produces the same entity IDs
Completeness: Every source identifier is resolved

Performance Characteristics

Component	Complexity	Optimization
Identifier collection	O(n) events	Table materialization
Edge creation	O(u) unique edges	Surrogate-key deduplication
Traversal resolution	O(u × d)	Jinja-unrolled per-hop CTEs
Entity finalization	O(g) entities	Single-pass trait pivot
Participation	O(n) events	Efficient joins

Where n = events, u = unique edges, d = max traversal depth, g = unique entities.

Relationship to Non-ER Entities

Non-ER entity types (entity_resolution: false) skip Steps 2-3 entirely. They register directly via register_entities() and are combined with ER entities in Step 4. Their event participation is resolved by a simpler lookup against the registration model rather than the full ER pipeline.

See Entity Types for details.

Incremental Entity Resolution — Flag-gated incremental resolution via graph contraction (process only what arrived since the last run)
Entity Types — When to use ER vs non-ER
Edge Quality — Preventing over-merge with edge quality thresholds
Troubleshooting — Debugging resolution issues
Sources — How to create source identifier models