Entity Resolution Algorithm

How the Nexus entity resolution algorithm works — from source identifiers through edge creation, Jinja-unrolled traversal, and final entity tables.

Entity resolution (ER) is the process of determining that two or more identifiers refer to the same real-world entity — and merging them into a single resolved entity_id. It's how Nexus knows that john@company.com in Gmail, john@company.com in Stripe, and phone number 555-1234 from Notion all belong to the same person.

Not all entities need ER. Subscriptions, contracts, and other objects that exist in a single source system skip this entirely — see Entity Types for when to use ER vs non-ER registration. This document covers the ER pipeline for entity types configured with entity_resolution: true.


Pipeline Overview

Source Identifiers → Union → Edge Creation → Connected-Components Resolution → Entity Table → Participants
Step Model What Happens
0 *_entity_identifiers Source formats identifiers into standard schema
1 nexus_entity_identifiers All sources unioned into one table
2 nexus_entity_identifiers_edges Edges created between co-occurring identifiers
3 nexus_resolved_{type}_identifiers Jinja-unrolled connected-components resolution
4 nexus_entities Final entity table with pivoted traits
5 nexus_entity_participants Events linked to resolved entities

Each ER entity type (e.g., person, group) is resolved independently for performance and debuggability.


Step 0: Source Identifier Formatting

Each source creates an *_entity_identifiers model that extracts identifiers from events into a standard schema:

entity_identifier_id  -- Unique ID (ent_idfr_ prefix)
event_id              -- Links to the originating event
edge_id               -- Groups identifiers that belong together
entity_type           -- 'person', 'group', etc.
identifier_type       -- e.g., 'email', 'domain', 'phone'
identifier_value      -- The actual value
role                  -- Entity's role in the event (optional)
occurred_at           -- Event timestamp
source                -- Source system name

The edge_id is the key concept: identifiers sharing the same edge_id are assumed to belong to the same entity. Typically edge_id = event_id when each event involves a single entity. For events with multiple participants, use a composite like event_id || identifier_value to avoid linking unrelated entities.

Example — Gmail person identifiers:

SELECT
    {{ nexus.create_nexus_id('entity_identifier', ['event_id', 'email', "'person'", "'sender'"]) }}
        as entity_identifier_id,
    event_id,
    event_id as edge_id,
    'person' as entity_type,
    'email' as identifier_type,
    sender_email as identifier_value,
    'sender' as role,
    occurred_at,
    'gmail' as source
FROM {{ ref('gmail_message_events') }}
WHERE sender_email IS NOT NULL

Sources union their intermediate identifier models into a single {source}_entity_identifiers model:

-- stripe_entity_identifiers.sql
{{ dbt_utils.union_relations([
    ref('stripe_person_identifiers'),
    ref('stripe_group_identifiers')
]) }}

Step 1: Identifier Collection

The process_entity_identifiers() macro unions all enabled source identifier models into nexus_entity_identifiers — a single table containing every identifier from every source, tagged with entity_type.

This is the input to the ER algorithm. No resolution has happened yet — the same person might appear as dozens of separate rows with different identifiers from different sources.


Step 2: Edge Creation

Model: nexus_entity_identifiers_edges

Edges represent connections between identifiers that should resolve to the same entity. Two identifiers are connected when they share the same edge_id.

The create_identifier_edges() macro:

  1. Self-joins nexus_entity_identifiers on edge_id within each entity_type to find co-occurring identifier pairs
  2. Deduplicates using a surrogate key on identifier types and values — the same two identifiers connected by 10,000 events produce exactly one edge
  3. Optionally filters based on edge quality thresholds (see Edge Quality)
-- Conceptual logic (simplified)
SELECT DISTINCT
    a.identifier_type  as identifier_type_a,
    a.identifier_value as identifier_value_a,
    b.identifier_type  as identifier_type_b,
    b.identifier_value as identifier_value_b
FROM nexus_entity_identifiers a
JOIN nexus_entity_identifiers b
    ON a.edge_id = b.edge_id
    AND a.entity_type = b.entity_type
    AND (a.identifier_type != b.identifier_type
         OR a.identifier_value != b.identifier_value)

Why Deduplication Matters

Without deduplication, high-frequency entities create massive edge explosion. An entity with 26,000 events and 2 identifier types would generate 26,000 duplicate edges. Across all entities this becomes billions of redundant rows.

With surrogate-key deduplication, only unique identifier relationships are kept:

Metric Before After
Edges 1.8M duplicates 790 unique
Processing time Hours ~4 seconds
Reduction 99.96%

Step 3: Connected-Components Resolution

Model: nexus_resolved_{type}_identifiers (e.g., nexus_resolved_person_identifiers, nexus_resolved_group_identifiers)

This is the core of entity resolution. The resolve_identifiers() macro implements a depth-bounded connected-components algorithm.

How It's Rendered: Jinja-Unrolled Traversal

Because max_recursion is a dbt var, it's known at Jinja-compile time. We exploit that to unroll the traversal: render one CTE per hop using a Jinja for loop, instead of using SQL's WITH RECURSIVE primitive. Each level uses UNION (or UNION DISTINCT on BigQuery) which dedups (component, reachable) pairs — cycles are absorbed naturally with no path tracking required.

This is a real architectural choice, not just style. WITH RECURSIVE can't spill to disk on Snowflake (causing OOMs at scale) and has structural restrictions on BigQuery (can't be embedded in surrounding CTEs or UNION ALL clauses). Plain CTEs avoid both problems and use the warehouse's standard hash-join machinery, which spills gracefully when memory is tight. See Why not WITH RECURSIVE? below for the full rationale.

Phase A: Component Discovery

Every identifier starts as its own component. Each compile-time-unrolled CTE extends reachability one more hop:

WITH depth_0 AS (
    -- Base: every identifier is its own component
    SELECT DISTINCT
        identifier_type as component_identifier_type,
        identifier_value as component_identifier_value,
        identifier_type,
        identifier_value
    FROM nexus_entity_identifiers
    WHERE entity_type = '{type}'
)
{% for level in range(1, max_recursion + 1) %}
, depth_{{ level }} AS (
    -- Hop {{ level }}: extend reachability one more edge
    SELECT * FROM depth_{{ level - 1 }}
    UNION
    SELECT
        d.component_identifier_type,
        d.component_identifier_value,
        e.identifier_type_b,
        e.identifier_value_b
    FROM depth_{{ level - 1 }} d
    JOIN nexus_entity_identifiers_edges e
      ON d.identifier_type  = e.identifier_type_a
     AND d.identifier_value = e.identifier_value_a
)
{% endfor %}

For max_recursion: 3, this renders 4 CTEs (depth_0 through depth_3). The final depth_N is the set of all (component_identifier, reachable_identifier) pairs within N hops.

Phase B: Entity ID Assignment

Each identifier is assigned to the lexicographically first identifier in its connected component. A deterministic entity_id is generated from that component root:

SELECT
    identifier_type,
    identifier_value,
    {{ nexus.create_nexus_id('entity', [...]) }} as entity_id
FROM depth_{max_recursion}

Example — before and after resolution:

Before:
  email: john@company.com    → (no entity_id yet)
  phone: 555-1234            → (no entity_id yet)
  email: john@newcompany.com → (no entity_id yet)

After (all connected via edges):
  email: john@company.com    → entity_id: ent_a1b2c3...
  phone: 555-1234            → entity_id: ent_a1b2c3...
  email: john@newcompany.com → entity_id: ent_a1b2c3...

Recursion Depth

Configure with nexus.max_recursion in dbt_project.yml (default: 5). For most datasets, 3 is sufficient and significantly faster. Each level adds one more self-join over the edges table; deeper depths have proportionally higher cost.

vars:
  nexus:
    max_recursion: 3

Why not WITH RECURSIVE?

Earlier versions of this macro used SQL's WITH RECURSIVE primitive with per-row path-string concatenation for cycle detection. Three problems:

  1. Snowflake OOMs at scale. The recursive operator can't spill to disk gracefully. On graphs of ~10M+ identifiers, intermediate path-tracking state exceeds per-query memory and the query aborts with error 100298 (Recursive Join ran out of memory).
  2. BigQuery structural restrictions. WITH RECURSIVE in BigQuery must be top-level of the WITH clause, can't reference upstream CTEs in its recursive body, and can't be UNION ALL'd with non-recursive CTEs. This forced architectural decisions like splitting person/group resolution into separate models.
  3. Substring-collision bug in Snowflake cycle detection. The cycle check not contains(rc.path, type || ':' || value) did a substring match on a concatenated path string. Short identifier values (email=p, phone=512982) appeared as substrings inside longer path strings and incorrectly blocked legitimate traversals, stranding affected identifiers from their true connected components.

The Jinja-unrolled form fixes all three: regular hash joins spill cleanly, there are no structural restrictions on surrounding SQL, and UNION-based dedup is set-membership (no substring false positives).

The trade-off is rendered SQL length: at max_recursion=3 it's 4 CTE blocks instead of 1 recursive block. At higher depths the rendered SQL grows linearly. Compile time and warehouse SQL-length limits are not practical constraints at the scales we run.


Step 4: Entity Table

Model: nexus_entities

The finalize_entities() macro creates the final entity table by:

  1. Pivoting resolved pre-resolution traits (name, email, domain, etc.) from EAV format into columns — trait columns are discovered dynamically at compile time
  2. Pivoting computed traits (risk_tier, display_name, etc.) from nexus_computed_traits into additional columns — also discovered at compile time
  3. Computing timestamps: _created_at, _updated_at, _last_merged_at, first_interaction_at, last_interaction_at
  4. Combining ER entities with non-ER registered entities into a single table

The output has one row per entity with trait values (both pre-resolution and computed) as columns. See Entities for the full trait lifecycle.


Step 5: Event Participation

Model: nexus_entity_participants

The finalize_participants() macro links resolved entities back to events:

  1. Joins nexus_entity_identifiers with resolved identifiers to map each event's identifiers to their resolved entity_id
  2. Preserves the role from Step 0 (e.g., sender, recipient, contact)
  3. Deduplicates — if the same entity participates in the same event via multiple identifiers, only one participation record is kept per role

This is the bridge table that connects nexus_events to nexus_entities.


Algorithm Properties

Correctness

  • Transitivity: If A connects to B and B connects to C, all three get the same entity_id
  • Determinism: Same input always produces the same entity IDs
  • Completeness: Every source identifier is resolved

Performance Characteristics

Component Complexity Optimization
Identifier collection O(n) events Table materialization
Edge creation O(u) unique edges Surrogate-key deduplication
Traversal resolution O(u × d) Jinja-unrolled per-hop CTEs
Entity finalization O(g) entities Single-pass trait pivot
Participation O(n) events Efficient joins

Where n = events, u = unique edges, d = max traversal depth, g = unique entities.


Relationship to Non-ER Entities

Non-ER entity types (entity_resolution: false) skip Steps 2-3 entirely. They register directly via register_entities() and are combined with ER entities in Step 4. Their event participation is resolved by a simpler lookup against the registration model rather than the full ER pipeline.

See Entity Types for details.