Entity Resolution Algorithm
How the Nexus entity resolution algorithm works — from source identifiers through edge creation, Jinja-unrolled traversal, and final entity tables.
Entity resolution (ER) is the process of determining that two or more
identifiers refer to the same real-world entity — and merging them into a single
resolved entity_id. It's how Nexus knows that john@company.com in Gmail,
john@company.com in Stripe, and phone number 555-1234 from Notion all belong
to the same person.
Not all entities need ER. Subscriptions, contracts, and other objects that exist
in a single source system skip this entirely — see
Entity Types for when to use ER vs non-ER
registration. This document covers the ER pipeline for entity types configured
with entity_resolution: true.
Pipeline Overview
Source Identifiers → Union → Edge Creation → Connected-Components Resolution → Entity Table → Participants
| Step | Model | What Happens |
|---|---|---|
| 0 | *_entity_identifiers |
Source formats identifiers into standard schema |
| 1 | nexus_entity_identifiers |
All sources unioned into one table |
| 2 | nexus_entity_identifiers_edges |
Edges created between co-occurring identifiers |
| 3 | nexus_resolved_{type}_identifiers |
Jinja-unrolled connected-components resolution |
| 4 | nexus_entities |
Final entity table with pivoted traits |
| 5 | nexus_entity_participants |
Events linked to resolved entities |
Each ER entity type (e.g., person, group) is resolved independently for
performance and debuggability.
Step 0: Source Identifier Formatting
Each source creates an *_entity_identifiers model that extracts identifiers
from events into a standard schema:
entity_identifier_id -- Unique ID (ent_idfr_ prefix)
event_id -- Links to the originating event
edge_id -- Groups identifiers that belong together
entity_type -- 'person', 'group', etc.
identifier_type -- e.g., 'email', 'domain', 'phone'
identifier_value -- The actual value
role -- Entity's role in the event (optional)
occurred_at -- Event timestamp
source -- Source system name
The edge_id is the key concept: identifiers sharing the same edge_id are
assumed to belong to the same entity. Typically edge_id = event_id when each
event involves a single entity. For events with multiple participants, use a
composite like event_id || identifier_value to avoid linking unrelated
entities.
Example — Gmail person identifiers:
SELECT
{{ nexus.create_nexus_id('entity_identifier', ['event_id', 'email', "'person'", "'sender'"]) }}
as entity_identifier_id,
event_id,
event_id as edge_id,
'person' as entity_type,
'email' as identifier_type,
sender_email as identifier_value,
'sender' as role,
occurred_at,
'gmail' as source
FROM {{ ref('gmail_message_events') }}
WHERE sender_email IS NOT NULL
Sources union their intermediate identifier models into a single
{source}_entity_identifiers model:
-- stripe_entity_identifiers.sql
{{ dbt_utils.union_relations([
ref('stripe_person_identifiers'),
ref('stripe_group_identifiers')
]) }}
Step 1: Identifier Collection
The process_entity_identifiers() macro unions all enabled source identifier
models into nexus_entity_identifiers — a single table containing every
identifier from every source, tagged with entity_type.
This is the input to the ER algorithm. No resolution has happened yet — the same person might appear as dozens of separate rows with different identifiers from different sources.
Step 2: Edge Creation
Model: nexus_entity_identifiers_edges
Edges represent connections between identifiers that should resolve to the same
entity. Two identifiers are connected when they share the same edge_id.
The create_identifier_edges() macro:
- Self-joins
nexus_entity_identifiersonedge_idwithin eachentity_typeto find co-occurring identifier pairs - Deduplicates using a surrogate key on identifier types and values — the same two identifiers connected by 10,000 events produce exactly one edge
- Optionally filters based on edge quality thresholds (see Edge Quality)
-- Conceptual logic (simplified)
SELECT DISTINCT
a.identifier_type as identifier_type_a,
a.identifier_value as identifier_value_a,
b.identifier_type as identifier_type_b,
b.identifier_value as identifier_value_b
FROM nexus_entity_identifiers a
JOIN nexus_entity_identifiers b
ON a.edge_id = b.edge_id
AND a.entity_type = b.entity_type
AND (a.identifier_type != b.identifier_type
OR a.identifier_value != b.identifier_value)
Why Deduplication Matters
Without deduplication, high-frequency entities create massive edge explosion. An entity with 26,000 events and 2 identifier types would generate 26,000 duplicate edges. Across all entities this becomes billions of redundant rows.
With surrogate-key deduplication, only unique identifier relationships are kept:
| Metric | Before | After |
|---|---|---|
| Edges | 1.8M duplicates | 790 unique |
| Processing time | Hours | ~4 seconds |
| Reduction | — | 99.96% |
Step 3: Connected-Components Resolution
Model: nexus_resolved_{type}_identifiers (e.g.,
nexus_resolved_person_identifiers, nexus_resolved_group_identifiers)
This is the core of entity resolution. The resolve_identifiers() macro
implements a depth-bounded connected-components algorithm.
How It's Rendered: Jinja-Unrolled Traversal
Because max_recursion is a dbt var, it's known at Jinja-compile time. We
exploit that to unroll the traversal: render one CTE per hop using a
Jinja for loop, instead of using SQL's WITH RECURSIVE primitive. Each
level uses UNION (or UNION DISTINCT on BigQuery) which dedups
(component, reachable) pairs — cycles are absorbed naturally with no path
tracking required.
This is a real architectural choice, not just style. WITH RECURSIVE can't
spill to disk on Snowflake (causing OOMs at scale) and has structural
restrictions on BigQuery (can't be embedded in surrounding CTEs or UNION ALL
clauses). Plain CTEs avoid both problems and use the warehouse's standard
hash-join machinery, which spills gracefully when memory is tight. See
Why not WITH RECURSIVE? below for the full
rationale.
Phase A: Component Discovery
Every identifier starts as its own component. Each compile-time-unrolled CTE extends reachability one more hop:
WITH depth_0 AS (
-- Base: every identifier is its own component
SELECT DISTINCT
identifier_type as component_identifier_type,
identifier_value as component_identifier_value,
identifier_type,
identifier_value
FROM nexus_entity_identifiers
WHERE entity_type = '{type}'
)
{% for level in range(1, max_recursion + 1) %}
, depth_{{ level }} AS (
-- Hop {{ level }}: extend reachability one more edge
SELECT * FROM depth_{{ level - 1 }}
UNION
SELECT
d.component_identifier_type,
d.component_identifier_value,
e.identifier_type_b,
e.identifier_value_b
FROM depth_{{ level - 1 }} d
JOIN nexus_entity_identifiers_edges e
ON d.identifier_type = e.identifier_type_a
AND d.identifier_value = e.identifier_value_a
)
{% endfor %}
For max_recursion: 3, this renders 4 CTEs (depth_0 through depth_3).
The final depth_N is the set of all (component_identifier, reachable_identifier)
pairs within N hops.
Phase B: Entity ID Assignment
Each identifier is assigned to the lexicographically first identifier in its
connected component. A deterministic entity_id is generated from that
component root:
SELECT
identifier_type,
identifier_value,
{{ nexus.create_nexus_id('entity', [...]) }} as entity_id
FROM depth_{max_recursion}
Example — before and after resolution:
Before:
email: john@company.com → (no entity_id yet)
phone: 555-1234 → (no entity_id yet)
email: john@newcompany.com → (no entity_id yet)
After (all connected via edges):
email: john@company.com → entity_id: ent_a1b2c3...
phone: 555-1234 → entity_id: ent_a1b2c3...
email: john@newcompany.com → entity_id: ent_a1b2c3...
Recursion Depth
Configure with nexus.max_recursion in dbt_project.yml (default: 5). For most
datasets, 3 is sufficient and significantly faster. Each level adds one more
self-join over the edges table; deeper depths have proportionally higher cost.
vars:
nexus:
max_recursion: 3
Why not WITH RECURSIVE?
Earlier versions of this macro used SQL's WITH RECURSIVE primitive with
per-row path-string concatenation for cycle detection. Three problems:
- Snowflake OOMs at scale. The recursive operator can't spill to disk gracefully. On graphs of ~10M+ identifiers, intermediate path-tracking state exceeds per-query memory and the query aborts with error 100298 (Recursive Join ran out of memory).
- BigQuery structural restrictions.
WITH RECURSIVEin BigQuery must be top-level of theWITHclause, can't reference upstream CTEs in its recursive body, and can't beUNION ALL'd with non-recursive CTEs. This forced architectural decisions like splitting person/group resolution into separate models. - Substring-collision bug in Snowflake cycle detection. The cycle
check
not contains(rc.path, type || ':' || value)did a substring match on a concatenated path string. Short identifier values (email=p,phone=512982) appeared as substrings inside longer path strings and incorrectly blocked legitimate traversals, stranding affected identifiers from their true connected components.
The Jinja-unrolled form fixes all three: regular hash joins spill cleanly,
there are no structural restrictions on surrounding SQL, and UNION-based
dedup is set-membership (no substring false positives).
The trade-off is rendered SQL length: at max_recursion=3 it's 4 CTE
blocks instead of 1 recursive block. At higher depths the rendered SQL
grows linearly. Compile time and warehouse SQL-length limits are not
practical constraints at the scales we run.
Step 4: Entity Table
Model: nexus_entities
The finalize_entities() macro creates the final entity table by:
- Pivoting resolved pre-resolution traits (name, email, domain, etc.) from EAV format into columns — trait columns are discovered dynamically at compile time
- Pivoting computed traits (risk_tier, display_name, etc.) from
nexus_computed_traitsinto additional columns — also discovered at compile time - Computing timestamps:
_created_at,_updated_at,_last_merged_at,first_interaction_at,last_interaction_at - Combining ER entities with non-ER registered entities into a single table
The output has one row per entity with trait values (both pre-resolution and computed) as columns. See Entities for the full trait lifecycle.
Step 5: Event Participation
Model: nexus_entity_participants
The finalize_participants() macro links resolved entities back to events:
- Joins
nexus_entity_identifierswith resolved identifiers to map each event's identifiers to their resolvedentity_id - Preserves the
rolefrom Step 0 (e.g.,sender,recipient,contact) - Deduplicates — if the same entity participates in the same event via multiple identifiers, only one participation record is kept per role
This is the bridge table that connects nexus_events to nexus_entities.
Algorithm Properties
Correctness
- Transitivity: If A connects to B and B connects to C, all three get the
same
entity_id - Determinism: Same input always produces the same entity IDs
- Completeness: Every source identifier is resolved
Performance Characteristics
| Component | Complexity | Optimization |
|---|---|---|
| Identifier collection | O(n) events | Table materialization |
| Edge creation | O(u) unique edges | Surrogate-key deduplication |
| Traversal resolution | O(u × d) | Jinja-unrolled per-hop CTEs |
| Entity finalization | O(g) entities | Single-pass trait pivot |
| Participation | O(n) events | Efficient joins |
Where n = events, u = unique edges, d = max traversal depth, g = unique entities.
Relationship to Non-ER Entities
Non-ER entity types (entity_resolution: false) skip Steps 2-3 entirely. They
register directly via register_entities() and are combined with ER entities in
Step 4. Their event participation is resolved by a simpler lookup against the
registration model rather than the full ER pipeline.
See Entity Types for details.
Related Documentation
- Entity Types — When to use ER vs non-ER
- Edge Quality — Preventing over-merge with edge quality thresholds
- Troubleshooting — Debugging resolution issues
- Sources — How to create source identifier models