Entity Resolution Algorithm
How the Nexus entity resolution algorithm works — from source identifiers through edge creation, recursive resolution, and final entity tables.
Entity resolution (ER) is the process of determining that two or more
identifiers refer to the same real-world entity — and merging them into a single
resolved entity_id. It's how Nexus knows that john@company.com in Gmail,
john@company.com in Stripe, and phone number 555-1234 from Notion all belong
to the same person.
Not all entities need ER. Subscriptions, contracts, and other objects that exist
in a single source system skip this entirely — see
Entity Types for when to use ER vs non-ER
registration. This document covers the ER pipeline for entity types configured
with entity_resolution: true.
Pipeline Overview
Source Identifiers → Union → Edge Creation → Recursive Resolution → Entity Table → Participants
| Step | Model | What Happens |
|---|---|---|
| 0 | *_entity_identifiers | Source formats identifiers into standard schema |
| 1 | nexus_entity_identifiers | All sources unioned into one table |
| 2 | nexus_entity_identifiers_edges | Edges created between co-occurring identifiers |
| 3 | nexus_resolved_{type}_identifiers | Recursive connected-components resolution |
| 4 | nexus_entities | Final entity table with pivoted traits |
| 5 | nexus_entity_participants | Events linked to resolved entities |
Each ER entity type (e.g., person, group) is resolved independently for
performance and debuggability.
Step 0: Source Identifier Formatting
Each source creates an *_entity_identifiers model that extracts identifiers
from events into a standard schema:
entity_identifier_id -- Unique ID (ent_idfr_ prefix)
event_id -- Links to the originating event
edge_id -- Groups identifiers that belong together
entity_type -- 'person', 'group', etc.
identifier_type -- e.g., 'email', 'domain', 'phone'
identifier_value -- The actual value
role -- Entity's role in the event (optional)
occurred_at -- Event timestamp
source -- Source system name
The edge_id is the key concept: identifiers sharing the same edge_id are
assumed to belong to the same entity. Typically edge_id = event_id when each
event involves a single entity. For events with multiple participants, use a
composite like event_id || identifier_value to avoid linking unrelated
entities.
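To make the edge_id rule concrete, here is a minimal Python sketch (not part of Nexus) that groups identifiers by edge_id under the two strategies described above. The sample rows and the group_by_edge_id helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (event_id, role, identifier_type, identifier_value).
# One email event with a sender and a recipient, who are different people.
rows = [
    ("evt_1", "sender", "email", "john@company.com"),
    ("evt_1", "recipient", "email", "mary@other.com"),
]


def group_by_edge_id(rows, edge_id_for):
    """Group identifiers by the edge_id computed for each row."""
    groups = defaultdict(set)
    for event_id, _role, id_type, id_value in rows:
        groups[edge_id_for(event_id, id_value)].add((id_type, id_value))
    return dict(groups)


# edge_id = event_id: both participants share one edge_id and would be linked.
naive = group_by_edge_id(rows, lambda event_id, value: event_id)

# edge_id = event_id || identifier_value: each participant gets its own edge_id.
composite = group_by_edge_id(rows, lambda event_id, value: f"{event_id}{value}")

print(len(naive))      # distinct edge_ids under the naive rule
print(len(composite))  # distinct edge_ids under the composite rule
```

With the naive rule the sender and recipient land in the same group, which is exactly the over-merge the composite edge_id is meant to prevent.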
Example — Gmail person identifiers:
SELECT
    {{ nexus.create_nexus_id('entity_identifier', ['event_id', 'sender_email', "'person'", "'sender'"]) }}
        as entity_identifier_id,
    event_id,
    event_id as edge_id,
    'person' as entity_type,
    'email' as identifier_type,
    sender_email as identifier_value,
    'sender' as role,
    occurred_at,
    'gmail' as source
FROM {{ ref('gmail_message_events') }}
WHERE sender_email IS NOT NULL
Sources union their intermediate identifier models into a single
{source}_entity_identifiers model:
-- stripe_entity_identifiers.sql
{{ dbt_utils.union_relations([
    ref('stripe_person_identifiers'),
    ref('stripe_group_identifiers')
]) }}
Step 1: Identifier Collection
The process_entity_identifiers() macro unions all enabled source identifier
models into nexus_entity_identifiers — a single table containing every
identifier from every source, tagged with entity_type.
This is the input to the ER algorithm. No resolution has happened yet — the same person might appear as dozens of separate rows with different identifiers from different sources.
Step 2: Edge Creation
Model: nexus_entity_identifiers_edges
Edges represent connections between identifiers that should resolve to the same
entity. Two identifiers are connected when they share the same edge_id.
The create_identifier_edges() macro:
- Self-joins nexus_entity_identifiers on edge_id within each entity_type to find co-occurring identifier pairs
- Deduplicates using a surrogate key on identifier types and values: the same two identifiers connected by 10,000 events produce exactly one edge
- Optionally filters based on edge quality thresholds (see Edge Quality)
-- Conceptual logic (simplified)
SELECT DISTINCT
    a.identifier_type as identifier_type_a,
    a.identifier_value as identifier_value_a,
    b.identifier_type as identifier_type_b,
    b.identifier_value as identifier_value_b
FROM nexus_entity_identifiers a
JOIN nexus_entity_identifiers b
    ON a.edge_id = b.edge_id
    AND a.entity_type = b.entity_type
    AND (a.identifier_type != b.identifier_type
        OR a.identifier_value != b.identifier_value)
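The same self-join and deduplication can be sketched in plain Python. This is an illustration of the conceptual query only, not the create_identifier_edges() macro itself; the create_edges function and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def create_edges(identifiers):
    """Build unique directed edges between co-occurring identifiers.

    identifiers: iterable of
        (edge_id, entity_type, identifier_type, identifier_value)
    """
    # Group identifiers that share an edge_id within one entity_type.
    by_group = defaultdict(set)
    for edge_id, entity_type, id_type, id_value in identifiers:
        by_group[(edge_id, entity_type)].add((id_type, id_value))

    # A set keeps one copy of each pair, mirroring SELECT DISTINCT.
    edges = set()
    for members in by_group.values():
        # permutations yields both (a, b) and (b, a), like the self-join.
        for a, b in permutations(sorted(members), 2):
            edges.add((a, b))
    return edges


# Hypothetical input: three events carrying the same email and phone, plus a
# group identifier that shares an edge_id with the first event.
identifiers = [
    ("evt_1", "person", "email", "john@company.com"),
    ("evt_1", "person", "phone", "555-1234"),
    ("evt_1", "group", "domain", "company.com"),
    ("evt_2", "person", "email", "john@company.com"),
    ("evt_2", "person", "phone", "555-1234"),
    ("evt_3", "person", "email", "john@company.com"),
    ("evt_3", "person", "phone", "555-1234"),
]
edges = create_edges(identifiers)
for edge in sorted(edges):
    print(edge)
```

Three events yield only one edge in each direction, and the group identifier is never linked to the person identifiers because the grouping key includes entity_type.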
Why Deduplication Matters
Without deduplication, high-frequency entities cause an edge explosion. An entity with 26,000 events and 2 identifier types would generate 26,000 copies of the same edge. Across all entities this can grow to billions of redundant rows.
With surrogate-key deduplication, only unique identifier relationships are kept:
| Metric | Before | After |
|---|---|---|
| Edges | 1.8M duplicates | 790 unique |
| Processing time | Hours | ~4 seconds |
| Reduction | — | 99.96% |
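The reduction figure follows directly from the two edge counts in the table. A quick check in Python, treating 1.8M as the rounded value 1,800,000:

```python
edges_before = 1_800_000  # "1.8M duplicates" from the table (rounded)
edges_after = 790         # unique edges after deduplication

# Fraction of edge rows removed by deduplication.
reduction = 1 - edges_after / edges_before
print(f"{reduction:.2%}")
```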
Step 3: Recursive Resolution
Model: nexus_resolved_{type}_identifiers (e.g.,
nexus_resolved_person_identifiers, nexus_resolved_group_identifiers)
This is the core of entity resolution. The resolve_identifiers() macro
implements a connected-components algorithm using a recursive CTE:
Phase A: Component Discovery
Every identifier starts as its own component. The recursive CTE walks edges to discover all identifiers reachable from each starting point:
WITH RECURSIVE recursive_components AS (
    -- Base: every identifier is its own component
    SELECT DISTINCT
        identifier_type as component_identifier_type,
        identifier_value as component_identifier_value,
        identifier_type,
        identifier_value,
        0 as recursion_level
    FROM nexus_entity_identifiers
    WHERE entity_type = '{type}'
    UNION ALL
    -- Recurse: walk to connected identifiers via edges
    SELECT
        rc.component_identifier_type,
        rc.component_identifier_value,
        e.identifier_type_b,
        e.identifier_value_b,
        rc.recursion_level + 1
    FROM recursive_components rc
    JOIN nexus_entity_identifiers_edges e
        ON rc.identifier_type = e.identifier_type_a
        AND rc.identifier_value = e.identifier_value_a
    WHERE rc.recursion_level < {max_recursion}
)
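For readers who find graph code easier to follow than a recursive CTE, the walk above can be sketched as a bounded breadth-first expansion in Python. This is an approximation under stated assumptions, not the resolve_identifiers() macro: discover_components is a hypothetical name, and unlike the UNION ALL version it skips identifiers it has already visited, which changes the row count but not the set of reachable identifiers.

```python
def discover_components(identifiers, edges, max_recursion):
    """Map each identifier to everything reachable within max_recursion hops.

    identifiers: iterable of (identifier_type, identifier_value)
    edges: iterable of directed pairs (identifier_a, identifier_b)
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)

    reachable = {}
    for start in identifiers:
        seen = {start}      # recursion level 0: the identifier itself
        frontier = {start}
        for _ in range(max_recursion):
            # One recursive step: follow edges out of the current frontier.
            frontier = {
                n for node in frontier for n in neighbors.get(node, ())
            } - seen
            if not frontier:
                break
            seen |= frontier
        reachable[start] = seen
    return reachable


john = ("email", "john@company.com")
phone = ("phone", "555-1234")
john_new = ("email", "john@newcompany.com")

# Edges are stored in both directions, as the edge model produces them.
edges = [(john, phone), (phone, john), (phone, john_new), (john_new, phone)]
reachable = discover_components([john, phone, john_new], edges, max_recursion=5)
print(sorted(reachable[john]))
```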
Phase B: Entity ID Assignment
Each identifier is assigned to the lexicographically first identifier in its
connected component. A deterministic entity_id is generated from that
component root:
SELECT
    identifier_type,
    identifier_value,
    {{ nexus.create_nexus_id('entity', [...]) }} as entity_id
FROM recursive_components
Example — before and after resolution:
Before:
email: john@company.com → (no entity_id yet)
phone: 555-1234 → (no entity_id yet)
email: john@newcompany.com → (no entity_id yet)
After (all connected via edges):
email: john@company.com → entity_id: ent_a1b2c3...
phone: 555-1234 → entity_id: ent_a1b2c3...
email: john@newcompany.com → entity_id: ent_a1b2c3...
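The assignment step can be sketched in Python as follows. The hashing scheme is an assumption made for illustration: the source does not say how nexus.create_nexus_id derives its value, so the MD5 digest and the "|" separator here are hypothetical, as is the assign_entity_ids name. Only the ent_ prefix and the "lexicographically first identifier" rule come from the text above.

```python
import hashlib


def assign_entity_ids(reachable):
    """Assign each identifier an entity_id derived from its component root.

    reachable: {identifier: set of identifiers in its component}, where an
    identifier is an (identifier_type, identifier_value) tuple.
    """
    entity_ids = {}
    for identifier, component in reachable.items():
        # Lexicographically first (type, value) tuple acts as the root.
        root = min(component)
        # Assumed hashing scheme, for illustration only.
        digest = hashlib.md5("|".join(root).encode("utf-8")).hexdigest()
        entity_ids[identifier] = f"ent_{digest}"
    return entity_ids


john = ("email", "john@company.com")
phone = ("phone", "555-1234")
john_new = ("email", "john@newcompany.com")
component = {john, phone, john_new}

entity_ids = assign_entity_ids({john: component, phone: component, john_new: component})
for identifier, entity_id in sorted(entity_ids.items()):
    print(identifier, entity_id)
```

Because the ID depends only on the root identifier, rerunning the function on the same input yields the same IDs, which is the determinism property the algorithm relies on.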
Recursion Depth
Configure with nexus.max_recursion in dbt_project.yml (default: 5). For most
datasets, 3 is sufficient and significantly faster. The recursive CTE stops when
no new connections are found or the depth limit is reached.
vars:
  nexus:
    max_recursion: 3
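The trade-off behind this setting can be illustrated with a small Python sketch. It assumes, as in the simplified CTE above, that each identifier takes as its root the lexicographically first identifier it can reach within the depth limit; the real macro may behave differently. The identifier names and helper functions are hypothetical.

```python
def reachable_within(start, neighbors, max_recursion):
    """Identifiers reachable from start in at most max_recursion hops."""
    seen, frontier = {start}, {start}
    for _ in range(max_recursion):
        frontier = {
            n for node in frontier for n in neighbors.get(node, ())
        } - seen
        seen |= frontier
    return seen


# A chain of five identifiers: id_a - id_b - id_c - id_d - id_e (4 hops long).
chain = ["id_a", "id_b", "id_c", "id_d", "id_e"]
neighbors = {}
for left, right in zip(chain, chain[1:]):
    neighbors.setdefault(left, set()).add(right)
    neighbors.setdefault(right, set()).add(left)


def roots(max_recursion):
    """Root per identifier: the smallest identifier it can reach."""
    return {
        node: min(reachable_within(node, neighbors, max_recursion))
        for node in chain
    }


roots_depth_3 = roots(3)
roots_depth_5 = roots(5)
print(roots_depth_3)
print(roots_depth_5)
```

Under these assumptions, a depth of 3 leaves id_e unable to reach id_a, so the chain splits across two roots, while a depth of 5 resolves all five identifiers to one. Chains this long are uncommon in practice, which is consistent with 3 being sufficient for most datasets.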
Step 4: Entity Table
Model: nexus_entities
The finalize_entities() macro creates the final entity table by:
- Pivoting resolved pre-resolution traits (name, email, domain, etc.) from EAV format into columns; trait columns are discovered dynamically at compile time
- Pivoting computed traits (risk_tier, display_name, etc.) from nexus_computed_traits into additional columns, also discovered at compile time
- Computing timestamps: _created_at, _updated_at, _last_merged_at, first_interaction_at, last_interaction_at
- Combining ER entities with non-ER registered entities into a single table
The output has one row per entity with trait values (both pre-resolution and computed) as columns. See Entities for the full trait lifecycle.
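As a rough picture of the EAV-to-columns pivot, consider the Python sketch below. It is not the finalize_entities() macro: the pivot_traits function and the sample rows are hypothetical, and the "most recent value wins" rule for conflicting trait values is an assumption, since the source does not state how conflicts are settled.

```python
def pivot_traits(trait_rows):
    """Pivot EAV trait rows into one dict per entity.

    trait_rows: list of (entity_id, trait_name, trait_value, occurred_at),
    with occurred_at as an ISO-8601 string so string comparison orders it.
    """
    # Discover the column set from the data, as the macro does at compile time.
    trait_names = sorted({name for _, name, _, _ in trait_rows})

    # Assumption: when a trait has several values, keep the most recent one.
    latest = {}
    for entity_id, name, value, occurred_at in trait_rows:
        key = (entity_id, name)
        if key not in latest or occurred_at > latest[key][0]:
            latest[key] = (occurred_at, value)

    entities = {}
    for (entity_id, name), (_, value) in latest.items():
        row = entities.setdefault(
            entity_id,
            {"entity_id": entity_id, **{n: None for n in trait_names}},
        )
        row[name] = value
    return list(entities.values())


trait_rows = [
    ("ent_1", "name", "John", "2024-01-01"),
    ("ent_1", "name", "John Smith", "2024-03-01"),
    ("ent_1", "email", "john@company.com", "2024-01-01"),
    ("ent_2", "name", "Mary", "2024-02-01"),
]
entity_table = pivot_traits(trait_rows)
for row in entity_table:
    print(row)
```

Entities missing a trait get None in that column, so every row has the same shape.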
Step 5: Event Participation
Model: nexus_entity_participants
The finalize_participants() macro links resolved entities back to events:
- Joins nexus_entity_identifiers with resolved identifiers to map each event's identifiers to their resolved entity_id
- Preserves the role from Step 0 (e.g., sender, recipient, contact)
- Deduplicates: if the same entity participates in the same event via multiple identifiers, only one participation record is kept per role
This is the bridge table that connects nexus_events to nexus_entities.
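The join and deduplication can be sketched in Python like this. It illustrates the three steps listed above rather than the finalize_participants() macro; build_participants and the sample data are hypothetical.

```python
def build_participants(identifier_rows, resolved):
    """Link events to resolved entities, one record per (event, entity, role).

    identifier_rows: iterable of
        (event_id, identifier_type, identifier_value, role)
    resolved: {(identifier_type, identifier_value): entity_id}
    """
    participants = set()
    for event_id, id_type, id_value, role in identifier_rows:
        entity_id = resolved.get((id_type, id_value))
        if entity_id is not None:
            # The set collapses repeats of the same event, entity and role.
            participants.add((event_id, entity_id, role))
    # role is optional, so sort with an empty-string fallback for None.
    return sorted(participants, key=lambda p: (p[0], p[1], p[2] or ""))


resolved = {
    ("email", "john@company.com"): "ent_john",
    ("phone", "555-1234"): "ent_john",
    ("email", "mary@other.com"): "ent_mary",
}
identifier_rows = [
    # The same person appears twice in evt_1, via email and via phone.
    ("evt_1", "email", "john@company.com", "contact"),
    ("evt_1", "phone", "555-1234", "contact"),
    ("evt_2", "email", "john@company.com", "sender"),
    ("evt_2", "email", "mary@other.com", "recipient"),
]
participants = build_participants(identifier_rows, resolved)
for record in participants:
    print(record)
```

The two evt_1 rows collapse into a single participation record because both identifiers resolve to the same entity with the same role.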
Algorithm Properties
Correctness
- Transitivity: If A connects to B and B connects to C, all three get the same entity_id
- Determinism: Same input always produces the same entity IDs
- Completeness: Every source identifier is resolved
Performance Characteristics
| Component | Complexity | Optimization |
|---|---|---|
| Identifier collection | O(n) events | Table materialization |
| Edge creation | O(u) unique edges | Surrogate-key deduplication |
| Recursive resolution | O(u × d) | Bounded recursion depth |
| Entity finalization | O(g) entities | Single-pass trait pivot |
| Participation | O(n) events | Efficient joins |
Where n = events, u = unique edges, d = max recursion depth, g = unique entities.
Relationship to Non-ER Entities
Non-ER entity types (entity_resolution: false) skip Steps 2-3 entirely. They
register directly via register_entities() and are combined with ER entities in
Step 4. Their event participation is resolved by a simpler lookup against the
registration model rather than the full ER pipeline.
See Entity Types for details.
Related Documentation
- Entity Types — When to use ER vs non-ER
- Edge Quality — Preventing over-merge with edge quality thresholds
- Troubleshooting — Debugging resolution issues
- Sources — How to create source identifier models