Entity Resolution Algorithm

How the Nexus entity resolution algorithm works — from source identifiers through edge creation, recursive resolution, and final entity tables.

Entity resolution (ER) is the process of determining that two or more identifiers refer to the same real-world entity — and merging them into a single resolved entity_id. It's how Nexus knows that john@company.com in Gmail, john@company.com in Stripe, and phone number 555-1234 from Notion all belong to the same person.

Not all entities need ER. Subscriptions, contracts, and other objects that exist in a single source system skip this entirely — see Entity Types for when to use ER vs non-ER registration. This document covers the ER pipeline for entity types configured with entity_resolution: true.


Pipeline Overview

Source Identifiers → Union → Edge Creation → Recursive Resolution → Entity Table → Participants

Step  Model                              What Happens
0     *_entity_identifiers               Source formats identifiers into standard schema
1     nexus_entity_identifiers           All sources unioned into one table
2     nexus_entity_identifiers_edges     Edges created between co-occurring identifiers
3     nexus_resolved_{type}_identifiers  Recursive connected-components resolution
4     nexus_entities                     Final entity table with pivoted traits
5     nexus_entity_participants          Events linked to resolved entities

Each ER entity type (e.g., person, group) is resolved independently for performance and debuggability.


Step 0: Source Identifier Formatting

Each source creates an *_entity_identifiers model that extracts identifiers from events into a standard schema:

entity_identifier_id  -- Unique ID (ent_idfr_ prefix)
event_id              -- Links to the originating event
edge_id               -- Groups identifiers that belong together
entity_type           -- 'person', 'group', etc.
identifier_type       -- e.g., 'email', 'domain', 'phone'
identifier_value      -- The actual value
role                  -- Entity's role in the event (optional)
occurred_at           -- Event timestamp
source                -- Source system name

The edge_id is the key concept: identifiers sharing the same edge_id are assumed to belong to the same entity. Typically edge_id = event_id when each event involves a single entity. For events with multiple participants, use a composite like event_id || identifier_value to avoid linking unrelated entities.
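
To see why the choice of edge_id matters, here is a minimal Python sketch. It is not Nexus code: the event rows and the group_by_edge helper are hypothetical, and it only mimics the grouping that edge_id implies for a two-participant event.

```python
from collections import defaultdict

# Hypothetical rows: one email event with a sender and a recipient.
rows = [
    {"event_id": "evt_1", "identifier_value": "alice@a.com", "role": "sender"},
    {"event_id": "evt_1", "identifier_value": "bob@b.com", "role": "recipient"},
]


def group_by_edge(rows, edge_id_fn):
    """Group identifier values by the edge_id computed for each row."""
    groups = defaultdict(set)
    for row in rows:
        groups[edge_id_fn(row)].add(row["identifier_value"])
    return dict(groups)


# edge_id = event_id: both participants share one edge_id and would be linked.
shared = group_by_edge(rows, lambda r: r["event_id"])

# edge_id = event_id || identifier_value: each participant gets its own edge_id.
separate = group_by_edge(rows, lambda r: r["event_id"] + r["identifier_value"])
```

With the plain event_id, Alice and Bob land in the same group and would resolve to one person. With the composite key, each stays in its own group.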

Example — Gmail person identifiers:

SELECT
    {{ nexus.create_nexus_id('entity_identifier', ['event_id', 'sender_email', "'person'", "'sender'"]) }}
        as entity_identifier_id,
    event_id,
    event_id as edge_id,
    'person' as entity_type,
    'email' as identifier_type,
    sender_email as identifier_value,
    'sender' as role,
    occurred_at,
    'gmail' as source
FROM {{ ref('gmail_message_events') }}
WHERE sender_email IS NOT NULL

Sources union their intermediate identifier models into a single {source}_entity_identifiers model:

-- stripe_entity_identifiers.sql
{{ dbt_utils.union_relations([
    ref('stripe_person_identifiers'),
    ref('stripe_group_identifiers')
]) }}

Step 1: Identifier Collection

The process_entity_identifiers() macro unions all enabled source identifier models into nexus_entity_identifiers — a single table containing every identifier from every source, tagged with entity_type.

This is the input to the ER algorithm. No resolution has happened yet — the same person might appear as dozens of separate rows with different identifiers from different sources.


Step 2: Edge Creation

Model: nexus_entity_identifiers_edges

Edges represent connections between identifiers that should resolve to the same entity. Two identifiers are connected when they share the same edge_id.

The create_identifier_edges() macro:

  1. Self-joins nexus_entity_identifiers on edge_id within each entity_type to find co-occurring identifier pairs
  2. Deduplicates using a surrogate key on identifier types and values — the same two identifiers connected by 10,000 events produce exactly one edge
  3. Optionally filters based on edge quality thresholds (see Edge Quality)

-- Conceptual logic (simplified)
SELECT DISTINCT
    a.identifier_type  as identifier_type_a,
    a.identifier_value as identifier_value_a,
    b.identifier_type  as identifier_type_b,
    b.identifier_value as identifier_value_b
FROM nexus_entity_identifiers a
JOIN nexus_entity_identifiers b
    ON a.edge_id = b.edge_id
    AND a.entity_type = b.entity_type
    AND (a.identifier_type != b.identifier_type
         OR a.identifier_value != b.identifier_value)
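
The same self-join and deduplication can be sketched in plain Python. This is an illustration under assumptions, not the create_identifier_edges() implementation: the sample rows and the create_edges helper are hypothetical, and a set stands in for the surrogate-key deduplication.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical rows: (edge_id, entity_type, identifier_type, identifier_value)
identifiers = [
    ("evt_1", "person", "email", "john@company.com"),
    ("evt_1", "person", "phone", "555-1234"),
    ("evt_2", "person", "email", "john@company.com"),
    ("evt_2", "person", "phone", "555-1234"),
    ("evt_3", "person", "email", "jane@company.com"),
]


def create_edges(rows):
    """Return the unique directed identifier pairs that share an edge_id."""
    by_edge = defaultdict(set)
    for edge_id, entity_type, id_type, id_value in rows:
        by_edge[(edge_id, entity_type)].add((id_type, id_value))
    edges = set()  # the set drops repeats of the same pair across events
    for members in by_edge.values():
        # permutations yields both (a, b) and (b, a), like the self-join above
        edges.update(permutations(sorted(members), 2))
    return edges


edges = create_edges(identifiers)
```

Here evt_1 and evt_2 connect the same two identifiers, so they contribute a single pair (in both directions), and evt_3 has only one identifier and contributes no edge.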

Why Deduplication Matters

Without deduplication, high-frequency entities cause an edge explosion. An entity with 26,000 events and 2 identifier types would generate 26,000 copies of the same edge, and across all entities this can grow to billions of redundant rows.

With surrogate-key deduplication, only unique identifier relationships are kept:

Metric           Before           After
Edges            1.8M duplicates  790 unique
Processing time  Hours            ~4 seconds
Reduction                         99.96%
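
The reduction figure follows directly from the two edge counts. A quick check, treating the rounded 1.8M as exactly 1,800,000 (an assumption):

```python
raw_edges = 1_800_000   # "1.8M duplicates", taken as an exact count for illustration
unique_edges = 790

# Share of rows removed by deduplication.
reduction = 1 - unique_edges / raw_edges
print(f"{reduction:.2%}")  # prints 99.96%
```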

Step 3: Recursive Resolution

Model: nexus_resolved_{type}_identifiers (e.g., nexus_resolved_person_identifiers, nexus_resolved_group_identifiers)

This is the core of entity resolution. The resolve_identifiers() macro implements a connected-components algorithm using a recursive CTE:

Phase A: Component Discovery

Every identifier starts as its own component. The recursive CTE walks edges to discover all identifiers reachable from each starting point:

WITH RECURSIVE recursive_components AS (
    -- Base: every identifier is its own component
    SELECT DISTINCT
        identifier_type as component_identifier_type,
        identifier_value as component_identifier_value,
        identifier_type,
        identifier_value,
        0 as recursion_level
    FROM nexus_entity_identifiers
    WHERE entity_type = '{type}'

    UNION ALL

    -- Recurse: walk to connected identifiers via edges
    SELECT
        rc.component_identifier_type,
        rc.component_identifier_value,
        e.identifier_type_b,
        e.identifier_value_b,
        rc.recursion_level + 1
    FROM recursive_components rc
    JOIN nexus_entity_identifiers_edges e
        ON rc.identifier_type = e.identifier_type_a
        AND rc.identifier_value = e.identifier_value_a
    WHERE rc.recursion_level < {max_recursion}
)
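
The walk performed by the recursive CTE can be sketched as a depth-bounded breadth-first search. This is a simplified illustration, not the resolve_identifiers() macro: the sample data and the discover_components helper are hypothetical, and the sketch tracks visited identifiers explicitly, which the simplified CTE above does not show.

```python
# Hypothetical directed edges, present in both directions as in Step 2.
edges = {
    ("email:john@company.com", "phone:555-1234"),
    ("phone:555-1234", "email:john@company.com"),
    ("phone:555-1234", "email:john@newcompany.com"),
    ("email:john@newcompany.com", "phone:555-1234"),
}
identifiers = {
    "email:john@company.com",
    "phone:555-1234",
    "email:john@newcompany.com",
    "email:jane@company.com",
}


def discover_components(identifiers, edges, max_recursion=5):
    """Map each identifier to everything reachable within max_recursion hops."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
    reachable = {}
    for start in identifiers:
        seen = {start}       # base case: every identifier reaches itself
        frontier = {start}
        for _ in range(max_recursion):
            frontier = {n for f in frontier for n in neighbours.get(f, ())} - seen
            if not frontier:  # no new connections, stop early
                break
            seen |= frontier
        reachable[start] = seen
    return reachable


components = discover_components(identifiers, edges)
```

The three john identifiers reach one another through the shared phone number, while jane@company.com has no edges and stays alone.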

Phase B: Entity ID Assignment

Each identifier is assigned to the lexicographically first identifier in its connected component. A deterministic entity_id is generated from that component root:

SELECT
    identifier_type,
    identifier_value,
    {{ nexus.create_nexus_id('entity', [...]) }} as entity_id
FROM recursive_components

Example — before and after resolution:

Before:
  email: john@company.com    → (no entity_id yet)
  phone: 555-1234            → (no entity_id yet)
  email: john@newcompany.com → (no entity_id yet)

After (all connected via edges):
  email: john@company.com    → entity_id: ent_a1b2c3...
  phone: 555-1234            → entity_id: ent_a1b2c3...
  email: john@newcompany.com → entity_id: ent_a1b2c3...
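
A minimal sketch of the assignment step, assuming component discovery has already produced the full set of identifiers reachable from each identifier. The MD5 hash is a stand-in chosen for illustration; how create_nexus_id actually derives the ID is not described here, and the assign_entity_ids helper is hypothetical.

```python
import hashlib

john = {
    ("email", "john@company.com"),
    ("phone", "555-1234"),
    ("email", "john@newcompany.com"),
}
jane = {("email", "jane@company.com")}

# Hypothetical Phase A output: identifier -> identifiers in its component.
reachable = {identifier: john for identifier in john}
reachable.update({identifier: jane for identifier in jane})


def assign_entity_ids(reachable):
    """Give every identifier an ID derived from its component's first member."""
    entity_ids = {}
    for identifier, members in reachable.items():
        root = min(members)  # lexicographically first (type, value) pair
        # Assumed hashing scheme, for illustration only.
        digest = hashlib.md5("|".join(root).encode("utf-8")).hexdigest()
        entity_ids[identifier] = "ent_" + digest
    return entity_ids


entity_ids = assign_entity_ids(reachable)
```

Because the ID depends only on the component root, every identifier in the same component receives the same entity_id, and rerunning on the same input reproduces it.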

Recursion Depth

Configure with nexus.max_recursion in dbt_project.yml (default: 5). For most datasets, 3 is sufficient and significantly faster. The recursive CTE stops when no new connections are found or the depth limit is reached.

vars:
  nexus:
    max_recursion: 3
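
The trade-off behind the depth limit can be seen on a chain of identifiers. The sketch below is illustrative only: reachable_from is a hypothetical helper and the chain is made-up data. It shows that an identifier more hops away than max_recursion is not reached from the starting point.

```python
def reachable_from(start, edges, max_recursion):
    """Identifiers reachable from start within max_recursion hops (undirected)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, frontier = {start}, {start}
    for _ in range(max_recursion):
        frontier = {n for f in frontier for n in neighbours.get(f, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


# A chain a - b - c - d - e: "e" is four hops away from "a".
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

shallow = reachable_from("a", chain, max_recursion=3)
deep = reachable_from("a", chain, max_recursion=5)
```

With a limit of 3 the walk from "a" stops at "d"; with 5 it also reaches "e". Lower limits are faster but only safe when identifier chains in the data are short.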

Step 4: Entity Table

Model: nexus_entities

The finalize_entities() macro creates the final entity table by:

  1. Pivoting pre-resolution traits (name, email, domain, etc.), now attached to their resolved entities, from EAV format into columns. Trait columns are discovered dynamically at compile time
  2. Pivoting computed traits (risk_tier, display_name, etc.) from nexus_computed_traits into additional columns — also discovered at compile time
  3. Computing timestamps: _created_at, _updated_at, _last_merged_at, first_interaction_at, last_interaction_at
  4. Combining ER entities with non-ER registered entities into a single table

The output has one row per entity with trait values (both pre-resolution and computed) as columns. See Entities for the full trait lifecycle.
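
The EAV-to-columns pivot can be illustrated with a short Python sketch. It is not the finalize_entities() macro: the sample rows and the pivot_traits helper are hypothetical, and it does not model how Nexus chooses between multiple values for the same trait (here the last value simply wins).

```python
from collections import defaultdict

# Hypothetical resolved traits in EAV form: (entity_id, trait_name, trait_value)
traits = [
    ("ent_1", "name", "John"),
    ("ent_1", "email", "john@company.com"),
    ("ent_2", "name", "Jane"),
]


def pivot_traits(rows):
    """Pivot EAV rows into one dict per entity, one key per trait seen."""
    # Columns are discovered from the data rather than hard-coded.
    columns = sorted({name for _, name, _ in rows})
    entities = defaultdict(lambda: dict.fromkeys(columns))
    for entity_id, name, value in rows:
        entities[entity_id][name] = value
    return dict(entities)


table = pivot_traits(traits)
```

Each entity ends up with the same set of columns, and traits it never had are left as None.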


Step 5: Event Participation

Model: nexus_entity_participants

The finalize_participants() macro links resolved entities back to events:

  1. Joins nexus_entity_identifiers with resolved identifiers to map each event's identifiers to their resolved entity_id
  2. Preserves the role from Step 0 (e.g., sender, recipient, contact)
  3. Deduplicates — if the same entity participates in the same event via multiple identifiers, only one participation record is kept per role

This is the bridge table that connects nexus_events to nexus_entities.
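
The three steps above can be sketched as a lookup followed by deduplication. This is an illustration under assumptions, not the finalize_participants() macro: the input rows, the resolved mapping, and the build_participants helper are hypothetical.

```python
# Hypothetical identifier rows: (event_id, identifier_type, identifier_value, role)
identifier_rows = [
    ("evt_1", "email", "john@company.com", "sender"),
    ("evt_1", "phone", "555-1234", "sender"),
    ("evt_1", "email", "jane@company.com", "recipient"),
]

# Hypothetical Step 3 output: identifier -> resolved entity_id
resolved = {
    ("email", "john@company.com"): "ent_john",
    ("phone", "555-1234"): "ent_john",
    ("email", "jane@company.com"): "ent_jane",
}


def build_participants(identifier_rows, resolved):
    """Return unique (event_id, entity_id, role) participation records."""
    participants = set()  # the set keeps one record per event, entity and role
    for event_id, id_type, id_value, role in identifier_rows:
        entity_id = resolved.get((id_type, id_value))
        if entity_id is not None:
            participants.add((event_id, entity_id, role))
    return sorted(participants)


participants = build_participants(identifier_rows, resolved)
```

John appears in evt_1 through both an email and a phone number, but only one sender record is kept for him.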


Algorithm Properties

Correctness

  • Transitivity: If A connects to B and B connects to C, all three get the same entity_id
  • Determinism: Same input always produces the same entity IDs
  • Completeness: Every source identifier is resolved
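
Transitivity and determinism can be checked on a toy example. The sketch below uses union-find rather than the recursive CTE described in Step 3, purely because it is compact. The resolve helper and the sample edges are hypothetical.

```python
def resolve(edges):
    """Union-find: map each identifier to the smallest member of its component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller root so the result does not depend on input order.
            low, high = sorted((root_a, root_b))
            parent[high] = low
    return {x: find(x) for x in list(parent)}


edges = [("A", "B"), ("B", "C")]
forward = resolve(edges)
backward = resolve(list(reversed(edges)))
```

A connects to B and B connects to C, so all three map to the same root, and reversing the input order gives the same mapping.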

Performance Characteristics

Component              Complexity         Optimization
Identifier collection  O(n) events        Table materialization
Edge creation          O(u) unique edges  Surrogate-key deduplication
Recursive resolution   O(u × d)           Bounded recursion depth
Entity finalization    O(g) entities      Single-pass trait pivot
Participation          O(n) events        Efficient joins

Where n = events, u = unique edges, d = max recursion depth, g = unique entities.


Relationship to Non-ER Entities

Non-ER entity types (entity_resolution: false) skip Steps 2-3 entirely. They register directly via register_entities() and are combined with ER entities in Step 4. Their event participation is resolved by a simpler lookup against the registration model rather than the full ER pipeline.

See Entity Types for details.