Nexus Data

Problem Statement

Identity resolution performance and correctness are heavily impacted by poor-quality edges. A single bad identifier that's connected to many other identifiers can cascade through transitive closure, causing:

Massive over-merging: Hundreds or thousands of identifiers incorrectly resolved to the same person
Performance degradation: Exponential blow-up in recursive CTE execution
Late detection: Issues only discovered after expensive resolution runs

Example: A single GHL contact ID appeared in 105 different Segment events with 103 different email addresses. Through transitive closure, this caused 738 identifiers to incorrectly merge into one person.

Current State

Quality issues are only detected after identity resolution completes, through:

Post-resolution diagnostic queries
Manual investigation of suspicious persons
Performance monitoring (slow runs)

This is too late—the expensive recursive CTE has already run and produced incorrect results.

Proposed Solution

Add edge quality checks at edge creation time in nexus_entity_identifiers_edges.sql to flag problematic edges before they enter the resolution process.

Distribution Analysis

Create a histogram showing how many unique identifier_values each identifier is connected to:

-- Flag identifiers with suspiciously high connection counts
with edge_distribution as (
    select
        entity_type_a,
        identifier_type_a,
        identifier_value_a,
        count(distinct identifier_value_b) as unique_connections
    from {{ ref('nexus_entity_identifiers_edges') }}
    group by entity_type_a, identifier_type_a, identifier_value_a
)

select * from edge_distribution
where unique_connections > {{ var('nexus_max_connections_threshold', 10) }}
order by unique_connections desc

Thresholds

Most legitimate identifiers connect to 2-3 others (e.g., email ↔ phone ↔ patient_id). Connections >10 are suspicious:

Normal: 1-5 connections
Warning: 6-10 connections
Error: >10 connections (block edge creation or flag for review)

Implementation Options

Option 1: Blocking Check (Recommended for First Pass)

Create a dbt test that fails if any identifier exceeds thresholds:

-- tests/test_edge_quality_thresholds.sql
select
    entity_type_a,
    identifier_type_a,
    identifier_value_a,
    unique_connections
from {{ ref('nexus_entity_identifiers_edges_distribution') }}
where unique_connections > 10

Config in dbt_project.yml:

tests:
  gameday:
    test_edge_quality_thresholds:
      +severity: error

Option 2: Non-Blocking Diagnostic Model

Create an analysis model that flags problematic edges for review:

-- models/analysis/edge_quality_check.sql
{{ config(materialized='table', tags=['analysis']) }}

with edge_distribution as (
    select
        entity_type_a,
        identifier_type_a,
        identifier_value_a,
        count(distinct identifier_value_b) as unique_connections,
        listagg(distinct identifier_type_b, ', ') as connected_types
    from {{ ref('nexus_entity_identifiers_edges') }}
    group by entity_type_a, identifier_type_a, identifier_value_a
)

select
    entity_type_a,
    identifier_type_a,
    identifier_value_a,
    unique_connections,
    connected_types,
    case
        when unique_connections > 50 then 'CRITICAL'
        when unique_connections > 20 then 'ERROR'
        when unique_connections > 10 then 'WARNING'
        else 'OK'
    end as severity
from edge_distribution
where unique_connections > 10
order by severity desc, unique_connections desc

Option 3: Statistical Analysis

Use z-scores to detect outliers within each identifier type:

with edge_distribution as (
    select
        entity_type_a,
        identifier_type_a,
        identifier_value_a,
        count(distinct identifier_value_b) as unique_connections
    from {{ ref('nexus_entity_identifiers_edges') }}
    group by entity_type_a, identifier_type_a, identifier_value_a
),

type_statistics as (
    select
        entity_type_a,
        identifier_type_a,
        avg(unique_connections) as avg_connections,
        stddev(unique_connections) as stddev_connections
    from edge_distribution
    group by entity_type_a, identifier_type_a
),

flagged_edges as (
    select
        ed.*,
        ts.avg_connections,
        ts.stddev_connections,
        (ed.unique_connections - ts.avg_connections) / nullif(ts.stddev_connections, 0) as z_score
    from edge_distribution ed
    join type_statistics ts
        on ed.entity_type_a = ts.entity_type_a
        and ed.identifier_type_a = ts.identifier_type_a
)

select * from flagged_edges
where z_score > 3  -- Statistical outlier
order by z_score desc

Expected Impact

Data Quality

Early detection: Catch problematic edges before they cascade through resolution
Root cause identification: Find sources with invalid identifier mappings
Validation: Ensure edge creation logic is working correctly

Performance

Faster resolution: Fewer bad edges mean smaller graphs and faster recursion
Controlled recursion depth: Prevent exponential expansion from problematic connections
Resource efficiency: Identify issues before consuming large compute resources

Operational

Automated checks: Integrate into CI/CD to catch issues early
Actionable alerts: Flag specific identifiers that need investigation
Historical tracking: Monitor edge quality trends over time

Success Criteria

Detect identifiers with >10 connections before resolution runs
Identify Segment GHL_ID issue through automated check
Block or warn on edge creation exceeding thresholds
Reduce over-merged person resolution by 95%+
Provide clear root cause for flagged edges

Implementation Plan

Phase 1: Analysis Model (Week 1)

Create models/analysis/edge_quality_check.sql to manually inspect edge distribution patterns.

Phase 2: Automated Tests (Week 2)

Add dbt tests that fail if threshold violations are detected during edge creation.

Phase 3: Source Attribution (Week 3)

Enhance check to identify which source contributed the problematic edges.

Phase 4: Filtering (Future)

Consider adding optional edge filtering logic to exclude suspicious connections automatically.

Edge Validation Testing Strategy - Post-resolution validation
Reduce Recursive CTE Performance - Performance optimization
Identity Resolution Architecture - Algorithm details