Troubleshooting Entity Resolution

Debugging entity resolution issues — missing merges, over-merging, and resolution verification.

Entities Not Merging

Symptom: Records that should be the same entity have different entity_id values even though they share a common identifier like email.

Root cause: A source is not emitting the shared identifier in its *_entity_identifiers model. If Gmail emits email as an identifier but Notion only emits it as a trait, there's no edge connecting the two records.

Fix: Ensure all sources that should participate in resolution include the shared identifier in their *_entity_identifiers models.

Debugging Steps

1. Check for duplicate identifiers in the final table:

SELECT
    identifier_value,
    identifier_type,
    COUNT(DISTINCT entity_id) as entity_count,
    STRING_AGG(DISTINCT entity_id, ', ') as entity_ids
FROM nexus_entity_identifiers_to_entity_id
WHERE entity_type = 'person'
  AND identifier_type = 'email'
GROUP BY identifier_value, identifier_type
HAVING COUNT(DISTINCT entity_id) > 1
ORDER BY entity_count DESC

If this returns results, the same email resolved to multiple entities — the merge failed.

2. Check what identifiers each source contributes:

SELECT
    source,
    entity_type,
    identifier_type,
    COUNT(*) as count,
    COUNT(DISTINCT identifier_value) as unique_values
FROM nexus_entity_identifiers
GROUP BY source, entity_type, identifier_type
ORDER BY source, entity_type

Compare sources. If one source is missing an identifier type that others have, that's likely the issue.

3. Check if edges exist between the identifiers:

SELECT *
FROM nexus_entity_identifiers_edges
WHERE entity_type = 'person'
  AND (identifier_value_a = 'problem@email.com'
       OR identifier_value_b = 'problem@email.com')

No results means no edge was created — the identifier isn't co-occurring with other identifiers in any event.


Over-Merging

Symptom: Unrelated entities are merged into one — a company entity contains records from multiple unrelated companies.

Root cause: A promiscuous identifier (shared email, generic domain, placeholder value) is connecting records that shouldn't be linked.

Fix: Use edge quality autofilters or filter the problematic identifier at the source level. See Edge Quality.

Debugging Steps

1. Find entities with too many identifiers:

SELECT
    entity_id,
    entity_type,
    COUNT(DISTINCT identifier_value) as identifier_count,
    COUNT(DISTINCT identifier_type) as type_count
FROM nexus_entity_identifiers_to_entity_id
GROUP BY entity_id, entity_type
HAVING COUNT(DISTINCT identifier_value) > 20
ORDER BY identifier_count DESC

2. Inspect the identifiers of a suspicious entity:

SELECT identifier_type, identifier_value
FROM nexus_entity_identifiers_to_entity_id
WHERE entity_id = 'ent_suspect...'
ORDER BY identifier_type

Look for the identifier that's connecting unrelated records — it's usually a shared email or generic domain.

3. Check edge distributions for that identifier:

SELECT *
FROM edge_distributions
WHERE identifier_value_a = 'suspect@shared.com'

Verifying Resolution is Working

Check that resolution is actually reducing identifier count:

SELECT
    entity_type,
    COUNT(DISTINCT identifier_value) as total_identifiers,
    COUNT(DISTINCT entity_id) as resolved_entities,
    ROUND(1.0 - COUNT(DISTINCT entity_id) * 1.0
          / NULLIF(COUNT(DISTINCT identifier_value), 0), 3) as merge_rate
FROM nexus_entity_identifiers_to_entity_id
GROUP BY entity_type

A merge_rate of 0 means no merging is happening. A rate above 0.5 might indicate over-merging.

Check trait resolution for a specific entity:

SELECT entity_id, trait_name, trait_value
FROM nexus_resolved_entity_traits
WHERE entity_id = 'ent_abc123...'
ORDER BY trait_name

Key Models to Check

Model What It Shows
nexus_entity_identifiers All source identifiers (pre-resolution)
nexus_entity_identifiers_edges Edges connecting identifiers (post-filter)
nexus_resolved_{type}_identifiers Resolution output per entity type
nexus_resolved_entity_traits Resolved traits per entity
nexus_entity_identifiers_to_entity_id Final identifier → entity_id mapping
nexus_entities Final entity table with pivoted traits
edge_distributions Edge quality analysis (unfiltered)

Common Fixes

  • Missing merges: Add the shared identifier to the source's *_entity_identifiers intermediate model
  • Over-merging: Filter the promiscuous identifier at source level, or enable error_autofilter: true in edge quality config
  • Slow resolution: Reduce nexus.max_recursion to 3 (default 5). Check edge count — if edges are in the millions, deduplication may not be working
  • Non-deterministic results: Ensure edge_id is set correctly in source identifier models. Using event_id alone as edge_id when events have multiple participants can create false connections