Edge Quality

Preventing over-merge in entity resolution with edge quality thresholds and autofilters.

Edges connect identifiers that should resolve to the same entity. When an identifier has too many connections — a shared office email appearing on thousands of records, a placeholder phone number, a generic domain — it causes over-merging: unrelated entities get incorrectly linked together.

Edge quality validation catches this before it corrupts resolution.


The Problem

A single bad identifier can cascade. If info@company.com appears as a billing email on 50 different customer records, entity resolution will merge all 50 customers into one entity. The identifier is real, so it isn't wrong per se, but it is shared too widely to distinguish one entity from another.
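To make the cascade concrete, here is a minimal sketch that uses a small union-find to group identifiers connected by edges. It is not the package's resolution algorithm; the `customer_id:` and `email:` node names and the `resolve` helper are hypothetical and exist only for illustration.

```python
# Illustrative only: a tiny union-find, not the package's actual resolver.

def find(parent, node):
    """Return the root of node's group, compressing the path as we go."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def resolve(edges):
    """Group identifiers connected by edges; returns a list of sets."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())


# 50 hypothetical customers, each with a personal email plus the shared one.
edges = []
for i in range(50):
    edges.append((f"customer_id:{i}", f"email:user{i}@example.com"))
    edges.append((f"customer_id:{i}", "email:info@company.com"))

entities = resolve(edges)
customers_merged = sum(1 for n in entities[0] if n.startswith("customer_id:"))
print(len(entities), customers_merged)  # 1 50

# Dropping the shared email's edges keeps the customers separate.
clean_edges = [e for e in edges if e[1] != "email:info@company.com"]
print(len(resolve(clean_edges)))  # 50
```

With the shared email present, every customer ends up in one group. Once its edges are dropped, each customer resolves on its own.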


Configuration

Edge quality is configured in dbt_project.yml:

vars:
  nexus:
    edge_quality:
      critical_threshold: 50
      critical_autofilter: false
      error_threshold: 20
      error_autofilter: false
      warning_threshold: 10

Setting               Default   Description
--------------------  --------  ----------------------------------------------
critical_threshold    50        Connections above this are CRITICAL severity
critical_autofilter   false     Auto-remove edges exceeding critical threshold
error_threshold       20        Connections above this are ERROR severity
error_autofilter      false     Auto-remove edges exceeding error threshold
warning_threshold     10        Connections above this are WARNING severity
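The thresholds can be read as a cascade checked from most to least severe. The sketch below shows that reading in Python. `classify_severity` and `DEFAULT_THRESHOLDS` are hypothetical names, and the strict `>` comparison is an assumption based on the word "above"; the package's own boundary handling may differ.

```python
# Defaults copied from the settings table; everything else is illustrative.
DEFAULT_THRESHOLDS = {"critical": 50, "error": 20, "warning": 10}


def classify_severity(unique_connections, thresholds=DEFAULT_THRESHOLDS):
    """Map a connection count to a severity label (assumes strict '>')."""
    if unique_connections > thresholds["critical"]:
        return "CRITICAL"
    if unique_connections > thresholds["error"]:
        return "ERROR"
    if unique_connections > thresholds["warning"]:
        return "WARNING"
    return "OK"


for count in (5, 11, 21, 51):
    print(count, classify_severity(count))
```

Under this reading, an identifier with exactly 50 connections is ERROR rather than CRITICAL.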

Autofilters

When enabled, autofilters remove problematic edges before they enter nexus_entity_identifiers_edges. This means the resolution algorithm never sees them.

How it works:

  1. Total connections are calculated across all sources for each identifier
  2. If an identifier exceeds the threshold, all its edges are removed
  3. The edge_distributions analysis model always shows unfiltered data so you can still investigate
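The first two steps above can be sketched as follows. This is a simplified illustration, not the package's SQL: the `(source, identifier_a, identifier_b)` tuple shape and the `autofilter_edges` helper are assumptions, and "connections" is taken to mean distinct neighbouring identifiers.

```python
from collections import defaultdict


def autofilter_edges(edges, threshold):
    """Drop every edge touching an identifier with too many connections.

    edges: list of (source, identifier_a, identifier_b) tuples (assumed shape).
    Returns (kept_edges, filtered_identifiers).
    """
    # Step 1: count distinct connections per identifier across all sources.
    neighbors = defaultdict(set)
    for _source, a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Step 2: an identifier over the threshold loses all of its edges.
    over = {ident for ident, conns in neighbors.items() if len(conns) > threshold}
    kept = [e for e in edges if e[1] not in over and e[2] not in over]
    return kept, over


shared = "email:info@company.com"
edges = (
    [("stripe", shared, f"customer_id:{i}") for i in range(3)]
    + [("notion", shared, f"customer_id:{i}") for i in range(3, 5)]
    + [("stripe", "email:ann@example.com", "customer_id:0")]
)

kept, filtered = autofilter_edges(edges, threshold=4)
print(filtered)  # {'email:info@company.com'}
print(kept)      # [('stripe', 'email:ann@example.com', 'customer_id:0')]
```

Neither source alone pushes the shared email past the threshold (3 and 2 connections), but the combined total of 5 does. This is why the count is taken across all sources.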

Enable autofilters gradually:

  1. Start with critical_autofilter: true only
  2. Monitor what's being filtered via edge_distributions
  3. Investigate whether filtered identifiers are legitimate
  4. Once stable, consider error_autofilter: true

Analysis: edge_distributions

This model shows all edges (unfiltered) with severity classification. Use it to understand your data quality regardless of whether autofilters are enabled.

Columns

Column               Description
-------------------  -----------------------------------------------------
entity_type_a        Entity type (e.g., person)
identifier_type_a    Identifier type (e.g., email, phone)
identifier_value_a   The actual identifier value
unique_connections   Count of distinct identifiers this connects to
connected_types      Comma-separated list of connected identifier types
source_distribution  Breakdown by source (e.g., "stripe (20), notion (5)")
severity             CRITICAL, ERROR, WARNING, or OK

Useful Queries

-- Overview by severity, most severe first
-- (ORDER BY severity alone would sort the labels alphabetically)
SELECT severity, COUNT(*) AS identifier_count, MAX(unique_connections) AS max_connections
FROM edge_distributions
GROUP BY severity
ORDER BY CASE severity
    WHEN 'CRITICAL' THEN 1
    WHEN 'ERROR' THEN 2
    WHEN 'WARNING' THEN 3
    ELSE 4
END;

-- Find the worst offenders
SELECT identifier_type_a, identifier_value_a, unique_connections, source_distribution
FROM edge_distributions
WHERE severity IN ('CRITICAL', 'ERROR')
ORDER BY unique_connections DESC;

-- Which sources are causing problems?
SELECT source_distribution, severity, COUNT(*) AS identifier_count
FROM edge_distributions
WHERE severity != 'OK'
GROUP BY source_distribution, severity
ORDER BY identifier_count DESC;
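The last query groups on the full source_distribution string, so "stripe (20), notion (5)" and "stripe (3)" land in separate rows. If you export the results, a small parser can total connections per source. The sketch below assumes the "name (count), name (count)" format shown in the Columns table; `parse_source_distribution` and `connections_by_source` are hypothetical helpers, not part of the package.

```python
import re
from collections import Counter

# One "name (count)" fragment, e.g. "stripe (20)".
_PART = re.compile(r"\s*([^,()]+?)\s*\((\d+)\)\s*")


def parse_source_distribution(text):
    """Turn 'stripe (20), notion (5)' into {'stripe': 20, 'notion': 5}."""
    counts = {}
    for part in text.split(","):
        match = _PART.fullmatch(part)
        if match:  # silently skip fragments that do not fit the assumed format
            counts[match.group(1)] = int(match.group(2))
    return counts


def connections_by_source(distributions):
    """Sum connection counts per source over many source_distribution values."""
    totals = Counter()
    for text in distributions:
        totals.update(parse_source_distribution(text))
    return totals


rows = ["stripe (20), notion (5)", "stripe (3)"]
print(connections_by_source(rows).most_common())  # [('stripe', 23), ('notion', 5)]
```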

Tests

The test_edge_quality_thresholds test validates that nexus_entity_identifiers_edges (after autofilters) doesn't contain identifiers exceeding the error threshold.

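  • With error_autofilter enabled, this test should pass, because edges above the error threshold are already removed. With only critical_autofilter enabled, identifiers between error_threshold and critical_threshold can still fail it.
  • Without autofilters, the test fails if any identifier exceeds error_threshold, preventing a bad build from going through.

Run the test on its own with:

dbt test --select test_edge_quality_thresholds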
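A dbt test fails when its query returns rows, so the check amounts to "list every identifier still above error_threshold". The sketch below mimics that logic on per-identifier connection counts. The function names and sample counts are hypothetical. It also simplifies by dropping identifiers rather than recomputing counts after edges are removed.

```python
def failing_identifiers(connection_counts, error_threshold=20):
    """Identifiers that would make the test fail (empty dict means pass)."""
    return {i: n for i, n in connection_counts.items() if n > error_threshold}


def drop_above(connection_counts, threshold):
    """Simplified stand-in for an autofilter: remove identifiers over threshold."""
    return {i: n for i, n in connection_counts.items() if n <= threshold}


# Made-up counts for illustration.
counts = {
    "email:info@company.com": 75,
    "phone:555-0100": 30,
    "email:ann@example.com": 2,
}

no_filter = failing_identifiers(counts)
critical_only = failing_identifiers(drop_above(counts, 50))
with_error_filter = failing_identifiers(drop_above(counts, 20))

print(sorted(no_filter))          # ['email:info@company.com', 'phone:555-0100']
print(sorted(critical_only))      # ['phone:555-0100']
print(sorted(with_error_filter))  # []
```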

Fixing Root Causes

Autofilters are a safety net, not a solution. Common root causes:

  • Shared/generic emails: Office-wide emails like info@, billing@, support@ appearing on many records. Filter these at the source level.
  • Placeholder values: Test emails, dummy phone numbers. Exclude in source identifier models.
  • High-cardinality domains: Generic email providers (gmail.com, yahoo.com) used as group identifiers. These should typically be excluded from group identifier models.
  • Data quality issues: Duplicate or incorrect identifier assignments in source systems.

Fix at the source by filtering in your *_entity_identifiers intermediate models. This is always better than relying on autofilters downstream.
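In a dbt project this filtering would normally live in the SQL of the intermediate models. The Python sketch below only illustrates the rules. The deny-lists and helper names are hypothetical and should be replaced with values drawn from your own edge_distributions findings.

```python
# Hypothetical deny-lists for illustration; tune them to your own data.
GENERIC_LOCAL_PARTS = {"info", "billing", "support"}
PLACEHOLDER_EMAILS = {"test@test.com", "none@example.com"}
GENERIC_PROVIDER_DOMAINS = {"gmail.com", "yahoo.com"}


def is_usable_person_email(email):
    """True if the email looks specific enough to identify one person."""
    email = email.strip().lower()
    if email in PLACEHOLDER_EMAILS:
        return False
    local, sep, domain = email.partition("@")
    if not sep or not local or not domain:
        return False
    return local not in GENERIC_LOCAL_PARTS


def group_domain(email):
    """Domain to use as a group identifier, or None for generic providers."""
    email = email.strip().lower()
    if "@" not in email:
        return None
    domain = email.rpartition("@")[2]
    if not domain or domain in GENERIC_PROVIDER_DOMAINS:
        return None
    return domain


print(is_usable_person_email("Info@Company.com"))  # False
print(is_usable_person_email("ann@company.com"))   # True
print(group_domain("ann@gmail.com"))               # None
print(group_domain("ann@company.com"))             # company.com
```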