Edge Quality
Preventing over-merge in entity resolution with edge quality thresholds and autofilters.
Edges connect identifiers that should resolve to the same entity. When an identifier has too many connections — a shared office email appearing on thousands of records, a placeholder phone number, a generic domain — it causes over-merging: unrelated entities get incorrectly linked together.
Edge quality validation catches this before it corrupts resolution.
The Problem
A single bad identifier can cascade. If info@company.com appears as a billing
email on 50 different customer records, entity resolution will merge all 50
customers into one entity. The identifier isn't wrong per se — it's real — but
it's too promiscuous to be useful for resolution.
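To make the cascade concrete, here is a minimal sketch that resolves entities with a plain union-find. The data and the `resolve` helper are hypothetical illustrations, not the package's actual resolution code; the point is only that one shared identifier is enough to collapse every record it touches into a single entity.

```python
def resolve(edges):
    """Group connected identifiers into entities using a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Hypothetical data: 50 customers, each with their own email address.
clean_edges = [(f"customer:{i}", f"email:user{i}@example.com") for i in range(50)]
# The same customers, each also carrying one shared billing email.
shared_edges = [(f"customer:{i}", "email:info@company.com") for i in range(50)]

entities_clean = resolve(clean_edges)
entities_merged = resolve(clean_edges + shared_edges)
print(len(entities_clean), len(entities_merged))
```

Without the shared email the sketch yields one entity per customer; with it, every customer ends up in the same connected component.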
Configuration
Edge quality is configured in dbt_project.yml:
vars:
  nexus:
    edge_quality:
      critical_threshold: 50
      critical_autofilter: false
      error_threshold: 20
      error_autofilter: false
      warning_threshold: 10
| Setting | Default | Description |
|---|---|---|
| critical_threshold | 50 | Connections above this are CRITICAL severity |
| critical_autofilter | false | Auto-remove edges exceeding critical threshold |
| error_threshold | 20 | Connections above this are ERROR severity |
| error_autofilter | false | Auto-remove edges exceeding error threshold |
| warning_threshold | 10 | Connections above this are WARNING severity |
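The thresholds can be read as a simple cascade from most to least severe. The sketch below assumes "above" means strictly greater than the threshold; `EdgeQualityConfig` and `classify_severity` are hypothetical names used for illustration, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeQualityConfig:
    # Defaults mirror the documented settings.
    critical_threshold: int = 50
    error_threshold: int = 20
    warning_threshold: int = 10


def classify_severity(unique_connections, config=EdgeQualityConfig()):
    """Map a connection count to a severity label.

    Assumption: a count must be strictly greater than a threshold to reach
    that severity, following the wording "connections above this".
    """
    if unique_connections > config.critical_threshold:
        return "CRITICAL"
    if unique_connections > config.error_threshold:
        return "ERROR"
    if unique_connections > config.warning_threshold:
        return "WARNING"
    return "OK"


for count in (5, 11, 21, 51):
    print(count, classify_severity(count))
```

Under this reading, an identifier with exactly 50 connections is still ERROR rather than CRITICAL.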
Autofilters
When enabled, autofilters remove problematic edges before they enter
nexus_entity_identifiers_edges. This means the resolution algorithm never sees
them.
How it works:
- Total connections are calculated across all sources for each identifier
- If an identifier exceeds the threshold, all its edges are removed
- The edge_distributions analysis model always shows unfiltered data so you can still investigate
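The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: edges are modelled as hypothetical `(source, identifier_a, identifier_b)` tuples, a "connection" is a distinct identifier on the other end of an edge, and the threshold comparison is strictly greater than.

```python
from collections import defaultdict


def autofilter_edges(edges, threshold):
    """Drop every edge that touches an identifier exceeding the threshold.

    edges: iterable of (source, identifier_a, identifier_b) tuples.
    Returns (kept_edges, blocked_identifiers).
    """
    # Count distinct connections per identifier across ALL sources.
    connections = defaultdict(set)
    for _source, a, b in edges:
        connections[a].add(b)
        connections[b].add(a)

    blocked = {
        ident for ident, others in connections.items() if len(others) > threshold
    }
    # Remove all edges of a blocked identifier, not just the excess ones.
    kept = [e for e in edges if e[1] not in blocked and e[2] not in blocked]
    return kept, blocked


# Hypothetical example: no single source exceeds the threshold on its own,
# but the shared email does once both sources are combined.
edges = [
    ("stripe", "email:info@company.com", "customer:1"),
    ("stripe", "email:info@company.com", "customer:2"),
    ("notion", "email:info@company.com", "customer:3"),
    ("notion", "email:info@company.com", "customer:4"),
    ("stripe", "email:ann@example.com", "customer:1"),
]
kept, blocked = autofilter_edges(edges, threshold=3)
print(blocked)
print(kept)
```

Counting across sources matters here: each source contributes only two connections for the shared email, yet the combined total of four is what trips the filter.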
Enable autofilters gradually:
- Start with critical_autofilter: true only
- Monitor what's being filtered via edge_distributions
- Investigate whether filtered identifiers are legitimate
- Once stable, consider error_autofilter: true
Analysis: edge_distributions
This model shows all edges (unfiltered) with severity classification. Use it to understand your data quality regardless of whether autofilters are enabled.
Columns
| Column | Description |
|---|---|
| entity_type_a | Entity type (e.g., person) |
| identifier_type_a | Identifier type (e.g., email, phone) |
| identifier_value_a | The actual identifier value |
| unique_connections | Count of distinct identifiers this connects to |
| connected_types | Comma-separated list of connected identifier types |
| source_distribution | Breakdown by source (e.g., "stripe (20), notion (5)") |
| severity | CRITICAL, ERROR, WARNING, or OK |
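As a rough illustration of how the aggregate columns relate to the underlying edges, the sketch below derives them for a single identifier. The input shape and the `summarize_connections` helper are hypothetical, and it assumes that the per-source numbers in source_distribution count distinct connected identifiers seen in that source; the real model may count differently.

```python
from collections import Counter


def summarize_connections(edges):
    """Summarize one identifier's edges.

    edges: list of (source, connected_type, connected_value) tuples.
    """
    connected = {(ctype, value) for _source, ctype, value in edges}

    # Assumption: each source counts the distinct identifiers it contributes.
    per_source = Counter(source for source, _ctype, _value in set(edges))
    ordered = sorted(per_source.items(), key=lambda item: (-item[1], item[0]))

    return {
        "unique_connections": len(connected),
        "connected_types": ", ".join(sorted({ctype for ctype, _ in connected})),
        "source_distribution": ", ".join(f"{s} ({n})" for s, n in ordered),
    }


# Hypothetical edges for the identifier info@company.com.
row = summarize_connections([
    ("stripe", "email", "a@example.com"),
    ("stripe", "email", "b@example.com"),
    ("stripe", "phone", "555-0100"),
    ("notion", "email", "a@example.com"),
])
print(row)
```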
Useful Queries
-- Overview by severity, most severe first
SELECT severity, COUNT(*) AS identifier_count, MAX(unique_connections) AS max_connections
FROM edge_distributions
GROUP BY severity
ORDER BY CASE severity
    WHEN 'CRITICAL' THEN 1
    WHEN 'ERROR' THEN 2
    WHEN 'WARNING' THEN 3
    ELSE 4
END;

-- Find the worst offenders
SELECT identifier_type_a, identifier_value_a, unique_connections, source_distribution
FROM edge_distributions
WHERE severity IN ('CRITICAL', 'ERROR')
ORDER BY unique_connections DESC;

-- Which sources are causing problems?
SELECT source_distribution, severity, COUNT(*) AS identifier_count
FROM edge_distributions
WHERE severity != 'OK'
GROUP BY source_distribution, severity
ORDER BY identifier_count DESC;
Tests
The test_edge_quality_thresholds test validates that
nexus_entity_identifiers_edges (after autofilters) doesn't contain identifiers
exceeding the error threshold.
- With autofilters enabled, this test should pass — problematic edges are already removed
- Without autofilters, the test fails if any identifier exceeds error_threshold, preventing a bad build from going through
dbt test --select test_edge_quality_thresholds
Fixing Root Causes
Autofilters are a safety net, not a solution. Common root causes:
- Shared/generic emails: Office-wide emails like info@, billing@, support@ appearing on many records. Filter these at the source level.
- Placeholder values: Test emails, dummy phone numbers. Exclude in source identifier models.
- High-cardinality domains: Generic email providers (gmail.com, yahoo.com) used as group identifiers. These should typically be excluded from group identifier models.
- Data quality issues: Duplicate or incorrect identifier assignments in source systems.
Fix at the source by filtering in your *_entity_identifiers intermediate
models. This is always better than relying on autofilters downstream.
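In a dbt project this filtering would normally live in a WHERE clause of the relevant intermediate model; the Python sketch below only illustrates the kind of rules involved. The function names and the deny-lists are hypothetical examples and would need to be tailored to your own data.

```python
# Hypothetical deny-lists; extend them based on what edge_distributions reveals.
GENERIC_LOCAL_PARTS = {"info", "billing", "support"}
PLACEHOLDER_EMAILS = {"test@example.com", "none@none.com"}
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com"}


def keep_email_identifier(value):
    """Return True if an email is specific enough to identify a person."""
    email = value.strip().lower()
    if email in PLACEHOLDER_EMAILS:
        return False
    local, _, domain = email.partition("@")
    if not local or not domain:
        return False  # malformed value, not usable as an identifier
    return local not in GENERIC_LOCAL_PARTS


def keep_domain_identifier(value):
    """Return True if a domain is specific enough to identify a group."""
    return value.strip().lower() not in FREE_EMAIL_DOMAINS


emails = ["Info@Company.com", "ann@company.com", "test@example.com"]
print([e for e in emails if keep_email_identifier(e)])
print([d for d in ["gmail.com", "company.com"] if keep_domain_identifier(d)])
```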