Nexus Data

Edges connect identifiers that should resolve to the same entity. When an identifier has too many connections — a shared office email appearing on thousands of records, a placeholder phone number, a generic domain — it causes over-merging: unrelated entities get incorrectly linked together.

Edge quality validation catches this before it corrupts resolution.

The Problem

A single bad identifier can cascade. If info@company.com appears as a billing email on 50 different customer records, entity resolution will merge all 50 customers into one entity. The identifier isn't wrong per se — it's real — but it's too promiscuous to be useful for resolution.

Configuration

Edge quality is configured in dbt_project.yml:

```yaml
vars:
  nexus:
    edge_quality:
      critical_threshold: 50
      critical_autofilter: false
      error_threshold: 20
      error_autofilter: false
      warning_threshold: 10
```

| Setting | Default | Description |
| --- | --- | --- |
| `critical_threshold` | 50 | Connections above this are CRITICAL severity |
| `critical_autofilter` | false | Auto-remove edges exceeding critical threshold |
| `error_threshold` | 20 | Connections above this are ERROR severity |
| `error_autofilter` | false | Auto-remove edges exceeding error threshold |
| `warning_threshold` | 10 | Connections above this are WARNING severity |
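
The thresholds map a connection count to a severity. A minimal sketch of that mapping, assuming "above" means strictly greater than and that the most severe matching level wins. The sample rows and the `identifier_counts` name are made up for illustration, and the literals are the defaults from the table above, not values read from `dbt_project.yml`:

```sql
-- Hypothetical sample data: one row per identifier with its connection count.
WITH identifier_counts AS (
    SELECT 'info@company.com' AS identifier_value, 51 AS unique_connections
    UNION ALL SELECT 'billing@company.com', 50
    UNION ALL SELECT 'team@company.com', 11
    UNION ALL SELECT 'jane@company.com', 10
)
SELECT
    identifier_value,
    unique_connections,
    CASE
        WHEN unique_connections > 50 THEN 'CRITICAL'  -- critical_threshold
        WHEN unique_connections > 20 THEN 'ERROR'     -- error_threshold
        WHEN unique_connections > 10 THEN 'WARNING'   -- warning_threshold
        ELSE 'OK'
    END AS severity
FROM identifier_counts
ORDER BY unique_connections DESC;
```

Under these assumptions, a count of exactly 50 is classified as ERROR, not CRITICAL, and a count of exactly 10 is OK.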

Autofilters

When enabled, autofilters remove problematic edges before they enter nexus_entity_identifiers_edges. This means the resolution algorithm never sees them.

How it works:

1. Total connections are calculated across all sources for each identifier.
2. If an identifier exceeds the threshold, all of its edges are removed.
3. The `edge_distributions` analysis model always shows unfiltered data, so you can still investigate.
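
The steps above can be sketched in plain SQL. This is an illustration, not the package's implementation: the `raw_edges` CTE, its columns, and the threshold of 2 (chosen so the sample stays small) are all hypothetical.

```sql
-- Hypothetical edges: each row links identifier_value_a to identifier_value_b in one source.
WITH raw_edges AS (
    SELECT 'stripe' AS source, 'info@company.com' AS identifier_value_a, 'cus_1' AS identifier_value_b
    UNION ALL SELECT 'stripe', 'info@company.com', 'cus_2'
    UNION ALL SELECT 'notion', 'info@company.com', 'user_3'
    UNION ALL SELECT 'stripe', 'jane@company.com', 'cus_4'
),
-- Step 1: total connections per identifier, counted across all sources.
connection_counts AS (
    SELECT identifier_value_a, COUNT(DISTINCT identifier_value_b) AS unique_connections
    FROM raw_edges
    GROUP BY identifier_value_a
)
-- Step 2: keep only edges whose identifier is at or below the threshold,
-- which drops every edge of an identifier that exceeds it.
SELECT e.source, e.identifier_value_a, e.identifier_value_b
FROM raw_edges AS e
JOIN connection_counts AS c
    ON c.identifier_value_a = e.identifier_value_a
WHERE c.unique_connections <= 2;  -- illustrative threshold, not a package default
```

In this sample, `info@company.com` has three connections in total although no single source contributes more than two, so counting across sources is what pushes it over the threshold. All three of its edges are dropped and only the `jane@company.com` edge remains.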

Enable autofilters gradually:

1. Start with `critical_autofilter: true` only.
2. Monitor what is being filtered via `edge_distributions`.
3. Investigate whether the filtered identifiers are legitimate.
4. Once stable, consider `error_autofilter: true`.

Analysis: `edge_distributions`

This model shows all edges (unfiltered) with severity classification. Use it to understand your data quality regardless of whether autofilters are enabled.

Columns

| Column | Description |
| --- | --- |
| `entity_type_a` | Entity type (e.g., `person`) |
| `identifier_type_a` | Identifier type (e.g., `email`, `phone`) |
| `identifier_value_a` | The actual identifier value |
| `unique_connections` | Count of distinct identifiers this connects to |
| `connected_types` | Comma-separated list of connected identifier types |
| `source_distribution` | Breakdown by source (e.g., "stripe (20), notion (5)") |
| `severity` | `CRITICAL`, `ERROR`, `WARNING`, or `OK` |

Useful Queries

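```sql
-- Overview by severity, most severe first
-- (sorting the text column directly would order it alphabetically)
SELECT severity, COUNT(*) AS identifier_count, MAX(unique_connections) AS max_connections
FROM edge_distributions
GROUP BY severity
ORDER BY CASE severity
    WHEN 'CRITICAL' THEN 1
    WHEN 'ERROR' THEN 2
    WHEN 'WARNING' THEN 3
    ELSE 4
END;

-- Find the worst offenders
SELECT identifier_type_a, identifier_value_a, unique_connections, source_distribution
FROM edge_distributions
WHERE severity IN ('CRITICAL', 'ERROR')
ORDER BY unique_connections DESC;

-- Which sources are causing problems?
-- source_distribution is a formatted string, so this groups by exact source mix
SELECT source_distribution, severity, COUNT(*) AS identifier_count
FROM edge_distributions
WHERE severity != 'OK'
GROUP BY source_distribution, severity
ORDER BY identifier_count DESC;
```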

Tests

The test_edge_quality_thresholds test validates that nexus_entity_identifiers_edges (after autofilters) doesn't contain identifiers exceeding the error threshold.

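- With `error_autofilter` enabled, this test should pass, because edges from identifiers above the error threshold are already removed.
- With only `critical_autofilter` enabled, identifiers between the error and critical thresholds remain in the table and can still fail the test.
- Without autofilters, the test fails if any identifier exceeds `error_threshold`, which stops a bad build from going through.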

```shell
dbt test --select test_edge_quality_thresholds
```

Fixing Root Causes

Autofilters are a safety net, not a solution. Common root causes:

- Shared/generic emails: Office-wide emails like info@, billing@, support@ appearing on many records. Filter these at the source level.
- Placeholder values: Test emails, dummy phone numbers. Exclude these in source identifier models.
- High-cardinality domains: Generic email providers (gmail.com, yahoo.com) used as group identifiers. These should typically be excluded from group identifier models.
- Data quality issues: Duplicate or incorrect identifier assignments in source systems.

Fix at the source by filtering in your *_entity_identifiers intermediate models. This is always better than relying on autofilters downstream.
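
A minimal sketch of source-level filtering, assuming an intermediate model that exposes `identifier_type` and `identifier_value` columns. The sample rows, the column names, and the exclusion lists are hypothetical and would need to match your own sources:

```sql
-- Hypothetical sample rows standing in for a source's identifiers.
WITH source_identifiers AS (
    SELECT 'email' AS identifier_type, 'jane@company.com' AS identifier_value
    UNION ALL SELECT 'email', 'info@company.com'
    UNION ALL SELECT 'email', 'test@example.com'
    UNION ALL SELECT 'phone', '0000000000'
    UNION ALL SELECT 'phone', '4155550123'
)
SELECT identifier_type, identifier_value
FROM source_identifiers
-- Drop shared mailboxes and placeholder email addresses.
WHERE NOT (
    identifier_type = 'email'
    AND (
        LOWER(identifier_value) LIKE 'info@%'
        OR LOWER(identifier_value) LIKE 'billing@%'
        OR LOWER(identifier_value) LIKE 'support@%'
        OR LOWER(identifier_value) LIKE '%@example.com'
    )
)
-- Drop known dummy phone numbers.
AND NOT (
    identifier_type = 'phone'
    AND identifier_value IN ('0000000000')
);
```

With this sample, only `jane@company.com` and `4155550123` survive. In a real model, the sample CTE would be replaced by a select from the source's own data, and the exclusion lists would grow from what `edge_distributions` reveals.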