Gmail Template Source
Ready-to-use Gmail integration for email events, person identifiers, and group relationships
The Gmail Template Source provides instant integration with Gmail data through the Nango ETL pipeline. It processes email messages into events, extracts entity identifiers for both people and groups, and creates relationship declarations - all through simple configuration.
Overview
The Gmail template source transforms raw Gmail message data into the nexus framework using a split architecture that handles per-account normalization and cross-account aggregation:
- ๐ง Email Events: Each message becomes a
message_sentevent; each thread becomes athread startedevent - ๐ค Person Entities: Extracts email addresses from senders and recipients
- ๐ข Group Entities: Creates groups from non-generic email domains
- ๐ Relationships: Links people to their organizations via email domains (membership type)
- ๐ Cross-Account Deduplication: Uses
Message-IDheader (message_id_header) to deduplicate emails across multiple Gmail accounts - ๐งต Thread Grouping: Uses
first_message_id_headerto group messages into threads across accounts - ๐จ Participant Roles: Distinguishes between
sender,recipient,cced, andbccedparticipants - ๐ Account Mapping: Maintains dictionaries mapping accounts to
Gmail-specific IDs (
gmail_message_ids,gmail_thread_ids)
Quick Start
1. Enable the Template Source
# dbt_project.yml
vars:
nexus:
gmail:
enabled: true
2. Run the Models
dbt run --select package:nexus
3. Explore Your Data
-- View recent Gmail events
SELECT * FROM {{ ref('nexus_events') }}
WHERE source = 'gmail'
ORDER BY occurred_at DESC
LIMIT 10;
-- See email participants with entity information
SELECT
ev.event_description as subject,
ev.occurred_at,
e.name,
e.email,
ei.role
FROM {{ ref('nexus_events') }} ev
JOIN {{ ref('nexus_entity_identifiers') }} ei ON ev.event_id = ei.event_id
JOIN {{ ref('nexus_entities') }} e ON ei.identifier_value = e.email
WHERE ev.source = 'gmail'
AND e.entity_type = 'person'
ORDER BY ev.occurred_at DESC
LIMIT 20;
Configuration
Basic Configuration
# dbt_project.yml
vars:
nexus:
gmail:
enabled: true
# Uses defaults: schema=gmail, table=messages
Custom Source Location
vars:
nexus:
gmail:
enabled: true
location:
schema: my_email_data
table: gmail_messages
Required Global Variables
vars:
# Required: Define internal email domains
internal_domains:
- 'yourcompany.com'
- 'subsidiary.com'
# Optional: Test email addresses
test_emails:
- 'test@yourcompany.com'
Data Requirements
Source Table Schema
Your Gmail source table must use the STANDARD_TABLE_SCHEMA structure:
CREATE TABLE `project.schema.table` (
_raw_record JSON NOT NULL, -- Gmail message as JSON
_ingested_at TIMESTAMP NOT NULL, -- When the record was ingested
_connection_id STRING NOT NULL, -- Nango connection ID
_stream_id STRING NOT NULL, -- Stream identifier (e.g., 'default')
_sync_timestamp TIMESTAMP, -- Timestamp-based cursor for incremental sync
_sync_token STRING -- Token-based cursor (history ID)
);
Gmail Message JSON Structure
The _raw_record column should contain Gmail API message format:
{
"id": "message_id_123",
"threadId": "thread_id_456",
"internalDate": "1609459200000",
"headers": [
{
"name": "Message-ID",
"value": "<message-id-abc123@mail.gmail.com>"
},
{
"name": "Subject",
"value": "Meeting Follow-up"
},
{
"name": "From",
"value": "John Doe <john@company.com>"
},
{
"name": "To",
"value": "jane@client.com, bob@partner.com"
},
{
"name": "Cc",
"value": "team@company.com"
},
{
"name": "Bcc",
"value": "archive@company.com"
}
],
"body_text": "Thanks for the great meeting..."
}
Key Fields:
id: Gmail-specific message ID (unique within a single account)threadId: Thread identifierinternalDate: Unix timestamp in millisecondsheaders: Array of header objects (must includeMessage-IDfor deduplication)body_text: Plain text body content
Generated Models
Gmail uses a split architecture with per-account and cross-account layers for optimal performance and cross-account deduplication:
Layer 1: Base - gmail_messages_base_dedupped.sql
Simple passthrough with basic deduplication:
Key Features:
- Direct
SELECT *from source table with deduplication - Minimal transformation overhead
- Uses
QUALIFYto keep latest record per Gmail message ID per account
Layer 2: Normalized - Split Architecture
The normalized layer is split into two tiers:
Data Model Relationships
erDiagram
gmail_messages ||--o{ gmail_message_participants : "has"
gmail_threads ||--o{ gmail_messages : "contains"
gmail_threads ||--o{ gmail_thread_participants : "has"
gmail_messages {
string message_id PK "Message-ID header"
string thread_id FK "From first_message_id_header"
timestamp sent_at
string subject
string gmail_message_ids "JSON dict"
string gmail_thread_ids "JSON dict"
string label_ids "JSON dict"
array all_label_ids
}
gmail_message_participants {
string message_id FK "Message-ID header"
string email
string role "sender/recipient/cced/bcced"
string domain
timestamp sent_at
}
gmail_threads {
string thread_id PK "first_message_id_header"
string subject
timestamp first_message_sent_at
timestamp last_message_sent_at
string gmail_thread_ids "JSON dict"
string label_ids "JSON dict"
array all_label_ids
}
gmail_thread_participants {
string thread_id FK "first_message_id_header"
string email
array roles
string domain
timestamp first_participated_at
timestamp last_participated_at
}
Per-Account Models (gmail_normalized_by_account/)
These models normalize data using Gmail-specific IDs (per account):
-
gmail_messages_by_account.sql: Per-account message normalization- Uses
gmail_message_id(from$.id) as primary key - Extracts
message_id_header(Message-ID header) for cross-account linking - Extracts
gmail_thread_id(from$.threadId) - Extracts subject, labels, snippet, size_estimate
- Extracts automation/bulk header fields into dedicated columns
- Cleans subject by removing RE:/FWD: prefixes
- Decodes HTML entities in snippet
- Uses
-
gmail_threads_by_account.sql: Per-account thread aggregation- Groups messages by
gmail_thread_idand_account - Aggregates thread metadata (subject, timestamps, message counts)
- Creates
first_message_id_headerfor cross-account thread linking - Aggregates
label_idsfrom all messages in thread
- Groups messages by
-
gmail_message_participants_by_account.sql: Per-account participant extraction- One row per participant per message per account
- Uses
gmail_message_idfor linking - Extracts and normalizes email addresses from headers
-
gmail_thread_participants_by_account.sql: Per-account thread participant aggregation- Groups participants by
gmail_thread_idand_account - Aggregates roles and participation timestamps
- Groups participants by
Cross-Account Models
These models aggregate per-account data using cross-account identifiers:
gmail_messages.sql: Cross-account message aggregation- Groups by
message_id_header(Message-ID header) - Creates dictionaries mapping
_stream_idtogmail_message_idandgmail_thread_id - Aggregates
label_idsper account and createsall_label_idsarray - Carries forward automation/bulk header columns and derived boolean
- Joins with threads to get
thread_idfromfirst_message_id_header - Keeps latest values using
ARRAY_AGGwithORDER BY sent_at DESC
- Groups by
Key Fields:
message_id -- Primary key from message_id_header (Message-ID)
thread_id -- Thread ID from first_message_id_header
sent_at -- Latest sent_at timestamp
subject -- Cleaned subject (latest)
raw_subject -- Original subject (latest)
in_reply_to -- In-Reply-To header (latest)
auto_submitted_header -- Auto-Submitted header value
precedence_header -- Precedence header value (bulk/list/junk)
list_id_header -- List-Id header value
list_unsubscribe_header -- List-Unsubscribe header value
x_auto_response_suppress_header -- X-Auto-Response-Suppress header value
x_autoreply_header -- X-Autoreply header value
x_autorespond_header -- X-Autorespond header value
is_automated_or_bulk_message -- Derived boolean from automation/bulk headers
snippet -- Message snippet (HTML decoded)
size_estimate -- Message size estimate
gmail_message_ids -- JSON dict: {"account": "gmail_message_id"}
gmail_thread_ids -- JSON dict: {"account": "gmail_thread_id"}
label_ids -- JSON dict: {"account": ["label1", "label2"]}
all_label_ids -- Array of all unique labels across accounts
raw_record -- Full JSON record (latest)
source -- "gmail"
_ingested_at -- Latest ingestion timestamp
Automation/Bulk Header Fields
The Gmail normalized message models expose common RFC/vendor headers that help distinguish personal emails from list mail, notifications, and auto-generated messages.
Available columns:
auto_submitted_headerprecedence_headerlist_id_headerlist_unsubscribe_headerx_auto_response_suppress_headerx_autoreply_headerx_autorespond_headeris_automated_or_bulk_message(derived boolean)
Current boolean logic marks a message as automated/bulk when any of these are true:
auto_submitted_headerstarts withauto-precedence_headeris one ofbulk,list, orjunk- any of
list_id_header,list_unsubscribe_header,x_auto_response_suppress_header,x_autoreply_header,x_autorespond_headeris present and non-empty
Example usage:
SELECT
message_id,
sent_at,
subject,
is_automated_or_bulk_message,
auto_submitted_header,
precedence_header
FROM {{ ref('gmail_messages') }}
WHERE is_automated_or_bulk_message = FALSE
ORDER BY sent_at DESC
LIMIT 50;
gmail_threads.sql: Cross-account thread aggregation- Groups by
first_message_id_header(from per-account threads) - Creates
gmail_thread_idsdictionary mapping accounts to Gmail thread IDs - Aggregates
label_idsper account and createsall_label_idsarray - Aggregates thread metadata (subject, timestamps)
- Groups by
Key Fields:
thread_id -- Primary key from first_message_id_header
subject -- Thread subject (from first message)
raw_subject -- Original subject (from first message)
first_message_sent_at -- Earliest message timestamp
last_message_sent_at -- Latest message timestamp
gmail_thread_ids -- JSON dict: {"account": "gmail_thread_id"}
label_ids -- JSON dict: {"account": ["label1", "label2"]}
all_label_ids -- Array of all unique labels across accounts
_ingested_at -- Earliest ingestion timestamp
-
gmail_message_participants.sql: Cross-account participant aggregation- Groups by
message_id_header,email, androle - Joins with
gmail_messages_by_accountto getmessage_id_header - Keeps latest participant information per message
- Groups by
-
gmail_thread_participants.sql: Cross-account thread participant aggregation- Groups by
thread_id(fromfirst_message_id_header),email, anddomain - Aggregates roles and participation timestamps across accounts
- Groups by
Layer 3: Intermediate - Events and Identifiers
-
gmail_message_events.sql: Message events with full metadata- Creates
message sentevents fromgmail_messages - Includes all message fields (gmail_message_ids, gmail_thread_ids, labels, snippet, etc.)
- Creates
-
gmail_thread_events.sql: Thread started events- Creates
thread startedevents fromgmail_threads - Uses
first_message_sent_atasoccurred_at
- Creates
-
gmail_message_person_identifiers.sql: Person identifiers from message participants -
gmail_message_group_identifiers.sql: Group (domain) identifiers from message participants -
gmail_thread_person_identifiers.sql: Person identifiers from thread participants -
gmail_thread_group_identifiers.sql: Group (domain) identifiers from thread participants
Key Pattern: Event models read from normalized cross-account models.
Identifier models read from participant models and create identifiers linked to
events via event_id.
Layer 4: Union - Nexus Integration
These models feed directly into the nexus pipeline:
gmail_events
Unified events from messages and threads:
event_id -- Unique event identifier (evt_ prefix)
event_name -- "message sent" or "thread started"
occurred_at -- Email send time (sent_at) or thread start (first_message_sent_at)
event_description -- Email subject or thread subject
event_type -- "email"
source -- "gmail"
source_table -- "gmail_message_events" or "gmail_thread_events"
_ingested_at -- Ingestion timestamp
-- Additional fields from messages/threads
message_id -- Message-ID header (for message events)
thread_id -- Thread identifier
gmail_message_ids -- Account mapping dictionary
gmail_thread_ids -- Account mapping dictionary
label_ids -- Account label mapping
all_label_ids -- All unique labels
gmail_entity_identifiers
Unified person + group identifiers from messages and threads:
entity_identifier_id -- Unique identifier (ent_idfr_ prefix)
event_id -- Reference to email/thread event
edge_id -- Groups related identifiers
entity_type -- "person" or "group"
identifier_type -- "email" or "domain"
identifier_value -- Email address or domain
role -- Participation role
occurred_at -- Email/thread timestamp
source -- "gmail"
_ingested_at -- Ingestion timestamp
Role Types:
- Person roles:
sender,recipient,cced,bcced - Group roles:
sender,recipient,cced,bcced(for domain tracking)
Note: Traits and relationship declarations are not generated for Gmail (as per user requirements).
Filtered Generic Domains:
- gmail.com, yahoo.com, hotmail.com, outlook.com
- aol.com, icloud.com, me.com, live.com, msn.com
- googlemail.com, ymail.com, rocketmail.com, protonmail.com
- mail.com, zoho.com
Cross-Account Architecture
Gmail uses a split architecture to handle cross-account deduplication and aggregation:
Per-Account Normalization
First, data is normalized per account using Gmail-specific IDs:
gmail_messages_by_account: Usesgmail_message_id(from$.id) as primary keygmail_threads_by_account: Usesgmail_thread_id(from$.threadId) as primary key- Extracts
message_id_header(Message-ID header) for cross-account linking - Extracts
first_message_id_headerfrom threads for cross-account thread grouping
Cross-Account Aggregation
Then, data is aggregated across accounts using standard email identifiers:
- Messages: Grouped by
message_id_header(Message-ID header) - Threads: Grouped by
first_message_id_header(from the first message in thread) - Participants: Grouped by
message_id_header,email, androle
Message-ID Header for Deduplication
- Primary Key:
message_id_header(Message-ID header value) for messages - Deduplication: Messages with the same
Message-IDacross different Gmail accounts are treated as the same email - Use Case: Same email appearing in sender's and all recipients' inboxes
Thread Grouping
- Thread Key:
first_message_id_header(Message-ID of the first message in thread) - Grouping: All messages in a thread share the same
first_message_id_header - Cross-Account: Threads are linked across accounts using this identifier
Account Mapping Dictionaries
The cross-account models maintain dictionaries to map accounts to Gmail-specific IDs:
gmail_message_ids:{"account1@example.com": "gmail_id_1", "account2@example.com": "gmail_id_2"}gmail_thread_ids:{"account1@example.com": "thread_id_1", "account2@example.com": "thread_id_2"}label_ids:{"account1@example.com": ["INBOX", "SENT"], "account2@example.com": ["INBOX"]}
This allows you to:
- Track which Gmail account a message/thread belongs to
- Access account-specific labels
- Link back to per-account data when needed
Why Message-ID Instead of Gmail ID?
Gmail message IDs ($.id) are account-specific. The same email has different
IDs when it appears in:
- Sender's Sent folder
- Recipient's Inbox
- CC recipient's Inbox
Message-ID is standardized by RFC 2822 and is consistent across all accounts,
making it the perfect deduplication key for cross-account aggregation.
Integration Examples
Customer Communication Timeline
-- View all email communication with a specific customer
WITH customer AS (
SELECT entity_id, email, name
FROM {{ ref('nexus_entities') }}
WHERE email = 'customer@client.com'
AND entity_type = 'person'
)
SELECT
e.occurred_at,
e.event_description as subject,
sender.email as from_email,
sender.name as from_name,
customer.email as customer_email,
ei.role as customer_role
FROM {{ ref('nexus_events') }} e
JOIN {{ ref('nexus_entity_identifiers') }} ei ON e.event_id = ei.event_id
JOIN customer ON ei.identifier_value = customer.email
LEFT JOIN {{ ref('nexus_entities') }} sender
ON sender.entity_type = 'person'
AND EXISTS (
SELECT 1 FROM {{ ref('nexus_entity_identifiers') }} ei2
WHERE ei2.event_id = e.event_id
AND ei2.identifier_value = sender.email
AND ei2.role = 'sender'
)
WHERE e.source = 'gmail'
ORDER BY e.occurred_at DESC;
Email Domain Analysis
-- Analyze email communication by domain with relationship data
SELECT
g.name as domain,
COUNT(DISTINCT e.event_id) as email_count,
COUNT(DISTINCT p.entity_id) as unique_people,
COUNT(DISTINCT CASE WHEN ei.role = 'sender' THEN p.entity_id END) as senders,
COUNT(DISTINCT CASE WHEN ei.role = 'recipient' THEN p.entity_id END) as recipients,
MIN(e.occurred_at) as first_contact,
MAX(e.occurred_at) as last_contact
FROM {{ ref('nexus_events') }} e
JOIN {{ ref('nexus_entity_identifiers') }} gei
ON e.event_id = gei.event_id
AND gei.entity_type = 'group'
JOIN {{ ref('nexus_entities') }} g
ON gei.identifier_value = g.domain
AND g.entity_type = 'group'
JOIN {{ ref('nexus_relationships') }} r
ON r.entity_b_id = g.entity_id
JOIN {{ ref('nexus_entities') }} p
ON r.entity_a_id = p.entity_id
AND p.entity_type = 'person'
JOIN {{ ref('nexus_entity_identifiers') }} ei
ON ei.event_id = e.event_id
AND ei.identifier_value = p.email
WHERE e.source = 'gmail'
GROUP BY g.entity_id, g.name
ORDER BY email_count DESC;
Participant Role Analysis
-- Analyze email participation patterns by role
SELECT
role,
COUNT(DISTINCT message_id) as message_count,
COUNT(DISTINCT email) as unique_participants,
COUNT(*) as total_participations
FROM {{ ref('gmail_message_participants') }}
WHERE sent_at >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY role
ORDER BY total_participations DESC;
Thread Conversation Analysis
-- Analyze email threads with participant counts
SELECT
t.thread_id,
t.subject,
t.first_message_sent_at,
t.last_message_sent_at,
COUNT(DISTINCT m.message_id) as message_count,
COUNT(DISTINCT tp.email) as unique_participants,
t.gmail_thread_ids
FROM {{ ref('gmail_threads') }} t
LEFT JOIN {{ ref('gmail_messages') }} m ON t.thread_id = m.thread_id
LEFT JOIN {{ ref('gmail_thread_participants') }} tp ON t.thread_id = tp.thread_id
WHERE t.first_message_sent_at >= CURRENT_DATE - INTERVAL 90 DAY
GROUP BY t.thread_id, t.subject, t.first_message_sent_at, t.last_message_sent_at, t.gmail_thread_ids
HAVING message_count > 1
ORDER BY t.first_message_sent_at DESC;
Cross-Account Message Tracking
-- Find messages that appear in multiple accounts
SELECT
message_id,
subject,
sent_at,
JSON_EXTRACT_SCALAR(gmail_message_ids, '$.account1@example.com') as account1_gmail_id,
JSON_EXTRACT_SCALAR(gmail_message_ids, '$.account2@example.com') as account2_gmail_id,
all_label_ids
FROM {{ ref('gmail_messages') }}
WHERE JSON_EXTRACT(gmail_message_ids, '$') IS NOT NULL
AND ARRAY_LENGTH(JSON_EXTRACT_ARRAY(JSON_OBJECT_KEYS(gmail_message_ids))) > 1
ORDER BY sent_at DESC;
Performance Considerations
Incremental Processing
For large Gmail datasets:
# dbt_project.yml
models:
nexus:
sources:
gmail:
+materialized: incremental
+unique_key: message_id
+cluster_by: ['sent_at']
Partitioning (BigQuery)
Optimize time-based queries:
models:
nexus:
sources:
gmail:
+partition_by:
{ 'field': 'sent_at', 'data_type': 'timestamp', 'granularity': 'day' }
Troubleshooting
Common Issues
1. No Gmail events appearing
- Check
nexus.gmail.enabled: trueis set - Verify source table exists with STANDARD_TABLE_SCHEMA structure
- Ensure
Message-IDheader is present in raw JSON records - Verify source data has the expected JSON structure
2. Missing email participants
- Check that
From,To,Cc,Bccheaders are not null/empty in source data - Verify email parsing macros are working correctly
- Ensure participant normalization is enabled
3. Generic domains appearing as groups
- Review the generic domain filter list
- Add additional generic domains if needed
- Verify
filter_non_generic_domains()macro is working
4. Duplicate emails across accounts
- Verify
Message-IDheader is being extracted correctly - Check that deduplication QUALIFY clause is working
- Ensure all messages have valid
Message-IDheaders
Debugging Queries
-- Check raw source data
SELECT
_raw_record,
_ingested_at,
_connection_id,
_stream_id
FROM {{ source('gmail', 'messages') }}
LIMIT 5;
-- Verify base model processing
SELECT * FROM {{ ref('gmail_messages_base') }}
LIMIT 10;
-- Check normalized messages with deduplication
SELECT
message_id,
gmail_message_id,
subject,
sent_at,
thread_id
FROM {{ ref('gmail_messages') }}
ORDER BY sent_at DESC
LIMIT 10;
-- Verify participants extraction
SELECT
message_id,
email,
role,
domain
FROM {{ ref('gmail_message_participants') }}
LIMIT 20;
-- Check Message-ID extraction
SELECT
message_id,
gmail_message_id,
message_id_header,
COUNT(*) as count
FROM {{ ref('gmail_messages') }}
GROUP BY message_id, gmail_message_id, message_id_header
HAVING count > 1;
Advanced Configuration
Custom Email Filtering
Filter specific messages or domains:
-- Filter to external emails only
SELECT * FROM {{ ref('gmail_messages') }} m
WHERE EXISTS (
SELECT 1 FROM {{ ref('gmail_message_participants') }} p
WHERE p.message_id = m.message_id
AND p.domain NOT IN ('yourcompany.com')
);
Header Extraction
Add custom header extraction:
-- Extract custom headers in normalized model
(SELECT JSON_EXTRACT_SCALAR(header, '$.value')
FROM UNNEST(JSON_EXTRACT_ARRAY(_raw_record, '$.headers')) as header
WHERE LOWER(JSON_EXTRACT_SCALAR(header, '$.name')) = 'custom-header'
LIMIT 1) as custom_header_value
Migration Guide
From Custom Gmail Models
If you have existing custom Gmail models:
- Backup Current Models: Save your existing logic
- Compare Schemas: Ensure data compatibility with STANDARD_TABLE_SCHEMA
- Verify Message-ID: Ensure your source data includes
Message-IDin headers - Enable Template Source: Configure to point to your data
- Test Output: Verify data quality, deduplication, and completeness
- Update References: Change refs to use template models
- Remove Custom Models: Clean up old source models
Schema Compatibility
Ensure your Gmail data includes:
- Message
idandthreadId internalDatefor timestampheadersarray with at leastMessage-ID,From,Toheadersbody_textfor email content- Optional:
Cc,Bccheaders for participant extraction
STANDARD_TABLE_SCHEMA Migration
If migrating from old schema with record and synced_at:
-- Example migration query
CREATE OR REPLACE TABLE `new_table` AS
SELECT
old_record as _raw_record,
synced_at as _ingested_at,
'connection_123' as _connection_id,
'default' as _stream_id,
synced_at as _sync_timestamp,
NULL as _sync_token
FROM `old_table`;
Handling Missing Message-ID
If some messages lack Message-ID headers:
-- Option 1: Use Gmail ID as fallback
COALESCE(
message_id_header,
CONCAT('gmail_', gmail_message_id)
) as message_id
-- Option 2: Filter out messages without Message-ID
WHERE message_id_header IS NOT NULL
Ready to integrate Gmail? Set nexus.gmail.enabled: true and run
dbt build!