MX Record Lookups, DNS Timeouts, and SMTP Verification: Building Robust Validation
Your Validation Pipeline Will Break. Plan for It.
You ship an email validation integration. It works perfectly in staging. Then production happens: a DNS resolver takes 12 seconds to respond, Gmail returns 250 OK for a mailbox that doesn’t exist, and a greylisted server rejects your first three connection attempts before accepting the fourth.
About 13.5% of DNS queries fail in the wild, according to a 2022 study analyzing 2.8 billion queries at Chinese ISPs (published via APNIC). MX record lookups are among the more reliable query types, but “more reliable” still means 1-3% of your lookups will time out or return garbage on any given day. Layer SMTP verification on top, and failure modes multiply fast.
This post walks through every failure point in the MX-to-SMTP validation pipeline and shows you how to handle each one. Working code in Node.js and Python. No hand-waving.
What Is an MX Record Lookup?
An MX record lookup is a DNS query that asks for a domain’s mail exchange (MX) records: the hostnames that accept mail for that domain, each tagged with a preference value. The mail sender tries the lowest preference value first, since lower numbers rank higher. For email validation, that lookup is the first real signal. No MX records (and no A/AAAA fallback) usually means the address can’t receive mail at all.
MX Lookups: What Goes Wrong
An MX lookup is a DNS query. Simple in theory. You ask for the mail exchange records of a domain, get back a list of hostnames with priority values, and connect to the server with the lowest priority number (the most preferred one). Three things break this.
Timeouts
The default DNS timeout in most resolvers is 5 seconds, with 2-3 retries. RFC 5321 specifies per-command SMTP timeouts (5 minutes for MAIL and RCPT, 10 minutes for the end of DATA), but your DNS layer times out long before that matters. In practice, MX lookups complete in 50-200ms when things are healthy. When they’re not, every attempt waits out the full 5-second timeout, so you can sit for 10-15 seconds before the resolver gives up.
That’s fine for a background job. For a signup form? Your user left.
// Node.js: MX lookup with tight timeout
const { Resolver } = require("dns").promises;

async function lookupMX(domain, timeoutMs = 3000) {
  const resolver = new Resolver();
  // Cancel the outstanding query if it runs past the budget.
  const timer = setTimeout(() => resolver.cancel(), timeoutMs);
  try {
    const records = await resolver.resolveMx(domain);
    clearTimeout(timer);
    // Lowest priority number first: that is the preferred server.
    return records.sort((a, b) => a.priority - b.priority);
  } catch (err) {
    clearTimeout(timer);
    if (err.code === "ECANCELLED" || err.code === "ETIMEOUT") {
      return { error: "timeout", domain };
    }
    if (err.code === "ENOTFOUND" || err.code === "ENODATA") {
      return { error: "no_mx", domain };
    }
    throw err;
  }
}
# Python: MX lookup with timeout
import dns.resolver
from dns.exception import Timeout, DNSException


def lookup_mx(domain: str, timeout_sec: float = 3.0) -> dict:
    resolver = dns.resolver.Resolver()
    resolver.timeout = timeout_sec   # per-nameserver wait
    resolver.lifetime = timeout_sec  # total budget for the whole query
    try:
        answers = resolver.resolve(domain, "MX")
        records = sorted(
            [{"host": str(r.exchange), "priority": r.preference} for r in answers],
            key=lambda r: r["priority"],
        )
        return {"status": "ok", "records": records}
    except Timeout:
        return {"status": "timeout", "domain": domain}
    except dns.resolver.NXDOMAIN:
        return {"status": "no_domain", "domain": domain}
    except dns.resolver.NoAnswer:
        return {"status": "no_mx", "domain": domain}
    except DNSException as e:
        return {"status": "error", "detail": str(e)}
Set your timeout to 3 seconds for real-time validation. Background jobs can afford 10-15 seconds. Never use the system default blindly. Check your resolver config.
NXDOMAIN vs No MX Records
Two different failures that look similar but mean different things. NXDOMAIN means the domain doesn’t exist at all. NODATA (or an empty MX response) means the domain exists but has no mail exchange records.
RFC 5321 Section 5.1 says to fall back to the domain’s A/AAAA record when no MX record exists. Some legitimate domains rely on this. But in production, no MX record is a strong signal the address is invalid. About 94% of domains without MX records don’t accept mail at all.
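The A/AAAA fallback can be sketched with the standard library alone. This is a rough illustration, not part of the original pipeline: `implicit_mx_addresses` is a hypothetical helper name, and `socket.getaddrinfo` goes through the system resolver (including the hosts file), so treat it as an approximation of a pure DNS A/AAAA query. The resolver function is injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List


def implicit_mx_addresses(domain: str,
                          getaddrinfo: Callable = socket.getaddrinfo) -> List[str]:
    """Return the A/AAAA addresses for `domain`, or [] if it has none.

    Meant for the case where the MX query came back empty (NODATA).
    """
    try:
        infos = getaddrinfo(domain, 25, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses: List[str] = []
    for info in infos:
        ip = info[4][0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

An empty list here, combined with an empty MX answer, is the point where the address can reasonably be treated as invalid.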
Resolver Poisoning and Stale Caches
Your application’s DNS cache can serve stale MX records for domains that recently changed providers. TTL values on MX records typically range from 300 seconds to 86,400 seconds (5 minutes to 24 hours). If a domain migrated from Google Workspace to Microsoft 365 yesterday and your cache hasn’t expired, you’re connecting to the wrong mail server.
Don’t cache MX results in your application layer beyond the DNS TTL. Let your resolver handle it.
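If you do end up holding MX answers in process (for example, to deduplicate lookups inside one batch), a small cache that expires entries at the record's TTL keeps you within that rule. The sketch below is an assumption-laden illustration: `MxTtlCache` is a made-up name, and the TTL is whatever your DNS library reports for the answer (in dnspython that should be available as `answers.rrset.ttl`).

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class MxTtlCache:
    """Hold MX answers no longer than the TTL the DNS response carried."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # domain -> (expiry timestamp, MX hostnames)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def put(self, domain: str, hosts: List[str], ttl_seconds: float) -> None:
        self._entries[domain] = (self._clock() + ttl_seconds, hosts)

    def get(self, domain: str) -> Optional[List[str]]:
        entry = self._entries.get(domain)
        if entry is None:
            return None
        expires_at, hosts = entry
        if self._clock() >= expires_at:
            # TTL elapsed: drop the entry so the caller re-resolves.
            del self._entries[domain]
            return None
        return hosts
```

A `None` from `get` means "look it up again", which is what forces a fresh answer after a provider migration.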
SMTP Verification: Where Servers Lie
MX lookups tell you where to connect. SMTP verification asks the server: “Will you accept mail for this recipient?” Sounds straightforward. It isn’t.
The EHLO Problem
Your SMTP handshake starts with EHLO (or HELO for older servers). The argument is supposed to be a valid hostname identifying the connecting client. Some mail servers check this. Microsoft 365 validates that the EHLO hostname resolves in DNS. If your verification service sends EHLO check.example.com and that hostname doesn’t have an A record, the server may reject the connection with 550 5.7.1.
// Node.js: SMTP RCPT TO check with proper EHLO
const net = require("net");

function smtpVerify(email, mxHost, options = {}) {
  const { timeout = 10000, ehloHost = "verify.truemail.io" } = options;
  return new Promise((resolve) => {
    const socket = net.createConnection(25, mxHost);
    socket.setTimeout(timeout);
    let step = "connect";
    let buffer = "";

    socket.on("data", (data) => {
      if (step === "done") return; // ignore the reply to QUIT
      buffer += data.toString();
      // Replies can span several lines ("250-..." continues, "250 ..." ends).
      // Wait until the final line has fully arrived before acting on the code.
      if (!buffer.endsWith("\r\n")) return;
      const lines = buffer.trimEnd().split("\r\n");
      const last = lines[lines.length - 1];
      if (!/^\d{3}(?!-)/.test(last)) return;
      const code = parseInt(last.substring(0, 3), 10);
      buffer = "";

      if (step === "connect" && code === 220) {
        step = "ehlo";
        socket.write(`EHLO ${ehloHost}\r\n`);
      } else if (step === "ehlo" && code === 250) {
        step = "mail";
        socket.write(`MAIL FROM:<check@${ehloHost}>\r\n`);
      } else if (step === "mail" && code === 250) {
        step = "rcpt";
        socket.write(`RCPT TO:<${email}>\r\n`);
      } else if (step === "rcpt") {
        step = "done";
        socket.end("QUIT\r\n");
        resolve({
          email,
          deliverable: code === 250,
          code,
          catchAll: false,
        });
      } else {
        // Unexpected code at any earlier step: give up politely.
        step = "done";
        socket.end("QUIT\r\n");
        resolve({ email, deliverable: false, code, error: "rejected" });
      }
    });

    socket.on("timeout", () => {
      socket.destroy();
      resolve({ email, deliverable: null, error: "timeout" });
    });

    socket.on("error", (err) => {
      resolve({ email, deliverable: null, error: err.message });
    });
  });
}
Catch-All Domains: 250 OK for Everything
Gmail’s consumer service returns 250 OK for every RCPT TO command. Doesn’t matter if the mailbox exists. An address nobody ever registered? 250 OK. This isn’t a bug. It’s a deliberate anti-harvesting measure.
Catch-all domains represent 15-28% of B2B domains. Against these servers, your SMTP RCPT TO check provides zero additional signal beyond what MX validation already gave you. The MX vs SMTP validation accuracy breakdown covers the full impact on verification rates.
How do you detect a catch-all? Probe with an address that almost certainly doesn’t exist:
import uuid


def detect_catch_all(domain: str, mx_host: str) -> bool:
    """Send RCPT TO for a random address. If 250, it's catch-all."""
    fake_address = f"{uuid.uuid4().hex[:16]}@{domain}"
    result = smtp_verify(fake_address, mx_host, timeout=10)
    return result.get("code") == 250
If the server accepts a random UUID address, it accepts everything. Mark the domain as catch-all and fall back to heuristic scoring instead of trusting the SMTP response.
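The Python snippets in this post call `smtp_verify`, which is only shown in Node.js. A rough Python counterpart, built on the standard library's `smtplib`, might look like the following. Treat it as a sketch under assumptions: the `verify.example.com` EHLO host is a placeholder, the returned dict shape is chosen to match how the other Python snippets read it (`code`, `error`), and `smtp_factory` is an illustrative hook so the function can be exercised without a live mail server.

```python
import smtplib
import socket
from typing import Callable


def smtp_verify(email: str, mx_host: str, timeout: float = 10.0,
                ehlo_host: str = "verify.example.com",
                smtp_factory: Callable = smtplib.SMTP) -> dict:
    """Ask mx_host whether it would accept mail for `email`, without sending any."""
    try:
        # Connecting also reads the 220 greeting; a bad greeting raises SMTPConnectError.
        client = smtp_factory(mx_host, 25, local_hostname=ehlo_host, timeout=timeout)
    except (socket.timeout, TimeoutError):
        return {"email": email, "code": None, "error": "timeout"}
    except (OSError, smtplib.SMTPException) as exc:
        return {"email": email, "code": None, "error": str(exc)}

    try:
        code, _ = client.ehlo()
        if code != 250:
            return {"email": email, "code": code, "error": "rejected"}
        code, _ = client.mail(f"check@{ehlo_host}")
        if code != 250:
            return {"email": email, "code": code, "error": "rejected"}
        # Stop after RCPT TO: no DATA is ever sent.
        code, _ = client.rcpt(email)
        return {"email": email, "code": code, "deliverable": code == 250}
    except (socket.timeout, TimeoutError):
        return {"email": email, "code": None, "error": "timeout"}
    except (OSError, smtplib.SMTPException) as exc:
        return {"email": email, "code": None, "error": str(exc)}
    finally:
        try:
            client.quit()
        except (OSError, smtplib.SMTPException):
            pass  # the connection may already be gone
```

Letting `smtplib` parse replies avoids hand-rolling the multi-line response handling that a raw socket needs.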
Greylisting: Come Back Later
Greylisting servers reject the first connection from an unknown sender with a 450 temporary failure code. The server remembers the (IP, sender, recipient) triplet and expects you to retry after a delay. RFC 6647 documents this behavior. Default delay in popular implementations like Postgrey: 5 minutes. RFC 5321 recommends waiting at least 30 minutes before retrying.
For real-time validation? Greylisting kills you. Your user isn’t waiting 5 minutes for a signup form to respond.
For batch validation, build retry logic that handles 450 responses:
// Retry with exponential backoff for greylisted servers
async function verifyWithRetry(email, mxHost, maxRetries = 3) {
  let lastResult;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    lastResult = await smtpVerify(email, mxHost, { timeout: 10000 });
    // Anything other than a temporary failure is a final answer.
    if (lastResult.code !== 450 && lastResult.code !== 421) {
      return lastResult;
    }
    if (attempt < maxRetries) {
      const baseDelay = 60000; // 1 minute
      const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 5000;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  return { ...lastResult, status: "unknown", reason: "greylisted" };
}
Starting at 1 minute and doubling each retry (with jitter) covers most greylisting implementations. Three retries span about 7 minutes total, which clears the 5-minute Postgrey default. Note that this is well short of the 30-minute retry interval RFC 5321 recommends, so a server that enforces a longer greylisting delay will still come back as unknown.
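The arithmetic behind that retry window is easy to check. The helper below is a hypothetical illustration (`backoff_delays` is not part of the code above) and leaves out the random jitter and the time each SMTP attempt itself takes.

```python
from typing import List


def backoff_delays(base_seconds: float = 60.0, max_retries: int = 3) -> List[float]:
    """Delay before each retry, jitter excluded: base * 2**attempt."""
    return [base_seconds * 2 ** attempt for attempt in range(max_retries)]


elapsed = 0.0
for n, delay in enumerate(backoff_delays(), start=1):
    elapsed += delay
    print(f"retry {n}: wait {delay:.0f}s, {elapsed / 60:.0f} min after the first attempt")
# retry 1: wait 60s, 1 min after the first attempt
# retry 2: wait 120s, 3 min after the first attempt
# retry 3: wait 240s, 7 min after the first attempt
```

Only the third retry lands after a 5-minute greylisting window, so lowering the retry count to 2 would miss the Postgrey default.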
Rate Limiting: When Mail Servers Push Back
Hit a mail server too hard and you’ll get 421 Service not available or 452 Too many connections. Microsoft 365 limits SMTP connections to 3 concurrent per mailbox and 30 messages per minute. Gmail detects verification patterns and starts returning false 250 OK responses to poison your results.
The fix is a rate limiter per destination domain, not per total throughput.
import time
from collections import defaultdict
from threading import Lock


class DomainRateLimiter:
    """Rate limit SMTP connections per destination domain."""

    def __init__(self, max_per_second: float = 2.0):
        self.max_per_second = max_per_second
        self.min_interval = 1.0 / max_per_second
        self.next_slot = defaultdict(float)  # domain -> earliest allowed start time
        self.lock = Lock()

    def wait(self, domain: str):
        # Reserve this caller's slot under the lock, then sleep outside it,
        # so a wait on one domain never stalls callers for other domains.
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot[domain])
            self.next_slot[domain] = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


rate_limiter = DomainRateLimiter(max_per_second=2.0)


def rate_limited_verify(email: str, mx_host: str, domain: str) -> dict:
    rate_limiter.wait(domain)
    return smtp_verify(email, mx_host, timeout=10)
Two connections per second per domain is conservative enough for most mail servers. You can push higher against large providers (Google, Microsoft) but watch for 421 responses and back off dynamically.
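One way to back off dynamically is to halve a domain's allowed rate on every 421 and let it recover slowly on other responses. The policy and the numbers below are illustrative assumptions, not values from any provider's documentation, and `AdaptiveRate` is a hypothetical name.

```python
class AdaptiveRate:
    """Connections-per-second budget for one domain that reacts to 421 responses."""

    def __init__(self, start: float = 2.0, floor: float = 0.1, ceiling: float = 2.0):
        self.rate = start
        self.floor = floor
        self.ceiling = ceiling

    def record(self, smtp_code: int) -> float:
        if smtp_code == 421:
            # Server is pushing back: cut the rate in half, but never to zero.
            self.rate = max(self.floor, self.rate / 2)
        else:
            # Recover gradually so a single good reply does not undo the backoff.
            self.rate = min(self.ceiling, self.rate + 0.1)
        return self.rate
```

Keeping one `AdaptiveRate` per domain and feeding its `rate` into the limiter's interval would tie the two together.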
Circuit Breakers: Stop Hammering Dead Servers
A mail server goes down. Your validation pipeline keeps connecting, timing out after 10 seconds each time, and retrying. Meanwhile, your background queue backs up with thousands of jobs all waiting on the same dead server.
A circuit breaker stops this. After N consecutive failures to a domain, the breaker opens and short-circuits further attempts for a cooldown period.
// Simple circuit breaker per domain
class DomainCircuitBreaker {
  constructor(failureThreshold = 5, cooldownMs = 300000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.domains = new Map();
  }

  isOpen(domain) {
    const state = this.domains.get(domain);
    if (!state) return false;
    if (state.failures < this.failureThreshold) return false;
    // Cooldown elapsed: forget the failures and allow traffic again.
    if (Date.now() - state.lastFailure > this.cooldownMs) {
      this.domains.delete(domain);
      return false;
    }
    return true;
  }

  recordFailure(domain) {
    const state = this.domains.get(domain) || { failures: 0 };
    state.failures += 1;
    state.lastFailure = Date.now();
    this.domains.set(domain, state);
  }

  recordSuccess(domain) {
    this.domains.delete(domain);
  }
}

const breaker = new DomainCircuitBreaker(5, 300000); // 5 failures, 5 min cooldown

async function verifyWithBreaker(email, mxHost, domain) {
  if (breaker.isOpen(domain)) {
    return { email, status: "unknown", reason: "circuit_open" };
  }
  const result = await smtpVerify(email, mxHost, { timeout: 10000 });
  if (result.error === "timeout" || result.error?.includes("ECONNREFUSED")) {
    breaker.recordFailure(domain);
  } else {
    breaker.recordSuccess(domain);
  }
  return result;
}
Five consecutive timeouts to mail.example.com? Stop trying for 5 minutes. Return unknown and let your system handle it downstream. This prevents one flaky domain from clogging your entire validation queue.
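For the Python side of the pipeline, the same breaker can be ported almost line for line. This is a sketch that mirrors the Node.js class with snake_case method names. The injectable `clock` parameter is an addition for testability and is not in the original.

```python
import time
from typing import Callable, Dict, Tuple


class DomainCircuitBreaker:
    """Open after N consecutive failures to a domain, close after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_sec: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec
        self._clock = clock
        # domain -> (consecutive failures, time of last failure)
        self._state: Dict[str, Tuple[int, float]] = {}

    def is_open(self, domain: str) -> bool:
        state = self._state.get(domain)
        if state is None or state[0] < self.failure_threshold:
            return False
        if self._clock() - state[1] > self.cooldown_sec:
            # Cooldown elapsed: forget the failures and allow traffic again.
            del self._state[domain]
            return False
        return True

    def record_failure(self, domain: str) -> None:
        failures, _ = self._state.get(domain, (0, 0.0))
        self._state[domain] = (failures + 1, self._clock())

    def record_success(self, domain: str) -> None:
        self._state.pop(domain, None)


breaker = DomainCircuitBreaker(failure_threshold=5, cooldown_sec=300.0)
```

Like the Node.js version, this closes fully after the cooldown instead of using a half-open probe state, which keeps it simple at the cost of one extra burst of failures if the server is still down.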
Putting It All Together: The Resilient Pipeline
Individual fixes aren’t enough. You need them composed into a pipeline that degrades gracefully. Here’s the full flow:
- Parse and syntax-check the email address (under 1ms)
- MX lookup with 3-second timeout
- If MX lookup times out, return unknown (don’t block)
- If no MX records, check A/AAAA fallback, then return invalid if nothing
- Check circuit breaker for the target domain
- If breaker is open, return unknown with the MX-validated status
- SMTP RCPT TO with 10-second timeout
- If 450 response, queue for retry (greylisting)
- If 250 on a known catch-all domain, return catch_all instead of deliverable
- Record success/failure in the circuit breaker
The key insight: when SMTP fails, fall back to MX-only validation rather than returning invalid. An address on a domain with valid MX records is more likely deliverable than not. You just couldn’t confirm it. Mark it unknown and let the caller decide how to handle uncertainty.
def validate_email(email: str, mode: str = "full") -> dict:
    # Full syntax checking is assumed to have happened upstream; this only
    # guards the split below.
    if "@" not in email:
        return {"email": email, "status": "invalid", "reason": "syntax"}
    local, domain = email.rsplit("@", 1)

    # Step 1: MX lookup
    mx_result = lookup_mx(domain, timeout_sec=3.0)
    if mx_result["status"] == "no_domain":
        return {"email": email, "status": "invalid", "reason": "domain_not_found"}
    if mx_result["status"] == "no_mx":
        # The A/AAAA fallback check is left out of this sketch.
        return {"email": email, "status": "invalid", "reason": "no_mx_records"}
    if mx_result["status"] == "timeout":
        return {"email": email, "status": "unknown", "reason": "dns_timeout"}
    if mx_result["status"] != "ok":
        # Any other resolver error carries no "records" key.
        return {"email": email, "status": "unknown", "reason": "dns_error"}
    mx_host = mx_result["records"][0]["host"]
    if mode == "mx_only":
        # MX-only mode must not open any SMTP connection.
        return {"email": email, "status": "mx_valid", "mx_host": mx_host}

    # Step 2: circuit breaker, checked before any SMTP traffic.
    # `breaker` is assumed to be a Python circuit breaker exposing
    # is_open / record_failure / record_success.
    if breaker.is_open(domain):
        return {"email": email, "status": "unknown", "reason": "circuit_open"}
    if detect_catch_all(domain, mx_host):
        return {"email": email, "status": "catch_all", "mx_host": mx_host}

    # Step 3: SMTP verification
    result = smtp_verify(email, mx_host, timeout=10)
    if result.get("error") == "timeout":
        breaker.record_failure(domain)
        return {"email": email, "status": "unknown", "reason": "smtp_timeout"}
    if result.get("code") == 450:
        return {"email": email, "status": "retry", "reason": "greylisted"}
    breaker.record_success(domain)
    deliverable = result.get("code") == 250
    return {"email": email, "status": "deliverable" if deliverable else "undeliverable"}
This is what a production email validation API handles for you under the hood. Every call to MailCop.validate() runs through this kind of pipeline (plus proprietary layers for catch-all scoring, disposable detection, and historical bounce data). Understanding the internals helps you design around edge cases even if you’re not building the validation layer yourself.
Real-Time vs Batch: Different Failure Budgets
The pipeline above works differently depending on context. Real-time and bulk validation diverge sharply in their timeout budgets and retry strategies.
Real-time (signup forms): 3-second DNS timeout. No SMTP. No retries. Fail open. Mark unresolvable addresses as pending and verify in the background.
Batch (list cleaning): 10-15 second DNS timeout. Full SMTP with retry on 450. Circuit breakers with 5-minute cooldowns. Process overnight. Use webhook callbacks to notify your system when results are ready.
The same validation logic, tuned for two very different latency constraints. Don’t use batch settings on your signup form. Don’t use real-time timeouts on your list cleaner.
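One way to keep the two modes from bleeding into each other is to make the budgets explicit configuration. The sketch below encodes the numbers from this section. The `ValidationProfile` type and its field names are hypothetical, and the batch DNS timeout uses the upper end of the 10-15 second range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidationProfile:
    dns_timeout_sec: float
    use_smtp: bool
    greylist_retries: int
    breaker_cooldown_sec: Optional[float]  # None when SMTP is never attempted


# Signup forms: DNS only, no SMTP, no retries; unresolved addresses go to a background job.
REALTIME = ValidationProfile(
    dns_timeout_sec=3.0, use_smtp=False, greylist_retries=0, breaker_cooldown_sec=None
)

# List cleaning: generous DNS budget, full SMTP, retry on 450, 5-minute breaker cooldown.
BATCH = ValidationProfile(
    dns_timeout_sec=15.0, use_smtp=True, greylist_retries=3, breaker_cooldown_sec=300.0
)
```

Freezing the dataclass means a caller cannot quietly loosen the real-time budget at runtime.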
Monitoring: What to Track
Four numbers tell you if your validation pipeline is healthy.
DNS timeout rate. Track the percentage of MX lookups that time out per hour. Healthy: under 1%. Concerning: 1-3%. Broken: above 5%. A spike usually means an upstream resolver issue or a batch hitting domains with misconfigured nameservers.
SMTP timeout rate by domain. Some domains are consistently slow. Track per-domain SMTP response times and flag domains that regularly exceed 5 seconds. These are candidates for MX-only fallback.
Circuit breaker open rate. How often are breakers tripping? If the same domains keep opening circuits, consider permanently downgrading them to MX-only validation.
Unknown rate overall. What percentage of addresses come back as unknown? That’s your confidence gap. Below 5% is solid. Above 10% means too many failures are going unresolved. Dig into whether it’s DNS, SMTP, or greylisting driving the unknowns.
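Two of these numbers reduce to simple ratios over counters you probably already have. The following is a minimal sketch with hypothetical names (`PipelineCounters`, `health_flags`). It applies only the thresholds stated above and deliberately does not label the bands the text leaves open.

```python
from dataclasses import dataclass


@dataclass
class PipelineCounters:
    mx_lookups: int = 0
    mx_timeouts: int = 0
    validations: int = 0
    unknown_results: int = 0


def health_flags(c: PipelineCounters) -> dict:
    """Compute DNS timeout rate and unknown rate, and compare them to the stated thresholds."""
    dns_rate = c.mx_timeouts / c.mx_lookups if c.mx_lookups else 0.0
    unknown_rate = c.unknown_results / c.validations if c.validations else 0.0
    return {
        "dns_timeout_rate": dns_rate,
        "dns_healthy": dns_rate < 0.01,           # under 1%
        "dns_broken": dns_rate > 0.05,            # above 5%
        "unknown_rate": unknown_rate,
        "unknown_solid": unknown_rate < 0.05,     # below 5%
        "unknown_excessive": unknown_rate > 0.10, # above 10%
    }
```

Per-domain SMTP latency and breaker open counts need keyed storage and are left out of this sketch.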
Stop Trusting the Happy Path
DNS resolves fast and SMTP servers tell the truth. That’s the happy path, and it covers about 85% of validations. The other 15% is where your pipeline either degrades gracefully or falls apart.
Build for the 15%. Timeout everything. Retry with backoff. Break circuits before they break your queue. Fall back to MX-only when SMTP fails instead of returning false negatives. And track your unknown rate, because that number tells you how much reality diverges from your assumptions.
The difference between a validation system that works and one that works reliably is how it handles failure. Every production system fails. Yours should fail well.