Internationalized Email Addresses: How to Validate Unicode Emails

hangrydev

Your Regex Just Rejected a Legitimate User

A Chinese user tries to sign up with 用户@例え.jp. Your form returns “invalid email.” The address is perfectly real. The user leaves.

That’s the problem with Unicode email addresses in one scenario. Email Address Internationalization (EAI) and Internationalized Domain Names (IDN) are full standards, backed by RFCs, and increasingly adopted by major providers. But most validation code was written in the early 2000s ASCII world and never updated, so it rejects Unicode email addresses outright.

The gap between what the standards allow and what your validator actually accepts is probably larger than you think.

What the Standards Actually Allow

The core email specs (RFC 5321 and RFC 5322, like RFC 821 and RFC 822 before them) define email addresses as ASCII-only. That rules out billions of people whose native scripts use non-ASCII characters.

RFC 6530, 6531, and 6532 changed that. Together they define the EAI (Email Address Internationalization) framework, which permits UTF-8 characters in both the local part and the domain of an email address.

So these are valid addresses under current standards:

  • 用户@例え.jp (Chinese/Japanese)
  • почта@домен.рф (Russian Cyrillic)
  • user@münchen.de (German with umlaut)
  • علي@مثال.إختبار (Arabic)
  • 사용자@도메인.한국 (Korean)

The domain side has been handled separately through IDN (RFC 5891, 5892). IDN domains get converted to Punycode (RFC 3492), a way of encoding non-ASCII domain names into ASCII-compatible encoding (ACE). münchen.de becomes xn--mnchen-3ya.de at the DNS layer. That part works well and has broad support.
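
To make the conversion concrete, here is a minimal sketch using Python’s built-in idna codec. One caveat: that codec implements the older IDNA 2003 rules rather than the IDNA 2008 RFCs cited above, so treat it as an illustration of the encoding, not a complete IDNA 2008 implementation. The helper names are made up for this example.

```python
def to_punycode(domain: str) -> str:
    """Encode a Unicode domain into its ASCII-compatible (ACE) form."""
    # The built-in "idna" codec follows IDNA 2003; it raises UnicodeError
    # for labels it cannot encode.
    return domain.encode("idna").decode("ascii")


def from_punycode(ascii_domain: str) -> str:
    """Decode an ACE domain back to Unicode for display."""
    return ascii_domain.encode("ascii").decode("idna")


ascii_domain = to_punycode("münchen.de")
print(ascii_domain)                 # xn--mnchen-3ya.de
print(from_punycode(ascii_domain))  # münchen.de
```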

The local part (everything before the @) is the harder problem. UTF-8 local parts require SMTPUTF8 support from every server in the delivery chain. That’s the RFC 6531 extension, and adoption is patchy.

How Do You Validate Unicode Email Addresses?

To validate Unicode email addresses, parse the local part and domain separately, NFC-normalize the local part, and convert the domain to Punycode for DNS lookups. Use an EAI-aware validation library instead of an ASCII-only regex, which rejects valid addresses outright. For delivery, check whether the destination server advertises SMTPUTF8 before you trust a UTF-8 local part. The rest of this post walks through each layer.

The Two Parts Have Very Different Support

Understanding EAI means treating the local part and domain part separately.

IDN domains (the part after @): Well-supported. The conversion to Punycode happens in the application or mail library before the DNS query is sent, and most mail libraries do it for you. user@münchen.de resolves via DNS just fine because the domain is looked up as xn--mnchen-3ya.de under the hood. Your existing MX lookup code may already handle IDN domains; test it with a Unicode domain to be sure.

UTF-8 local parts (the part before @): Still limited. Delivery requires the SMTPUTF8 SMTP extension at every hop. The sending server, receiving server, and any relays in between must all support it. Gmail added SMTPUTF8 support in 2014, but many enterprise mail servers and hosting providers haven’t followed.

Why does this matter? An address like 用户@gmail.com can only be delivered if Gmail’s servers accept SMTPUTF8 connections and the sending server announces the extension. An address like user@用户.com has much better delivery odds regardless: its local part is ASCII and its domain converts to Punycode, so no hop needs SMTPUTF8.

What Your Validation Code Is Probably Doing Wrong

Most email validation libraries reject non-ASCII by default. That’s not a bug from their perspective. It was correct behavior for the original standards.

The issue is that those libraries haven’t been updated for EAI, so they silently reject valid international addresses without any indication of why.

Common failure modes:

Regex-based validators almost always reject non-ASCII local parts. A pattern like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} won’t match 用户@例え.jp at all.
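
A quick way to see this is to run the pattern above against a few of the addresses from earlier. This is a small sketch; passes_ascii_regex is a name made up for the example.

```python
import re

# The ASCII-only pattern quoted above
ASCII_ONLY = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def passes_ascii_regex(address: str) -> bool:
    """Return True only if the whole address matches the ASCII pattern."""
    return ASCII_ONLY.fullmatch(address) is not None


for address in ["user@example.com", "用户@例え.jp", "почта@домен.рф", "user@münchen.de"]:
    print(address, passes_ascii_regex(address))
```

Only user@example.com passes. Note that user@münchen.de fails too, even though its local part is plain ASCII.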

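Library validators without EAI support vary. Python’s built-in email.utils.parseaddr only splits an address into a display name and an address; it performs no validation, so it can’t stand in for a validator. JavaScript’s native checkValidity() on an <input type="email"> rejects addresses with non-ASCII local parts, because the HTML spec’s definition of a valid email address is ASCII-only.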

DNS-based validators only query the domain, so IDN domains work once they are converted to Punycode. They can still fail on non-ASCII local parts if the address is parsed or encoded as ASCII before the domain is extracted.

The email validation API guide covers the full validation stack. For international addresses, the practical answer is to use a validation service that explicitly supports EAI rather than trying to patch your local library.

Normalization: The Hidden Complexity

Even when a validator accepts Unicode, normalization trips things up.

Unicode has multiple ways to represent the same character. The letter é can be a single code point (U+00E9, precomposed) or two code points (U+0065 + U+0301, e followed by a combining accent). These look identical on screen. To your string comparison logic, they’re different.

This matters for email addresses because a user might register with one normalization form and try to log in with another form that their OS or keyboard produced. Your duplicate detection breaks. Your login lookup fails.

The fix is NFC normalization (Canonical Decomposition, followed by Canonical Composition). Apply NFC to the local part before any storage, comparison, or validation. In Python, that’s unicodedata.normalize('NFC', local_part). In JavaScript, localPart.normalize('NFC').
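
Here is a short sketch of the mismatch and the fix, using only the standard library. The sample strings are illustrative.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (U+0301)


def nfc(text: str) -> str:
    """Normalize text to NFC so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


print(precomposed == decomposed)            # False: different code points
print(nfc(precomposed) == nfc(decomposed))  # True: same NFC form
```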

Do this before you validate. Do this before you store. Be consistent.

Homograph Attacks Are a Real Risk

This one’s worth understanding before you allow Unicode in email addresses.

Cyrillic а (U+0430) and Latin a (U+0061) look identical in most fonts. So do several other Greek, Cyrillic, and Latin characters. An attacker can register an address whose first character is the Cyrillic а, producing an address visually indistinguishable from the all-Latin original.

The same attack applies to IDN domains. pаypal.com with a Cyrillic а can be registered as a distinct domain from paypal.com. Browsers now fall back to displaying the raw Punycode form for many mixed-script domains. Most email clients don’t have equivalent protections.

Practical mitigations:

  • Restrict to single-script local parts. An address shouldn’t mix Cyrillic and Latin characters unless you have a specific reason to allow it.
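  • Apply confusables detection. Unicode maintains a list of visually similar characters (confusables.txt, part of UTS #39). Python’s built-in unicodedata module doesn’t ship that list, but its character names can help identify a character’s script, and third-party confusables packages can flag lookalikes.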
  • For high-value flows (admin accounts, billing emails), restrict to ASCII-only. The UX cost is worth the security benefit.

This doesn’t mean reject all Unicode. It means validate what’s in the address, not just whether it parses.
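
As a rough illustration of the single-script check, the sketch below approximates a character’s script from the first word of its Unicode character name. That is a heuristic, not the real Unicode Script property, and both function names are made up for this example. Some languages mix scripts legitimately (Japanese combines kanji, hiragana, and katakana), so treat a positive result as a signal for review rather than an automatic rejection.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Approximate the set of scripts used by the letters in text.

    Heuristic: the first word of a character's Unicode name
    ("LATIN", "CYRILLIC", "CJK", ...). Digits and punctuation are skipped.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def is_mixed_script(local_part: str) -> bool:
    """Flag local parts whose letters come from more than one script."""
    return len(scripts_in(local_part)) > 1


print(is_mixed_script("admin"))        # False: all Latin
print(is_mixed_script("\u0430dmin"))   # True: Cyrillic а plus Latin letters
```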

Validating EAI Addresses in Practice

Here’s how to handle internationalized addresses at each layer of the validation stack.

Syntax Validation

Don’t use a pure-ASCII regex. Use a library that understands the EAI spec or write one that covers it.

# Python - using the email-validator library with EAI support
from email_validator import validate_email, EmailNotValidError

try:
    # check_deliverability=False for syntax-only validation
    validated = validate_email("用户@例え.jp", check_deliverability=False, allow_smtputf8=True)
    normalized = validated.normalized  # NFC-normalized address
    local = validated.local_part
    domain = validated.ascii_domain  # Punycode domain for DNS
    print(f"Valid: {normalized}")
except EmailNotValidError as e:
    print(f"Invalid: {e}")

// JavaScript - no native EAI support, use a library
// Most JS email validators reject EAI by default
// Check your library's docs for unicode/EAI flags

function isValidEmail(email) {
  // NFC normalize before anything else
  const normalized = email.normalize('NFC');
  // Then validate with an EAI-aware library.
  // validateWithEAISupport is a placeholder for that library's check, not a real API.
  return validateWithEAISupport(normalized);
}

DNS and MX Validation

The domain side works fine with standard MX lookups once you convert to Punycode. Most DNS libraries handle this automatically.

import dns.exception
import dns.resolver

def check_mx_records(domain):
    # Convert IDN domain to ASCII for DNS lookup.
    # Note: Python's built-in 'idna' codec implements IDNA 2003, not IDNA 2008.
    try:
        ascii_domain = domain.encode('idna').decode('ascii')
    except UnicodeError:
        return False

    try:
        records = dns.resolver.resolve(ascii_domain, 'MX')
        return len(records) > 0
    except dns.exception.DNSException:
        # Covers NXDOMAIN, no MX answer, and timeouts
        return False

Standard MX and SMTP verification covers the full picture here. For EAI specifically, the domain side is the easy part.

SMTP Verification

Verifying a UTF-8 local part via SMTP requires the SMTPUTF8 extension. The server advertises it in the EHLO response, and you include SMTPUTF8 as a parameter in the MAIL FROM command when sending.

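import smtplib

def verify_eai_smtp(email, mx_host):
    # mx_host is a mail exchanger taken from the MX lookup, not the bare domain
    with smtplib.SMTP(mx_host, timeout=10) as smtp:
        smtp.ehlo()
        if not smtp.has_extn('SMTPUTF8'):
            # Server doesn't support EAI delivery
            # Address syntax may be valid but delivery is unknown
            return "eai_unsupported"

        # Declaring SMTPUTF8 on MAIL FROM also lets smtplib send a UTF-8 recipient
        smtp.mail('', options=['SMTPUTF8'])
        code, _ = smtp.rcpt(email)
        # 250 and 251 both mean the recipient was accepted
        return "valid" if code in (250, 251) else "invalid"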

If the server doesn’t advertise SMTPUTF8, you can’t verify the local part via SMTP. The address might be syntactically valid but the delivery status is unknown. That’s the honest answer.

Handling EAI in Django and Next.js

If you’re building forms with Django, the built-in EmailValidator doesn’t handle EAI by default. The Django email validation guide shows how to integrate a proper validation API. For EAI specifically, you’d extend the validator or bypass it for international addresses and pass them to an API endpoint instead.

In Next.js with server actions, Zod’s email validator rejects EAI addresses. The Next.js email validation guide covers integrating MailCop for server-side validation. The pattern is the same: replace the regex check with an API call that handles EAI.

// Next.js server action - EAI-aware validation
"use server";

export async function validateEmail(email: string) {
  // Don't rely on Zod's built-in email() for EAI addresses
  const response = await fetch(
    `https://mailcop.net/api/v1/validate?email=${encodeURIComponent(email)}`,
    { headers: { Authorization: `Bearer ${process.env.TRUEMAIL_API_KEY}` } }
  );
  if (!response.ok) {
    // Surface API failures instead of treating an error body as a result
    throw new Error(`Validation request failed: ${response.status}`);
  }
  return response.json();
}

Don’t Over-Filter on Disposable Domain Lists

One thing to watch for: disposable email blocklists sometimes flag international domains incorrectly. A .рф TLD domain isn’t inherently disposable, but some blocklists treat unfamiliar TLDs as suspicious.

If you’re using a disposable email blocklist, verify that it handles IDN TLDs correctly. Some lists encode Punycode, some use Unicode. A mismatch means legitimate international addresses get blocked as disposable.
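
One way to avoid that mismatch is to canonicalize both sides to the Punycode form before comparing. The sketch below assumes the blocklist is a plain set of domain strings; the helper names and the sample entry are placeholders, and the built-in idna codec it relies on follows IDNA 2003.

```python
def to_ascii_domain(domain: str) -> str:
    """Canonical form for comparison: trimmed, lowercased, Punycode-encoded."""
    return domain.strip().lower().encode("idna").decode("ascii")


def is_blocked(domain: str, blocklist: set) -> bool:
    """Compare in one canonical encoding, whichever form each side uses."""
    canonical = {to_ascii_domain(entry) for entry in blocklist}
    return to_ascii_domain(domain) in canonical


# Placeholder entry stored in Unicode form; the lookups arrive in other forms
blocklist = {"münchen.de"}
print(is_blocked("xn--mnchen-3ya.de", blocklist))  # True
print(is_blocked("MÜNCHEN.DE", blocklist))         # True
print(is_blocked("домен.рф", blocklist))           # False
```

In a real system you would canonicalize the blocklist once at load time rather than on every lookup.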

What Actually Works in Production

Given where adoption sits today, here’s the practical breakdown:

IDN domains with ASCII local parts (user@münchen.de, user@例え.jp): Handle these. Punycode conversion is solid, MX lookups work, and SMTP delivery is normal. There’s no excuse for rejecting these.

UTF-8 local parts, with ASCII or IDN domains (用户@gmail.com, 用户@例え.jp): Accept these at the form level if they pass syntax validation. For SMTP verification, check whether the destination server advertises SMTPUTF8 before making claims about deliverability. If it doesn’t, the honest status is “unverified” rather than “invalid.”

What should you tell the user when you can’t verify? Don’t say “invalid.” Say nothing, or surface a soft warning that delivery can’t be confirmed. Rejecting a valid address is worse than sending to an unverified one.
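
As a sketch of that policy, the function below maps a verification status to a form decision. The status strings mirror the ones returned by the SMTP example earlier; the function name and the message wording are assumptions for illustration.

```python
from typing import Optional, Tuple


def signup_decision(status: str) -> Tuple[bool, Optional[str]]:
    """Return (accept, soft_warning) for a verification status."""
    if status == "invalid":
        # The server explicitly refused the recipient
        return False, "The mail server rejected this address."
    if status == "eai_unsupported":
        # Unverified is not invalid: accept, optionally with a soft warning
        return True, "We couldn't confirm delivery to this address."
    return True, None


print(signup_decision("eai_unsupported"))
```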

Storage: Store the original UTF-8 form after NFC normalization. Also store the Punycode domain separately for DNS lookups. Don’t coerce to ASCII.

Display: Show users their address in the form they entered it, not Punycode. 用户@例え.jp in the UI, xn--r8jz45g.jp in the DNS query.

Rate limits and deduplication: Normalize to NFC before comparing. Two addresses that look identical should map to the same normalized form.
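
The storage and deduplication points can be sketched together. This example assumes you keep the NFC-normalized address for display and comparison plus the Punycode domain for DNS. StoredEmail and to_stored are hypothetical names, and lowercasing the domain is an added assumption based on domain names being case-insensitive. The local part is left case-sensitive.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredEmail:
    address: str       # NFC-normalized UTF-8 form, shown in the UI
    ascii_domain: str  # Punycode form, used for DNS lookups


def to_stored(address: str) -> StoredEmail:
    """Normalize an already-validated address for storage and comparison."""
    local, _, domain = address.rpartition("@")
    local = unicodedata.normalize("NFC", local)
    domain = unicodedata.normalize("NFC", domain).lower()
    # Built-in codec follows IDNA 2003
    ascii_domain = domain.encode("idna").decode("ascii")
    return StoredEmail(f"{local}@{domain}", ascii_domain)


a = to_stored("cafe\u0301@MÜNCHEN.de")  # decomposed é, uppercase domain
b = to_stored("caf\u00e9@münchen.de")   # precomposed é
print(a == b)          # True: both map to the same stored form
print(a.ascii_domain)  # xn--mnchen-3ya.de
```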

The summary: IDN domain support is table stakes. UTF-8 local part support is a best-effort situation until SMTPUTF8 adoption catches up. A validation API that understands both saves you from building these edge cases yourself.

The Users You’re Currently Rejecting

RFC 6530 has been around since 2012. SMTPUTF8 has been in production at Gmail since 2014. The .рф TLD has been active since 2010. This isn’t experimental.

The population of users who could hold internationalized addresses isn’t small. It spans most of East Asia, Russia, the Middle East, and parts of Europe. If your form rejects their addresses, they don’t get an error message that explains what happened. They just can’t sign up.

Fix the easy part first: IDN domain support. Then add EAI syntax validation with NFC normalization. Then decide how you handle the SMTPUTF8 gap based on your user base and risk tolerance.

The users are real. The standard is real. The only thing that’s broken is the validator.