SymbolFYI

Punycode and IDN: How Unicode Domain Names Work

The domain name system was designed in 1987 for a 7-bit ASCII world. DNS labels can only contain letters A–Z, digits 0–9, and hyphens — a 37-character alphabet for a world with thousands of scripts and billions of speakers of Chinese, Arabic, Hindi, Russian, and hundreds of other languages. Internationalized Domain Names (IDN) solve this through Punycode — an encoding that represents any Unicode string as an ASCII-compatible label. Understanding Punycode matters for developers building international applications, security-conscious engineers aware of homograph attacks, and anyone debugging why münchen.de becomes xn--mnchen-3ya.de.

Why Domains Can't Use Unicode Directly

The constraint is not arbitrary. DNS resolvers, registrars, TLD operators, and name servers are distributed systems with billions of deployed nodes. Many were written in the 1990s and early 2000s with hard-coded assumptions about valid label characters. A resolver that encounters a non-ASCII byte in a domain name label has undefined behavior — it might ignore it, corrupt it, or reject the query entirely.

The design requirement was therefore strict: IDN encoding must produce labels that are valid ASCII hostname labels while being:

  1. Reversible (Unicode → ASCII → Unicode with no loss)
  2. Distinguishable from regular ASCII labels (so old resolvers don't accidentally serve münchen.de for a label intended for mnchen.de)
  3. Stable (the same Unicode string always produces the same ASCII label)

Punycode satisfies all three requirements.

The xn-- Prefix

Before getting into the Punycode algorithm itself, it helps to see that the overall IDN architecture is simple. Unicode domain labels are encoded as:

xn--{punycode-encoded-unicode}

The xn-- prefix (ASCII Compatible Encoding prefix, or ACE prefix) signals to IDN-aware software that the label should be Punycode-decoded. Old DNS resolvers that don't understand IDN treat xn--mnchen-3ya.de as a normal ASCII label, resolve it correctly in the DNS, and deliver the response. IDN-aware resolvers decode it to münchen.de for display.

import encodings.idna

# Encode Unicode domain to Punycode ACE form
'münchen.de'.encode('idna').decode('ascii')
# 'xn--mnchen-3ya.de'

# Decode ACE form back to Unicode
'xn--mnchen-3ya.de'.encode('ascii').decode('idna')
# 'münchen.de'

# More examples
'中文.com'.encode('idna').decode('ascii')        # 'xn--fiq228c.com'
'пример.испытание'.encode('idna').decode('ascii')  # Russian .test domain
# 'xn--e1afmapc.xn--80akhbyknj4f'

# Each label is encoded separately
'日本語.jp'.encode('idna').decode('ascii')
# 'xn--wgv71a309e.jp'

Only labels containing non-ASCII characters get the xn-- prefix. Labels that are already valid ASCII remain unchanged:

shop.münchen.de → shop.xn--mnchen-3ya.de
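The per-label rule is easy to see with Python's built-in codec — each label is encoded independently, and pure-ASCII labels pass through untouched:

```python
# Each DNS label is encoded independently; ASCII-only labels are unchanged
for label in 'shop.münchen.de'.split('.'):
    print(label, '→', label.encode('idna').decode('ascii'))
# shop → shop
# münchen → xn--mnchen-3ya
# de → de
```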

The Punycode Algorithm: Bootstring Encoding

Punycode (RFC 3492, 2003) is a specific instance of the "bootstring" algorithm, designed to represent a Unicode string using a subset of ASCII characters. The algorithm is elegant: it handles mixed ASCII/Unicode strings by keeping the ASCII characters in-place and appending the non-ASCII information as a suffix.

The output of Punycode encoding for a label follows this structure:

{basic-code-points}-{delta-encoded-non-basic-code-points}

For münchen:

  • Basic (ASCII) characters: m, n, c, h, e, n (6 characters)
  • Non-basic character: ü (at position 1, code point U+00FC)
  • Result: mnchen-3ya (the 3ya after the hyphen encodes the position and code point of ü)

The delta encoding uses a generalized variable-length integer scheme (base 36, with bias adaptation) that compresses the position and code point value into a short sequence of ASCII characters. The full algorithm is detailed in RFC 3492, but the key properties for developers are:

  • The hyphen separator (-) separates basic from non-basic portions
  • If all characters are non-ASCII (e.g., 中文), there's no hyphen: fiq228c
  • Character case in Punycode is significant for the basic portion, but DNS and IDN comparison of labels is case-insensitive

# Python's punycode codec (raw Punycode, without the IDN label processing)
'münchen'.encode('punycode').decode('ascii')  # 'mnchen-3ya'
'中文'.encode('punycode').decode('ascii')       # 'fiq228c'

# The IDNA codec handles the full label process (normalization + Punycode + xn--)
'münchen'.encode('idna')  # b'xn--mnchen-3ya'
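One subtlety: a label may itself contain literal hyphens, so RFC 3492 defines the delimiter as the last hyphen in the encoded string. A small sketch with the stdlib codec:

```python
# Labels can contain literal hyphens; the decoder splits at the LAST hyphen,
# so everything before it is the basic (ASCII) portion
s = 'mün-chen'
enc = s.encode('punycode')
print(enc)  # basic part 'mn-chen', then the delimiter '-', then delta digits
assert enc.decode('punycode') == s  # round-trip is lossless
```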

IDNA 2003 vs IDNA 2008

There are two versions of the IDNA (Internationalized Domain Names in Applications) standard, and they disagree on how to process Unicode characters before Punycode encoding.

IDNA 2003 (RFC 3490/3491/3492)

IDNA 2003 uses NAMEPREP, a profile of the Stringprep framework. Key behaviors:

  • Applies Unicode normalization (NFKC) to fold compatibility variants
  • Maps certain characters: uppercase to lowercase, some compatibility characters to their canonical equivalents
  • Prohibits certain characters (combining marks, directional formatting, invisible characters)
  • Maps some characters that IDNA 2008 instead treats as valid (e.g., ß maps to ss, whereas IDNA 2008 keeps ß intact)

IDNA 2008 (RFC 5890/5891/5892/5893)

IDNA 2008 takes a stricter approach, defining which code points are valid, restricted, or disallowed using Unicode character properties:

  • Valid (PVALID): May be used in domain labels
  • Contextual (CONTEXTJ/CONTEXTO): May be used only in specific contexts (e.g., Devanagari virama requires certain neighboring characters)
  • Disallowed (DISALLOWED): May not be used at all

The critical differences:

Character              | IDNA 2003                              | IDNA 2008
ß (German sharp s)     | Maps to ss                             | Valid as-is (PVALID)
ς (Greek final sigma)  | Maps to σ                              | Valid as-is (PVALID)
Emoji                  | Allowed (Nameprep didn't prohibit most symbols) | Disallowed (DISALLOWED)
Zero-width joiner      | Allowed in some contexts               | CONTEXTJ (more restrictive)

This means faß.de encodes differently under the two standards:

  • IDNA 2003: faß.de → fass.de (ß mapped to ss, so the label is plain ASCII)
  • IDNA 2008: faß.de → xn--fa-hia.de (ß encoded as-is)

Both are registered as valid domains, but they point to different labels. An IDNA 2003 client looking up faß.de would query fass.de; an IDNA 2008 client would query xn--fa-hia.de.
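The divergence is easy to reproduce in Python: the stdlib codec implements IDNA 2003, while the idna package (assumed installed from PyPI) implements IDNA 2008:

```python
import idna  # PyPI package, implements IDNA 2008

# IDNA 2003 (stdlib codec): Nameprep maps ß → ss, so no Punycode is needed
print('faß.de'.encode('idna'))   # b'fass.de'

# IDNA 2008 (idna package): ß is PVALID and survives into the encoding
print(idna.encode('faß.de'))     # b'xn--fa-hia.de'
```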

What Browsers and Libraries Use

Software                   | IDNA version
Chrome                     | IDNA 2008 (via ICU)
Firefox                    | IDNA 2008
Safari                     | IDNA 2008
Python encodings.idna      | IDNA 2003
Python idna library (pip)  | IDNA 2008
Node.js url module         | IDNA 2008 (via V8/ICU)
OpenSSL                    | IDNA 2003

The inconsistency between Python's built-in idna codec (2003) and browser behavior (2008) creates practical bugs. For any serious IDN work in Python, use the idna package from PyPI:

# Built-in (IDNA 2003 — may differ from browsers)
'münchen.de'.encode('idna')

# PyPI idna package (IDNA 2008 — matches browsers)
import idna
idna.encode('münchen.de')          # b'xn--mnchen-3ya.de'
idna.decode('xn--mnchen-3ya.de')   # 'münchen.de'

# Check a label's validity under IDNA 2008
idna.check_label('münchen')        # Returns None if valid, raises idna.IDNAError if not

Browser Display Policies

Browsers don't always display the Unicode form of an IDN domain. They apply their own policies to prevent visual confusion, and these policies have become stricter over time in response to homograph attacks.

Current Chrome/Edge policy (as of 2022+): Display the Unicode form only if the label meets all of these conditions:

  1. All characters are from a single script (or a permitted script mixture)
  2. All characters are in the browser's "safe" whitelist
  3. The domain passes the registry's own IDN policy (if known)
  4. The label doesn't mix scripts in confusing ways (e.g., Latin + Cyrillic)

If any condition fails, the browser displays the xn-- Punycode form in the address bar.

apple.com           → displayed as apple.com
аррlе.com           → displayed as xn--80ak6aa92e.com  (all-Cyrillic lookalike)
münchen.de          → displayed as münchen.de
中文.com            → displayed as 中文.com

Firefox has similar policies but implements them differently in some edge cases. Safari follows IDNA 2008 and also has confusable-character detection.

Homograph Attacks

The IDN homograph attack (described in 2001 by Evgeniy Gabrilovich and Alex Gontmakher) exploits the fact that many Unicode characters look visually identical to Latin letters used in common domains.

Classic examples:

Fake domain       | Looks like | What it actually is
аррlе.com         | apple.com  | Entirely Cyrillic: а (U+0430), р (U+0440), ӏ (U+04CF), е (U+0435)
ɡoogle.com        | google.com | U+0261 LATIN SMALL LETTER SCRIPT G
paypaI.com        | paypal.com | Capital I instead of lowercase l (plain ASCII, not IDN)
xn--pple-43d.com  | apple.com  | аpple.com with Cyrillic а (U+0430)

The last example is particularly instructive: only a single character is substituted, and showing the Punycode form in the URL bar reveals a deception that the Unicode rendering would hide.
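Stripping the xn-- prefix and decoding with the stdlib punycode codec shows exactly what xn--pple-43d contains:

```python
# Strip the 'xn--' ACE prefix and decode the raw Punycode
decoded = b'pple-43d'.decode('punycode')
print(decoded)                      # 'аpple' — looks like apple
print(f'U+{ord(decoded[0]):04X}')   # U+0430 (CYRILLIC SMALL LETTER A)
```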

Prevention for Developers

If you're building a system that allows user-input of domain names:

import idna
import unicodedata

def validate_domain_for_display(domain: str) -> dict[str, str | bool]:
    """Check if a domain name might be a homograph attack."""
    labels = domain.lower().split('.')

    result = {
        'punycode': '',
        'original': domain,
        'mixed_script': False,
    }

    try:
        # Encode to Punycode
        result['punycode'] = idna.encode(domain).decode('ascii')
    except idna.IDNAError as e:
        result['error'] = str(e)
        return result

    # Check for mixed scripts within a label
    for label in labels:
        scripts = set()
        for char in label:
            # The stdlib exposes only the general category, not the Unicode
            # script property, so bucket letters by code-point range below
            cat = unicodedata.category(char)
            if cat not in ('Ll', 'Lu'):
                continue
            # Simple heuristic: check code point ranges
            cp = ord(char)
            if 0x0041 <= cp <= 0x007A:
                scripts.add('Latin')
            elif 0x0400 <= cp <= 0x04FF:
                scripts.add('Cyrillic')
            elif 0x0370 <= cp <= 0x03FF:
                scripts.add('Greek')

        if len(scripts) > 1:
            result['mixed_script'] = True

    return result

# Usage
print(validate_domain_for_display('аррlе.com'))
# {'punycode': 'xn--80ak6aa92e.com', ..., 'mixed_script': False}
# (All Cyrillic — not mixed, but still suspicious vs apple.com)

For a robust confusable check, use the Unicode Consortium's confusables data, for example via the confusable_homoglyphs Python package or ICU's USpoofChecker:

# Using the confusable_homoglyphs package (pip install confusable_homoglyphs)
from confusable_homoglyphs import confusables

# Returns a truthy list of dangerous characters, or False
confusables.is_confusable('аррlе', preferred_aliases=['latin'])  # truthy
confusables.is_confusable('apple', preferred_aliases=['latin'])  # False

Implementing IDN in Applications

Python

import idna

# Encoding (Unicode → ACE) — the idna package implements IDNA 2008
ace = idna.encode('münchen.de').decode('ascii')
print(ace)  # 'xn--mnchen-3ya.de'

# Decoding (ACE → Unicode)
unicode_domain = idna.decode('xn--mnchen-3ya.de')
print(unicode_domain)  # 'münchen.de'

# For use with urllib/requests — handle IDN transparently
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Ensure URLs with IDN domains use ACE form for HTTP requests."""
    parsed = urlparse(url)
    host = parsed.hostname or ''
    try:
        ace_host = idna.encode(host).decode('ascii')
    except idna.IDNAError:
        ace_host = host  # already ASCII, or not a valid IDN hostname
    netloc = ace_host + (f':{parsed.port}' if parsed.port else '')
    return parsed._replace(netloc=netloc).geturl()

normalize_url('https://münchen.de/page')
# 'https://xn--mnchen-3ya.de/page'

JavaScript / Node.js

Modern Node.js handles IDN natively in the URL constructor:

// URL API handles IDN automatically
const url = new URL('https://münchen.de/path');
console.log(url.hostname);   // 'xn--mnchen-3ya.de' (ACE form)
console.log(url.href);       // 'https://xn--mnchen-3ya.de/path'

// For display, decode the ACE form back to Unicode. Don't hand-roll the
// Punycode decoder — use a library.

// The 'tr46' package implements UTS #46 (IDNA compatibility processing)
const tr46 = require('tr46');  // npm install tr46
tr46.toASCII('münchen.de');             // 'xn--mnchen-3ya.de'
tr46.toUnicode('xn--mnchen-3ya.de');    // { domain: 'münchen.de', ... }

TLS Certificates and IDN

TLS certificates (for HTTPS) handle IDN domains in one of two ways:

  1. ACE form in the Subject Alternative Name (SAN): The certificate lists xn--mnchen-3ya.de. This is the most common approach.
  2. Unicode form in SAN: Technically allowed by RFC 5280 but rarely used in practice.

When verifying a certificate for an IDN domain, clients must compare the ACE form of the requested host against the certificate's SAN. This is handled automatically by TLS libraries (OpenSSL, NSS, Secure Transport) but worth knowing if you're writing custom certificate validation.
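As a sketch of that comparison — simplified to exact matches only, with no wildcard handling, and a hypothetical SAN list that real code would extract from the parsed certificate:

```python
import idna  # PyPI package


def hostname_matches_san(unicode_host: str, san_dns_names: list[str]) -> bool:
    """Compare the ACE form of the requested host against SAN dNSName entries."""
    ace = idna.encode(unicode_host).decode('ascii').lower()
    return any(ace == san.lower() for san in san_dns_names)


# Hypothetical SAN list for illustration
print(hostname_matches_san('münchen.de', ['xn--mnchen-3ya.de']))  # True
```

Real validation also handles wildcards (*.example.com) and multiple SAN entries; TLS libraries do all of this for you.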

Domain Registration Rules

Not all Unicode characters can be registered as domain labels, even if Punycode can encode them. ICANN and individual registry operators impose additional restrictions:

  • Most ccTLD registries (.de, .jp, .kr, .cn) maintain their own character lists
  • Script mixing rules vary by registry (some allow Latin+digits, others restrict to a single script)
  • Some characters are reserved or prohibited by ICANN's registry agreements
  • Emoji domains (🍕.ws, technically possible) are supported by some ccTLDs but not gTLDs

The practical result: just because you can encode a string with Punycode doesn't mean you can register it as a domain. Always check the specific registry's IDN policy.
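The gap between "Punycode can encode it" and "IDNA allows it" is visible directly in Python (assuming the PyPI idna package is installed):

```python
import idna

# The raw punycode codec happily encodes an emoji label...
print('🍕'.encode('punycode').decode('ascii'))

# ...but IDNA 2008 rejects it: U+1F355 is DISALLOWED
try:
    idna.encode('🍕.ws')
except idna.IDNAError as e:
    print('rejected:', e)
```

Registries that support emoji domains accept them under the older IDNA 2003 rules, not IDNA 2008.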

You can inspect how any Unicode string encodes to Punycode using our Encoding Converter — paste a domain name to see its ACE form and the individual byte representations of each label.

Summary

Punycode allows the 30-year-old DNS infrastructure to carry Unicode domain names without modification. The mechanism is straightforward: Unicode labels get xn-- prepended and Punycode-encoded; DNS resolves them as ASCII; IDN-aware software decodes for display. The IDNA 2003/2008 split creates real compatibility issues that affect Python in particular — use the idna PyPI package to match browser behavior. And homograph attacks are a genuine security concern: any system processing user-supplied domain names should normalize to Punycode and check for confusable characters before trusting or displaying them.


Encoding Survival Guide — Complete. You've now covered the full spectrum: UTF-8's byte patterns, diagnosing mojibake, how detection algorithms work, JavaScript's UTF-16 internals, legacy encodings, and IDN. Explore individual characters, their encodings, and byte sequences on SymbolFYI's character reference pages.
