IDN Homograph Attacks: When Unicode Becomes a Security Threat
In 2001, security researchers Evgeniy Gabrilovich and Alex Gontmakher demonstrated that internationalized domain names could be used to create visually perfect replicas of legitimate domains using characters from other scripts. The attack vector is elegant in its simplicity: а (U+0430 Cyrillic small letter a) is visually indistinguishable from a (U+0061 Latin small letter a). To a human reading a URL, apple.com spelled with Cyrillic characters looks identical to the real one.
This class of attack is called an IDN homograph attack (also called a Unicode spoofing attack or homoglyph attack).
How the Attack Works
The Internationalized Domain Names in Applications (IDNA) standard allows domain registrars to register domains using Unicode characters. These are stored in DNS using Punycode encoding but displayed to users in their Unicode form.
An attacker registers a domain where one or more characters are replaced with visually similar characters from a different Unicode script:
Legitimate: apple.com (all Latin characters)
Attacker: аpple.com (Cyrillic а + Latin pple)
Punycode: xn--pple-43d.com (the real DNS name)
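The Unicode-to-Punycode round trip is easy to see from Python. This sketch uses the built-in "idna" codec (which implements the older IDNA 2003 rules; the third-party idna package implements the stricter IDNA 2008 rules used by modern registries):

```python
# The display form: first letter is Cyrillic U+0430, not Latin 'a'
hostname = "аpple.com"

# ToASCII: the ACE (ASCII Compatible Encoding) form actually stored in DNS
ace = hostname.encode("idna")
print(ace)  # b'xn--pple-43d.com'

# ToUnicode: back to the form browsers may display to users
print(ace.decode("idna"))  # 'аpple.com'
```

The `xn--` prefix marks a Punycode-encoded label; everything after it is ASCII, which is why the ACE form cannot be visually spoofed.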
The attack surface is enormous. Unicode 15.0 defines over 140,000 characters across 160+ scripts. Many share similar or identical glyphs:
| ASCII | Lookalike | Script | Code Point |
|---|---|---|---|
| a | а | Cyrillic | U+0430 |
| e | е | Cyrillic | U+0435 |
| o | о | Cyrillic | U+043E |
| c | с | Cyrillic | U+0441 |
| p | р | Cyrillic | U+0440 |
| x | х | Cyrillic | U+0445 |
| i | і | Cyrillic (Ukrainian і) | U+0456 |
| B | В | Cyrillic | U+0412 |
| H | Η | Greek | U+0397 |
| n | η | Greek | U+03B7 |
| 0 | О | Cyrillic | U+041E |
| 1 | l | Latin | U+006C |
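The difference between these pairs is invisible on screen but unambiguous in the code points. A quick check with Python's standard unicodedata module makes it concrete:

```python
import unicodedata

# Two characters that render identically in most fonts:
latin_a = "a"      # U+0061
cyrillic_a = "а"   # U+0430

print(latin_a == cyrillic_a)  # False — different code points
print(f"U+{ord(latin_a):04X}", unicodedata.name(latin_a))
# U+0061 LATIN SMALL LETTER A
print(f"U+{ord(cyrillic_a):04X}", unicodedata.name(cyrillic_a))
# U+0430 CYRILLIC SMALL LETTER A
```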
A sophisticated attacker can construct domains using only Cyrillic characters where every character has a Latin lookalike, making the entire domain appear Latin:
аррӏе.com → xn--80ak6aa92e.com (entirely Cyrillic: а, р, р, ӏ, е — the "l" is ӏ, U+04CF CYRILLIC LETTER PALOCHKA)
This was demonstrated in 2017 by security researcher Xudong Zheng, who registered xn--80ak6aa92e.com and showed that it displayed as apple.com in Chrome and Firefox before both browsers patched their display logic.
Browser Defenses
Modern browsers implement several heuristics to decide when to display a domain in Unicode vs. Punycode (the ASCII form, which cannot be spoofed):
The Chrome/Firefox Mixed-Script Rule
If a domain label contains characters from more than one script (e.g., Cyrillic mixed with Latin), display it in Punycode:
аpple.com → displayed as xn--pple-43d.com (Cyrillic + Latin = suspicious)
аррlе.com → MAY display as аррlе.com (all Cyrillic — ambiguous)
The all-Cyrillic case (аррlе.com) is still dangerous because users in Cyrillic-script regions may see it as a normal internationalized domain. Chrome's approach for .com, .net, .org (and other "highly spoofable" TLDs) is to display Punycode if the domain could be spoofed.
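The rule above can be sketched as a small decision function. This is a toy approximation for illustration only — Chrome's actual algorithm layers many more heuristics (confusable skeletons, TLD policies, locale checks) on top of the mixed-script test:

```python
def display_form(label_unicode: str, label_ace: str) -> str:
    """Show the Unicode form only if all letters come from a single script."""
    scripts = set()
    for ch in label_unicode:
        if not ch.isalpha():
            continue
        cp = ord(ch)
        if cp <= 0x024F:                 # Basic Latin + Latin Extended
            scripts.add("Latin")
        elif 0x0400 <= cp <= 0x04FF:     # Cyrillic
            scripts.add("Cyrillic")
        else:
            scripts.add("Other")
    return label_unicode if len(scripts) <= 1 else label_ace

print(display_form("аpple", "xn--pple-43d"))      # 'xn--pple-43d' — mixed scripts
print(display_form("apple", "apple"))             # 'apple' — all Latin
print(display_form("аррӏе", "xn--80ak6aa92e"))    # 'аррӏе' — all Cyrillic, still shown
```

The last line is exactly the dangerous ambiguous case: a single-script label sails through the mixed-script test.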
The ICANN Restriction Policy
ICANN's IDN implementation guidelines restrict which characters may be mixed within a single domain label. Registrars complying with these guidelines will reject registrations that mix scripts (e.g., Latin + Cyrillic in the same label).
However, compliance is not universal, and the rules apply only at the label level — a domain whose labels use different scripts can still pass (e.g., a fully Cyrillic label under the Latin-script TLD .com mixes scripts across labels, not within one, and may well be legitimate).
Browser Safe-Script Lists
Chrome maintains a list of scripts whose characters are "safe" to display without Punycode, informed by the browser's locale settings: characters belonging to the user's locale are more likely to be shown in Unicode. This is why the attack was more effective against users in non-Latin-script locales.
Detecting Homograph Attacks in Code
The Unicode Confusables Dataset
The Unicode Consortium maintains the confusables.txt data file, which maps characters to their "confusable" counterparts. The file lists character pairs that have been identified as visually similar enough to cause confusion:
```python
# pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

# Check whether a string contains characters confusable with another script
result = confusables.is_confusable("pаypal.com", preferred_aliases=["latin"])
print(bool(result))  # True — the 'а' (Cyrillic) is confusable with 'a' (Latin)

# is_confusable returns False, or a list of dicts describing each
# dangerous character ('character', 'alias', 'homoglyphs')
for entry in confusables.is_confusable("аpple", preferred_aliases=["latin"]) or []:
    ch = entry["character"]
    print(f"'{ch}' (U+{ord(ch):04X}, {entry['alias']}) is confusable with: "
          f"{[h['c'] for h in entry['homoglyphs']]}")
```
Script-mixing detection
```python
import idna  # pip install idna

def get_script(char: str) -> str:
    """Get the Unicode script of a character (simplified)."""
    # In production, use the 'regex' package (\p{Script=...}) for full script data
    cp = ord(char)
    if 0x0041 <= cp <= 0x007A: return 'Latin'
    if 0x00C0 <= cp <= 0x024F: return 'Latin'      # Latin Extended
    if 0x0400 <= cp <= 0x04FF: return 'Cyrillic'
    if 0x0370 <= cp <= 0x03FF: return 'Greek'
    if 0x0600 <= cp <= 0x06FF: return 'Arabic'
    if 0x4E00 <= cp <= 0x9FFF: return 'Han'
    if 0x3040 <= cp <= 0x30FF: return 'Kana'       # Hiragana + Katakana
    return 'Other'

def is_mixed_script_domain(domain: str) -> bool:
    """
    Return True if any label (excluding the TLD) mixes scripts,
    which is a strong indicator of a homograph attack.
    """
    # Normalize: decode Punycode labels to their Unicode display form
    try:
        display_domain = idna.decode(domain)
    except (idna.IDNAError, UnicodeError):
        display_domain = domain
    for label in display_domain.split('.')[:-1]:   # exclude the TLD
        scripts = {get_script(c) for c in label if c.isalpha()}
        if len(scripts) > 1:
            return True
    return False

is_mixed_script_domain("аpple.com")    # True  — Cyrillic а + Latin pple
is_mixed_script_domain("münchen.de")   # False — all Latin
is_mixed_script_domain("apple.com")    # False — all Latin
```
Full homograph detection for security-sensitive contexts
```python
import idna
from urllib.parse import urlparse

def validate_url_safety(url: str, expected_domain: str) -> tuple[bool, str]:
    """
    Validate that a URL's domain is not a homograph spoof of the expected domain.
    Returns (is_safe, reason).
    """
    parsed = urlparse(url)
    hostname = parsed.hostname or ''
    # Decode Punycode to the Unicode display form for inspection
    try:
        display_host = idna.decode(hostname)
    except (idna.IDNAError, UnicodeError):
        display_host = hostname
    # Check for non-ASCII characters in the domain
    non_ascii = [(i, c) for i, c in enumerate(display_host) if ord(c) > 127]
    if not non_ascii:
        # ASCII-only domain — direct comparison is safe
        if display_host.lower() == expected_domain.lower():
            return True, "Domain matches"
        return False, f"Domain mismatch: {display_host} != {expected_domain}"
    # Contains non-ASCII — treat as unsafe pending deeper analysis
    # (in production, run the confusable_homoglyphs checks here)
    return False, f"Non-ASCII characters in domain: {display_host}"
```
Application-Level Defenses
1. Never trust display form; use ACE form for comparison
```python
import idna

def normalize_domain_for_comparison(domain: str) -> str:
    """
    Convert any domain to its ACE (ASCII Compatible Encoding) form.
    Use this form when comparing domains for security purposes.
    """
    ace_labels = []
    for label in domain.lower().split('.'):
        try:
            ace_labels.append(idna.encode(label).decode('ascii'))
        except (idna.IDNAError, UnicodeError):
            ace_labels.append(label)
    return '.'.join(ace_labels)

# Two visually similar domains normalize to different ACE forms:
normalize_domain_for_comparison("apple.com")   # 'apple.com'
normalize_domain_for_comparison("аpple.com")   # 'xn--pple-43d.com' ← different!
```
2. Allowlist approach for sensitive operations
For high-security applications (banking, OAuth redirects, email link validation), use an allowlist of pre-verified domains rather than trying to detect all possible spoofs:
```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = frozenset({
    'example.com',
    'api.example.com',
    'cdn.example.com',
})

def is_allowed_redirect(url: str) -> bool:
    parsed = urlparse(url)
    hostname = (parsed.hostname or '').lower()
    # Always compare the ACE form, never the display form
    return normalize_domain_for_comparison(hostname) in ALLOWED_DOMAINS
```
3. Display Punycode for unrecognized internationalized domains
If your application displays URLs to users (in emails, notifications, admin panels), show Punycode for any domain containing non-ASCII characters unless the domain is in your verified allowlist:
```python
from urllib.parse import urlparse

def safe_display_url(url: str, trusted_domains: set[str]) -> str:
    """Display a URL with Punycode for untrusted IDNs."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ''   # note: urlparse lowercases the host
    if any(ord(c) > 127 for c in hostname):
        ace = normalize_domain_for_comparison(hostname)
        if ace not in trusted_domains:
            # Swap the display form for the ACE form in the shown URL
            return url.replace(hostname, ace)
    return url
```
Zero-Width and Invisible Characters
Homograph attacks using lookalike characters are detectable by careful visual inspection under the right font. A more subtle variant uses zero-width characters that are completely invisible:
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM)
An attacker could insert apple\u200b.com (a zero-width space between "apple" and ".com") — visually identical to apple.com but a different string. Always strip zero-width characters from security-sensitive inputs:
```python
ZERO_WIDTH_CHARS = {
    '\u200B',  # ZERO WIDTH SPACE
    '\u200C',  # ZERO WIDTH NON-JOINER
    '\u200D',  # ZERO WIDTH JOINER
    '\u2060',  # WORD JOINER
    '\uFEFF',  # BOM / ZERO WIDTH NO-BREAK SPACE
    '\u00AD',  # SOFT HYPHEN
}

def strip_invisible(text: str) -> str:
    return ''.join(c for c in text if c not in ZERO_WIDTH_CHARS)
```
Testing Your Defenses
To verify your application correctly rejects or flags homograph attempts, test against known confusable characters:
```python
# Test vectors for homograph detection
TEST_CASES = [
    # (input, expected_safe, description)
    ("apple.com", True, "ASCII — baseline"),
    ("аpple.com", False, "Cyrillic а (U+0430)"),
    ("аррlе.com", False, "All Cyrillic lookalikes"),
    ("münchen.de", True, "Legitimate German IDN"),
    ("amazon\u200b.com", False, "Zero-width space injection"),
    ("paypal\u00AD.com", False, "Soft hyphen injection"),
    ("xn--pple-43d.com", False, "Punycode of Cyrillic а+pple"),
]

def run_homograph_tests(detection_fn):
    for domain, expected_safe, description in TEST_CASES:
        is_safe = detection_fn(domain)
        status = "PASS" if is_safe == expected_safe else "FAIL"
        print(f"{status} {description}: {domain!r}")

# Usage — expect FAILs on the all-Cyrillic and invisible-character cases:
# script mixing alone is not sufficient, so layer it with confusables checks
# and strip_invisible:
run_homograph_tests(lambda d: not is_mixed_script_domain(d))
```
For a quick manual check, use the SymbolFYI Character Counter to reveal invisible and zero-width characters in a string — paste a suspicious URL to see exactly what code points it contains. The code point view will immediately show whether a is U+0061 (Latin) or U+0430 (Cyrillic).
Next in Series: Web Fonts and Unicode Subsetting: Loading Only What You Need — how unicode-range descriptors allow browsers to download only the font data needed for the characters actually present on the page.