IDN Homograph Attacks
An IDN (Internationalized Domain Name) homograph attack exploits the visual similarity between Unicode characters from different scripts to register domain names that appear identical to legitimate ones. These attacks are possible because IDNA allows Unicode characters in domain names, storing them in the DNS as ASCII through Punycode encoding. They are hard to spot because the spoofed name can render identically to the real one.
How Homograph Attacks Work
Many Unicode characters look visually identical or nearly identical to common Latin characters:
Latin a (U+0061) vs. Cyrillic а (U+0430) → identical in many fonts
Latin e (U+0065) vs. Cyrillic е (U+0435) → identical in many fonts
Latin o (U+006F) vs. Cyrillic о (U+043E) → identical in many fonts
Latin p (U+0070) vs. Cyrillic р (U+0440) → identical in many fonts
Latin c (U+0063) vs. Cyrillic с (U+0441) → identical in many fonts
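The pairs above can be inspected with Python's standard unicodedata module. This is a minimal sketch; the PAIRS list is just the five pairs from the list, written out by hand for illustration.

```python
import unicodedata

# (Latin, Cyrillic) look-alike pairs taken from the list above
PAIRS = [
    ("a", "\u0430"),
    ("e", "\u0435"),
    ("o", "\u043e"),
    ("p", "\u0440"),
    ("c", "\u0441"),
]

for latin, cyrillic in PAIRS:
    # Print code point and official Unicode name for each side of the pair
    print(
        f"U+{ord(latin):04X} {unicodedata.name(latin)}"
        f"  vs  U+{ord(cyrillic):04X} {unicodedata.name(cyrillic)}"
        f"  equal: {latin == cyrillic}"
    )
```

Each pair compares as unequal even though the glyphs may be indistinguishable on screen, which is why string comparison and DNS resolution treat the two names as unrelated.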
An attacker can register аррӏе.com, whose first label consists entirely of Cyrillic characters. It looks identical to apple.com in many fonts but is a different domain that can resolve to an attacker-controlled IP address:
import encodings.idna
legitimate = 'apple.com'
attack = '\u0430\u0440\u0440\u04CF\u0435.com' # Cyrillic apple
# They look the same but are different
print(legitimate == attack) # False
# Punycode reveals the difference
print(attack.encode('idna')) # b'xn--80ak6aa92e.com'
print(legitimate.encode('idna')) # b'apple.com'
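To make the xn-- form less opaque, the following sketch builds the ASCII label by hand, assuming Python's built-in punycode codec. It only illustrates the relationship between the Unicode label and its encoded form; real IDNA processing also maps and validates the label before encoding, which is skipped here.

```python
# Cyrillic look-alike of the label "apple" (same code points as the example above)
label = "\u0430\u0440\u0440\u04cf\u0435"

# The ASCII-compatible label is the prefix "xn--" plus the Punycode encoding
ace_label = b"xn--" + label.encode("punycode")
print(ace_label)

# Decoding reverses the process and recovers the original code points
decoded = ace_label[len(b"xn--"):].decode("punycode")
print([f"U+{ord(ch):04X}" for ch in decoded])
```

Listing the decoded code points is a quick way to see which scripts a suspicious xn-- label actually contains.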
Browser Defenses
Browsers have implemented various strategies to mitigate homograph attacks:
Chrome
Chrome displays domains in Punycode form (xn--...) when the domain contains:
- Characters from scripts not used in the user's top preferred languages (the rule in older releases; since Chrome 51 the decision no longer depends on language settings)
- Mixed-script labels (e.g., Latin + Cyrillic in the same label)
- Characters that are confusable with ASCII according to the Unicode confusables data (UTS #39)
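The confusable-with-ASCII idea can be sketched as a "skeleton" comparison: map each look-alike character to its ASCII prototype and compare the result against names worth protecting. This is not Chrome's implementation. CONFUSABLE_TO_ASCII is a tiny hand-written subset rather than the real UTS #39 data, and PROTECTED_DOMAINS, skeleton, and is_lookalike are hypothetical names used only for illustration.

```python
# Illustrative subset of confusable mappings (NOT the full UTS #39 table)
CONFUSABLE_TO_ASCII = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u04cf": "l",  # Cyrillic palochka ӏ
}

# Hypothetical list of domains that should not be imitated
PROTECTED_DOMAINS = {"apple.com", "paypal.com"}


def skeleton(domain: str) -> str:
    """Replace known look-alike characters with their ASCII prototypes."""
    return "".join(CONFUSABLE_TO_ASCII.get(ch, ch) for ch in domain.lower())


def is_lookalike(domain: str) -> bool:
    """True if the domain is not itself protected but collapses to a protected name."""
    normalized = domain.lower()
    return normalized not in PROTECTED_DOMAINS and skeleton(normalized) in PROTECTED_DOMAINS


print(is_lookalike("\u0430\u0440\u0440\u04cf\u0435.com"))  # all-Cyrillic first label
print(is_lookalike("apple.com"))                            # the genuine domain
```

A production check would load the complete confusables table and a much larger set of protected names. The sketch only shows the shape of the comparison.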
Firefox
Firefox uses an IDN display algorithm that compares domains against a whitelist of TLDs with strong registry policies and shows Punycode for suspicious combinations.
Safari
Safari displays Punycode for domains with mixed-script characters.
ICANN Policies
The Internet Corporation for Assigned Names and Numbers (ICANN) has established rules for IDN registration:
- Registry operators must implement "bundle" policies grouping visually similar characters
- Many registries prohibit mixed-script registrations entirely
- The ICANN IDN Guidelines prohibit registration of domain names that are confusable with existing delegated TLDs
Detecting Confusable Characters
Unicode provides a confusables database (Unicode Security Mechanisms, UTS #39) listing character pairs that are visually similar:
# Using the 'confusable-homoglyphs' package
from confusable_homoglyphs import confusables
# Check whether a string mixes scripts and contains characters confusable with Latin
bool(confusables.is_dangerous('аррӏе.com', preferred_aliases=['latin']))
# True: the Cyrillic label and the Latin "com" mix scripts, and the Cyrillic letters are confusable with Latin
# List the homoglyphs of a single character
confusables.is_confusable('а', preferred_aliases=['latin'])
# Returns a list of dicts describing Cyrillic а and its homoglyphs, which include Latin a
Mitigation for Developers
Domain Validation
import unicodedata
import idna

def validate_domain(domain: str) -> bool:
    """Return False for domains that fail IDNA encoding or mix scripts within a label."""
    try:
        # Attempt IDNA encoding with UTS #46 pre-processing; raises on invalid input
        idna.encode(domain, uts46=True)
    except (idna.IDNAError, UnicodeError):
        return False
    # Check for mixed scripts in each label
    for label in domain.rstrip('.').split('.'):
        scripts = set()
        for char in label:
            name = unicodedata.name(char, '')
            if 'LATIN' in name:
                scripts.add('latin')
            elif 'CYRILLIC' in name:
                scripts.add('cyrillic')
            elif 'GREEK' in name:
                scripts.add('greek')
        if len(scripts) > 1:
            return False  # Mixed script: suspicious
    # Note: a label written entirely in one non-Latin script still passes this check
    return True
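The per-label script check above has a blind spot worth seeing in isolation. The sketch below pulls the same name-based heuristic into a standalone helper (label_scripts is a hypothetical name, and it uses only the standard library) and runs it on three labels.

```python
import unicodedata


def label_scripts(label: str) -> set[str]:
    """Collect the scripts seen in a label, using Unicode character names as a rough guide."""
    scripts = set()
    for char in label:
        name = unicodedata.name(char, "")
        if "LATIN" in name:
            scripts.add("latin")
        elif "CYRILLIC" in name:
            scripts.add("cyrillic")
        elif "GREEK" in name:
            scripts.add("greek")
    return scripts


print(label_scripts("apple"))                          # genuine Latin label
print(label_scripts("\u0430pple"))                     # Cyrillic а followed by Latin "pple"
print(label_scripts("\u0430\u0440\u0440\u04cf\u0435"))  # label written entirely in Cyrillic
```

The second label is flagged as mixed, but the third reports a single script and would be accepted. A mixed-script check therefore needs to be paired with a confusables lookup, such as the UTS #39 data described earlier, to catch whole-script spoofs.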
Display in Applications
When displaying user-supplied URLs, always show both the Unicode and Punycode forms for non-ASCII domains, or consistently show the Punycode form to prevent confusion.
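One way to follow this advice is a small display helper. The sketch below assumes Python's built-in idna codec, which implements the older IDNA 2003 rules, and describe_host is a hypothetical name. It shows the Unicode form together with the Punycode form whenever the two differ.

```python
def describe_host(host: str) -> str:
    """Return a display string that exposes the Punycode form of non-ASCII hosts."""
    try:
        # Normalize to the ASCII (xn--) form, then back to Unicode
        ace = host.encode("idna").decode("ascii")
        uni = ace.encode("ascii").decode("idna")
    except UnicodeError:
        # Covers encoding and decoding failures for malformed names
        return f"{host!r} (invalid IDN)"
    # Plain ASCII hosts are shown unchanged; IDNs are shown in both forms
    return ace if ace == uni else f"{uni} ({ace})"


print(describe_host("example.com"))
print(describe_host("b\u00fccher.example"))
print(describe_host("xn--bcher-kva.example"))
```

Because the helper round-trips through the ASCII form, a host that arrives already encoded as xn-- is displayed the same way as its Unicode spelling, so users always see both.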