SymbolFYI

Confusables (Homoglyphs)

Unicode Standard
Definition

Characters that look similar or identical but have different code points (e.g., Latin 'A' U+0041 vs Cyrillic 'А' U+0410).

What Are Confusables?

Confusables (also called visually confusable characters or homoglyphs) are Unicode characters with different code points, often from different scripts, that appear identical or nearly identical when rendered in typical fonts. For example, the Latin lowercase a (U+0061) and the Cyrillic lowercase а (U+0430) look the same in most typefaces, yet they are distinct code points with different script properties.
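To make the distinction concrete, the following minimal sketch uses only Python's standard unicodedata module to print the code point and official name of each character. The variable names are illustrative.

```python
import unicodedata

latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a

# Print the code point and the official Unicode name of each character
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The glyphs look alike, but the strings are not equal
print(latin_a == cyrillic_a)  # False
```

A plain string comparison therefore never treats the two as the same letter, which is why lookalike names pass uniqueness checks.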

Unicode maintains an official list of confusables in the file confusables.txt, published as part of Unicode Security Mechanisms (Unicode Technical Standard #39, UTS #39).
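As a rough illustration of how that data can be consumed, the sketch below parses a few lines written in the semicolon-separated layout of confusables.txt (source code point, target code points, type, then a comment). The SAMPLE string and the parse_confusables helper are assumptions made for this example. A real program would read the published file instead.

```python
# Sample lines in the layout of confusables.txt (illustrative excerpt only)
SAMPLE = """\
# source ; target ; type # comment
0430 ;\t0061 ;\tMA\t# ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A
043E ;\t006F ;\tMA\t# ( о → o ) CYRILLIC SMALL LETTER O → LATIN SMALL LETTER O
006D ;\t0072 006E ;\tMA\t# ( m → rn ) LATIN SMALL LETTER M → LATIN SMALL LETTER R, LATIN SMALL LETTER N
"""

def parse_confusables(text):
    """Hypothetical helper: map each source character to its target string."""
    table = {}
    for line in text.lstrip("\ufeff").splitlines():  # drop a leading BOM if present
        line = line.split("#", 1)[0].strip()          # remove comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        source = chr(int(fields[0], 16))
        # A target may consist of several space-separated code points
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table

table = parse_confusables(SAMPLE)
print(table["\u0430"])  # a
print(table["m"])       # rn
```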

Common Confusable Pairs

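| Character A | Code | Script | Character B | Code | Script |
|---|---|---|---|---|---|
| a | U+0061 | Latin | а | U+0430 | Cyrillic |
| o | U+006F | Latin | о | U+043E | Cyrillic |
| e | U+0065 | Latin | е | U+0435 | Cyrillic |
| p | U+0070 | Latin | р | U+0440 | Cyrillic |
| I | U+0049 | Latin | І | U+0406 | Cyrillic |
| 0 (zero) | U+0030 | Common | О (capital O) | U+041E | Cyrillic |
| 1 (one) | U+0031 | Common | l (lowercase L) | U+006C | Latin |
| rn | U+0072 U+006E | Latin | m | U+006D | Latin |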

Security Implications: IDN Homograph Attacks

The most significant security risk posed by confusables is the Internationalized Domain Name (IDN) homograph attack. An attacker registers a domain like аpple.com where the а is Cyrillic. When a browser displays the name in Unicode, it is visually indistinguishable from apple.com in most fonts.

Modern browsers and registries implement multiple defenses:
1. Domains mixing scripts are shown in Punycode (e.g., xn--pple-43d.com) when the mix is suspicious.
2. Chrome, Firefox, and Safari maintain allowlists of trusted script combinations.
3. ICANN guidelines restrict which script combinations are permitted in registered domain names.
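The Punycode form mentioned above can be reproduced with Python's built-in idna codec. This is a small sketch for illustration; note that the standard-library codec implements the older IDNA 2003 rules, so production code may prefer a dedicated IDNA library.

```python
# The first letter is Cyrillic а (U+0430); the rest is Latin
domain = "\u0430pple.com"

# Convert each label to its ASCII-compatible (Punycode) form
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--pple-43d.com

# The genuine all-Latin domain is left unchanged
print("apple.com".encode("idna").decode("ascii"))  # apple.com
```

Displaying the ASCII form makes the spoofed name obviously different from the genuine one.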

Unicode Security Mechanisms

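The third-party confusable-homoglyphs package for Python exposes checks built on the Unicode confusables data:

```python
# Checking for confusables using the 'confusable-homoglyphs' package
# pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

# is_dangerous() is truthy when a string mixes scripts and contains
# confusable characters; bool() gives a plain True/False for display
result = bool(confusables.is_dangerous('microsоft.com'))  # Cyrillic о
print(result)  # True

result2 = bool(confusables.is_dangerous('microsoft.com'))  # All Latin
print(result2)  # False

# is_confusable() returns False, or a list describing each confusable
# character found outside the preferred scripts
matches = confusables.is_confusable('о', preferred_aliases=['latin'])
print(matches)  # Reports that Cyrillic о has Latin homoglyphs
```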

In JavaScript, Unicode property escapes support a basic mixed-script check:

```javascript
// Normalize identifiers for comparison.
// NFKC folds compatibility variants (e.g., fullwidth letters, ligatures);
// it does NOT map cross-script confusables such as Cyrillic а to Latin a.
function normalizeIdentifier(str) {
  return str.normalize('NFKC');
}

// Detect suspicious mixed-script usernames.
// This only checks the Latin + Cyrillic combination.
function hasMixedScripts(str) {
  const latinPattern = /\p{Script=Latin}/u;
  const cyrillicPattern = /\p{Script=Cyrillic}/u;
  return latinPattern.test(str) && cyrillicPattern.test(str);
}

console.log(hasMixedScripts('аpple'));  // true (Cyrillic а + Latin)
console.log(hasMixedScripts('apple'));  // false
```
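
The limits of NFKC normalization are easy to see in a short Python sketch. It also includes rough_scripts, a hypothetical helper that guesses a script from the first word of each character's Unicode name. That is a heuristic assumed for this example, not the official Script property.

```python
import unicodedata

fullwidth = "\uff41pple"  # FULLWIDTH LATIN SMALL LETTER A + "pple"
cyrillic = "\u0430pple"   # CYRILLIC SMALL LETTER A + "pple"

# NFKC folds the fullwidth compatibility variant to plain ASCII
print(unicodedata.normalize("NFKC", fullwidth) == "apple")  # True
# The Cyrillic letter is untouched, so the strings still differ
print(unicodedata.normalize("NFKC", cyrillic) == "apple")   # False

def rough_scripts(text):
    """Heuristic: use the first word of each letter's Unicode name as its script."""
    scripts = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if ch.isalpha() and name:
            scripts.add(name.split()[0])
    return scripts

print(sorted(rough_scripts(cyrillic)))  # ['CYRILLIC', 'LATIN']
print(sorted(rough_scripts("apple")))   # ['LATIN']
```

Normalization alone is therefore not a confusable defense; it needs to be paired with a script check or a confusable mapping.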

Confusables in Usernames and Identifiers

Beyond domain names, confusables are exploited in:
- Username squatting: Registering usernames that look identical to popular accounts.
- Code injection: Inserting lookalike characters into variable names in source code.
- Phishing: Crafting URLs or display names that appear legitimate.

Unicode Technical Report #36 (Unicode Security Considerations) and UTS #39 (Unicode Security Mechanisms) provide detailed guidelines for building secure identifiers and recommend using skeleton algorithms that map strings to a canonical confusable-normalized form before comparison.
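
A skeleton comparison can be sketched as follows. This is a simplified illustration that roughly follows the normalize, map, normalize shape of the UTS #39 skeleton, not a complete implementation. PROTOTYPES is a small hand-picked excerpt standing in for the full confusables.txt data, and the function names are hypothetical.

```python
import unicodedata

# Tiny illustrative excerpt; real code would load all of confusables.txt
PROTOTYPES = {
    "\u0430": "a",  # Cyrillic а
    "\u043e": "o",  # Cyrillic о
    "\u0435": "e",  # Cyrillic е
    "\u0440": "p",  # Cyrillic р
}

def skeleton(text):
    """Decompose, replace each character by its prototype, decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def looks_alike(a, b):
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)

print(looks_alike("micros\u043eft", "microsoft"))  # True
print(looks_alike("apple", "apply"))               # False
```

A registration system could store the skeleton of each existing username and reject a new name whose skeleton collides with one already taken.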
