SymbolFYI

Confusables (Homoglyphs)

Unicode Standard

Definition

Characters that look similar or identical but have different code points (e.g., Latin 'A' U+0041 vs Cyrillic 'А' U+0410).

What Are Confusables?

Confusables (also called visually confusable characters or homoglyphs) are distinct Unicode characters, often from different scripts, that appear identical or nearly identical when rendered in typical fonts. For example, the Latin lowercase a (U+0061) and the Cyrillic lowercase а (U+0430) look the same in most typefaces, yet they are separate code points with different script properties.
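The difference between two lookalike characters becomes visible once their code points are inspected directly. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is an illustrative assumption, not part of any library.

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, renders like Latin "a" in most fonts


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))       # U+0061 LATIN SMALL LETTER A
print(describe(cyrillic_a))    # U+0430 CYRILLIC SMALL LETTER A
print(latin_a == cyrillic_a)   # False: same appearance, different code points
```

String comparison works on code points, not on glyphs, which is why two names that look the same on screen can coexist in a database as different values.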

Unicode maintains an official list of confusables in the file confusables.txt, published as part of Unicode Security Mechanisms (Unicode Technical Standard #39).
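Each data line in confusables.txt has the shape `source ; target ; type # comment`, with code points written in hexadecimal. The sketch below parses that layout into a dictionary. The two sample lines are embedded for illustration and are believed to mirror real entries; the function name `parse_confusables` is an assumption, and real code would read the full published file instead.

```python
from typing import Dict

# Illustrative sample in the layout of confusables.txt.
SAMPLE = """\
# source ; target ; type # comment
0430 ;\t0061 ;\tMA\t# ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A
043E ;\t006F ;\tMA\t# ( о → o ) CYRILLIC SMALL LETTER O → LATIN SMALL LETTER O
"""


def parse_confusables(text: str) -> Dict[str, str]:
    """Map each source character to its target (prototype) string."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        source = chr(int(fields[0], 16))
        # The target may be a sequence of several space-separated code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


table = parse_confusables(SAMPLE)
print(table["\u0430"])  # a
```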

Common Confusable Pairs

Character A	Code	Script	Character B	Code	Script
`a`	U+0061	Latin	`а`	U+0430	Cyrillic
`o`	U+006F	Latin	`о`	U+043E	Cyrillic
`e`	U+0065	Latin	`е`	U+0435	Cyrillic
`p`	U+0070	Latin	`р`	U+0440	Cyrillic
`I`	U+0049	Latin	`І`	U+0406	Cyrillic
`0` (zero)	U+0030	Common	`О`	U+041E	Cyrillic
`1` (one)	U+0031	Common	`l`	U+006C	Latin
`rn`	U+0072 U+006E	Latin	`m`	U+006D	Latin

Security Implications: IDN Homograph Attacks

The most prominent security risk posed by confusables is the Internationalized Domain Name (IDN) homograph attack. An attacker registers a domain such as аpple.com, where the а is Cyrillic (U+0430); in a browser that displays the name in Unicode form, it is visually indistinguishable from apple.com.

Browsers and registries implement several defenses:

1. Browsers display a domain in Punycode (e.g., xn--pple-43d.com) when its mix of scripts looks suspicious.
2. Chrome, Firefox, and Safari each apply their own rules for which scripts and script combinations may be shown in Unicode form.
3. ICANN guidelines restrict which script combinations registries may permit in registered domain names.
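The Punycode form mentioned above can be reproduced with Python's built-in `idna` codec. This is a minimal sketch: the helper name `to_ascii` is an assumption, and the standard-library codec follows the older IDNA 2003 rules, so its results for some other names may differ from what current browsers produce.

```python
def to_ascii(domain: str) -> str:
    """Encode a domain name to its ASCII (Punycode) form (hypothetical helper)."""
    return domain.encode("idna").decode("ascii")


spoofed = "\u0430pple.com"   # first letter is Cyrillic а (U+0430)
genuine = "apple.com"        # all Latin

print(to_ascii(spoofed))   # xn--pple-43d.com
print(to_ascii(genuine))   # apple.com
```

Comparing the ASCII forms makes the difference obvious even though the Unicode forms look the same on screen.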

Unicode Security Mechanisms

# Python: check for confusables with the third-party 'confusable-homoglyphs' package
# Install with: pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

result = confusables.is_dangerous('microsоft.com')  # Cyrillic о
print(result)  # True

result2 = confusables.is_dangerous('microsoft.com')  # All Latin
print(result2)  # False

# List the confusable characters found in a string (returns False when there are none)
matches = confusables.is_confusable('о', preferred_aliases=['latin'])
print(matches)  # Reports that Cyrillic о is confusable with Latin o

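// JavaScript: normalize identifiers before comparison
// NFKC folds compatibility variants (e.g., fullwidth letters, ligatures),
// but it does not map cross-script lookalikes such as Cyrillic а to Latin a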
function normalizeIdentifier(str) {
  return str.normalize('NFKC');
}

// Detect suspicious mixed-script usernames (checks only the Latin + Cyrillic combination)
function hasMixedScripts(str) {
  const latinPattern = /\p{Script=Latin}/u;
  const cyrillicPattern = /\p{Script=Cyrillic}/u;
  return latinPattern.test(str) && cyrillicPattern.test(str);
}

console.log(hasMixedScripts('аpple'));  // true (Cyrillic а + Latin)
console.log(hasMixedScripts('apple'));  // false

Confusables in Usernames and Identifiers

Beyond domain names, confusables are exploited in:

- Username squatting: Registering usernames that look identical to popular accounts.
- Code injection: Inserting lookalike characters into variable names in source code.
- Phishing: Crafting URLs or display names that appear legitimate.

Unicode Technical Report #36 (Unicode Security Considerations) and UTS #39 (Unicode Security Mechanisms) provide detailed guidelines for building secure identifiers. UTS #39 recommends a skeleton algorithm that maps each string to a confusable-normalized form, so that two strings can be treated as confusable when their skeletons are equal.
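The skeleton idea can be sketched in a few lines: decompose the string with NFD, replace each character with its prototype, then decompose again. The version below is a simplified illustration, not a full UTS #39 implementation. The `PROTOTYPES` table is a tiny hand-picked subset (an assumption for this example, where real code would load all of confusables.txt), and the function names are hypothetical.

```python
import unicodedata

# Hand-picked subset of prototype mappings, for illustration only.
PROTOTYPES = {
    "\u0430": "a",  # Cyrillic а -> Latin a
    "\u043e": "o",  # Cyrillic о -> Latin o
    "\u0435": "e",  # Cyrillic е -> Latin e
    "\u0440": "p",  # Cyrillic р -> Latin p
}


def skeleton(s: str) -> str:
    """Simplified skeleton: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", s)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def are_confusable(a: str, b: str) -> bool:
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)


print(are_confusable("\u0430pple", "apple"))  # True
print(are_confusable("apple", "apply"))       # False
```

A registration system could store the skeleton alongside each username and reject a new name whose skeleton collides with an existing one. The skeleton is meant only for comparison and should not be displayed or stored as the name itself.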
