Zero vs Letter O: Unicode Confusables and Homograph Attacks
Can you tell the difference between these three characters?
0 O О
The first is the digit zero. The second is the Latin capital letter O. The third is the Cyrillic capital letter O (used in Russian, Bulgarian, Ukrainian, and other Slavic languages). In most common fonts at body text size, they look identical. In security research, this trio is notorious. In everyday web development, confusing them causes subtle bugs that are infuriating to debug.
This article explores where the confusion comes from, how it's exploited, and how to detect and defend against confusable character mix-ups.
The Characters
| Character | Name | Unicode | Script | Category |
|---|---|---|---|---|
| 0 | Digit Zero | U+0030 | Common | Decimal Number (Nd) |
| O | Latin Capital Letter O | U+004F | Latin | Uppercase Letter (Lu) |
| o | Latin Small Letter O | U+006F | Latin | Lowercase Letter (Ll) |
| О | Cyrillic Capital Letter O | U+041E | Cyrillic | Uppercase Letter (Lu) |
| о | Cyrillic Small Letter O | U+043E | Cyrillic | Lowercase Letter (Ll) |
| ο | Greek Small Letter Omicron | U+03BF | Greek | Lowercase Letter (Ll) |
| Ο | Greek Capital Letter Omicron | U+039F | Greek | Uppercase Letter (Lu) |
| ０ | Fullwidth Digit Zero | U+FF10 | Common | Decimal Number (Nd) |
| Ｏ | Fullwidth Latin Capital Letter O | U+FF2F | Latin | Uppercase Letter (Lu) |
The confusable set is larger than most people realize. Latin O, Cyrillic О, and Greek Ο (omicron) are visually indistinguishable in most sans-serif fonts. Add the fullwidth variants and you have at least nine characters that render as the same round shape.
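Because the glyphs themselves give nothing away, the dependable way to tell these characters apart is to look at their code points. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is an illustrative choice, not part of any library.

```python
import unicodedata

def describe(char: str) -> str:
    """Return the code point, general category, and Unicode name of one character."""
    code_point = f"U+{ord(char):04X}"
    category = unicodedata.category(char)
    name = unicodedata.name(char, "<unnamed>")
    return f"{code_point} {category} {name}"

# Escapes keep the lookalikes unambiguous in source code:
# digit zero, Latin O, Cyrillic O, Greek Omicron, fullwidth zero.
for char in ["0", "O", "\u041e", "\u039f", "\uff10"]:
    print(describe(char))
```

Writing the lookalikes as `\u` escapes rather than literal characters is deliberate: a literal Cyrillic О in source code is exactly as hard to spot as it is in a URL.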
Why Fonts (Sometimes) Distinguish Zero from O
The Slashed Zero
The most common typographic solution for distinguishing the digit zero from the letter O is the slashed zero (0̸): a zero with a diagonal slash through it. This convention originated with handwriting standards and was adopted in contexts where ambiguity is costly:
- Aviation and military communication (where misreading a code can be catastrophic)
- Technical drawings and engineering documents
- Early computer fonts designed for programmers
Many monospace programming fonts include a slashed or dotted zero:
- JetBrains Mono: Slashed zero variant available
- Fira Code: Slashed or dotted zero
- Cascadia Code: Dotted zero
- Consolas: Slashed zero
- Courier New: Traditional monospace — zero and O are visually similar
The Dotted Zero
An alternative to the slash is a dot inside the zero (approximated here by ⊙). Fonts that use this approach include Input Mono and several terminal-specific typefaces.
Variable Proportions
In proportional (non-monospace) fonts, zero is typically slightly narrower than O. This is the primary visual distinction in body text fonts — but it's subtle enough that most readers will not consciously notice it.
The IDN Homograph Attack
This is where lookalike characters become a serious security vulnerability. The Internationalized Domain Name (IDN) homograph attack exploits two facts: IDNs allow domain names to contain characters from many scripts, and characters from different scripts can look identical.
How It Works
- An attacker registers a domain where one or more Latin characters are replaced by visually identical characters from another script (most commonly Cyrillic):

      Legitimate: apple.com (all Latin characters)
      Malicious: аpple.com (Cyrillic 'а' U+0430 instead of Latin 'a')

- The malicious domain is rendered in browsers as apple.com to most users, because the Cyrillic а is visually identical to the Latin a in most fonts.
- Users click a link to what they believe is a trusted site, enter credentials, and are phished.
Zero Specifically in Homograph Attacks
The digit 0 is used in homograph attacks in a slightly different way — replacing the letter O in non-IDN contexts:
Legitimate: https://accounts.google.com
Malicious: https://acc0unts.g00gle.com (zeros replacing letters)
This is less about IDN and more about visual confusion in phishing emails and SMS messages, where character rendering may be poor or the user is reading quickly.
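The Cyrillic О is more typical of IDN attacks because it is a genuine letter from a legitimate script and can be rendered identically to the Latin O. The digit zero, by contrast, is plain ASCII: a name like g00gle.com is an ordinary domain that never goes through IDN processing, so browser IDN protections do not apply to it. The substitution is simply easier for an attentive reader to spot.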
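Since digit substitutions are plain ASCII, they can be checked with simple string translation. The sketch below is one possible approach, not an established algorithm: the lookalike map `DIGIT_LOOKALIKES`, the watchlist `PROTECTED_NAMES`, and the function name are all hypothetical and would need tuning for a real deployment.

```python
from typing import List

# Hypothetical digit-to-letter lookalike map and watchlist, for illustration only.
DIGIT_LOOKALIKES = str.maketrans({"0": "o", "1": "l", "5": "s"})
PROTECTED_NAMES = {"google", "accounts", "paypal"}

def suspicious_labels(hostname: str) -> List[str]:
    """Return labels that match a protected name only after digits are swapped for letters."""
    flagged = []
    for label in hostname.lower().split("."):
        normalized = label.translate(DIGIT_LOOKALIKES)
        # Flag only labels that changed, so the genuine name is never reported.
        if normalized != label and normalized in PROTECTED_NAMES:
            flagged.append(label)
    return flagged

print(suspicious_labels("acc0unts.g00gle.com"))
print(suspicious_labels("accounts.google.com"))
```

The first call reports both altered labels; the second reports nothing, because the labels are unchanged by the translation.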
Real-World Example
In 2017, security researcher Xudong Zheng demonstrated a practical IDN homograph attack against Chrome and Firefox. The proof-of-concept used a domain made entirely of Cyrillic characters that rendered visually as apple.com in the browser's address bar:
xn--80ak6aa92e.com
This Punycode-encoded label decodes to a string that looks exactly like apple, with a Cyrillic character in place of every Latin letter. Because the label stays within a single script, checks that look only for mixed scripts do not catch it. Chrome has since changed how it displays such whole-script lookalike domains, but the underlying problem of visually identical characters from different scripts remains unsolved at the Unicode level.
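The label can be decoded locally to see what it really contains. This is a minimal sketch using the `punycode` codec from Python's standard library; the helper name `decode_idn_label` is illustrative, and it handles a single label rather than a full hostname.

```python
import unicodedata

def decode_idn_label(label: str) -> str:
    """Decode one ACE ("xn--") label with the standard-library punycode codec."""
    prefix = "xn--"
    if not label.lower().startswith(prefix):
        return label  # plain ASCII label, nothing to decode
    return label[len(prefix):].encode("ascii").decode("punycode")

decoded = decode_idn_label("xn--80ak6aa92e")
for char in decoded:
    # Print the code point and official name of each decoded character.
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
```

Every line of output names a Cyrillic letter, including the palochka (U+04CF) standing in for the Latin l.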
Unicode's Official Confusable Data
The Unicode Consortium maintains an official confusables dataset — a machine-readable list of character pairs that are visually similar. It is part of the Unicode Security Mechanisms (Unicode Technical Standard #39).
The confusables data lists, for example:
0 (U+0030) → confusable with: O (U+004F), О (U+041E), Ο (U+039F)
O (U+004F) → confusable with: 0 (U+0030), О (U+041E), Ο (U+039F)
This dataset is used by domain registrars, browser address bars, and security tools to detect potential homograph attacks.
Python's standard unicodedata module does not expose this data, but the third-party confusable-homoglyphs package does:

    from confusable_homoglyphs import confusables

    # Is '0' (script: Common) confusable with any Latin character?
    # Returns a list of confusable character data if so, otherwise False.
    confusables.is_confusable('0', preferred_aliases=['latin'])

    # List everything confusable with 'O'. Leave preferred_aliases empty here:
    # characters that already belong to a preferred script are skipped.
    confusables.is_confusable('O', preferred_aliases=[])
Detecting Mixed-Script Text
Browser and OS-Level Detection
Modern browsers flag IDN domains that mix scripts (e.g., Latin + Cyrillic) in the address bar. Chrome, Firefox, and Safari will show the punycode form (xn--...) rather than the rendered Unicode domain when mixed scripts are detected.
However, this protection applies only to domain names — not to email addresses, link text, passwords, or body content.
Application-Level Detection
If you're building any application that processes user-submitted text (URLs, usernames, email addresses), mixed-script detection is an important security layer:
    import unicodedata
    from typing import Optional

    def get_script(char: str) -> Optional[str]:
        r"""
        Approximate script detection via Unicode character name.
        For production use, consider the 'regex' package with \p{Script=Latin} etc.
        """
        name = unicodedata.name(char, '').upper()
        if 'LATIN' in name:
            return 'Latin'
        if 'CYRILLIC' in name:
            return 'Cyrillic'
        if 'GREEK' in name:
            return 'Greek'
        if 'ARABIC' in name:
            return 'Arabic'
        if 'DIGIT' in name or 'NUMBER' in name:
            return 'Common'
        return 'Other'

    def is_mixed_script(text: str) -> bool:
        """
        Returns True if the text contains characters from more than one
        non-Common script, a potential homograph attack indicator.
        """
        # Look up each character once, then discard the neutral buckets.
        scripts = {get_script(c) for c in text} - {'Common', 'Other', None}
        return len(scripts) > 1

    # Test (the lookalikes are written as escapes so they are visible in source)
    print(is_mixed_script('google.com'))              # False: all Latin + Common
    print(is_mixed_script('g\u043e\u043egle.com'))    # True: Cyrillic о (U+043E) mixed with Latin
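A mixed-script check like `is_mixed_script` has a blind spot: a label written entirely in one non-Latin script, such as the all-Cyrillic apple lookalike, contains only one script and passes. A complementary check is to ask whether every character in a label is a Cyrillic letter that resembles a Latin one. The sketch below assumes a small hand-picked lookalike set (`CYRILLIC_LATIN_LOOKALIKES`), which is illustrative and not an official list.

```python
# Hand-picked Cyrillic letters that resemble Latin a, c, e, i, j, l, o, p, s, x, y.
# An illustrative subset, not an official or complete list.
CYRILLIC_LATIN_LOOKALIKES = set(
    "\u0430\u0441\u0435\u0456\u0458\u04cf\u043e\u0440\u0455\u0445\u0443"
)

def is_whole_script_lookalike(label: str) -> bool:
    """True if every character of a non-empty label is a Latin-looking Cyrillic letter."""
    return bool(label) and all(ch in CYRILLIC_LATIN_LOOKALIKES for ch in label)

print(is_whole_script_lookalike("\u0430\u0440\u0440\u04cf\u0435"))        # Cyrillic "apple" lookalike
print(is_whole_script_lookalike("\u043f\u0440\u0438\u0432\u0435\u0442"))  # ordinary Russian word
print(is_whole_script_lookalike("apple"))                                 # genuine Latin
```

Only the first label is flagged. The ordinary Russian word passes because it contains letters with no Latin twin, which is what keeps legitimate Cyrillic names from being penalized.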
Username Validation
For username fields, common defenses include:
- Whitelist allowed character sets: Only allow [a-zA-Z0-9_-] for ASCII-only systems
- Normalize to NFKC: Unicode normalization can collapse some confusables but not all (Cyrillic О does not normalize to Latin O)
- Script restriction: Require all non-digit characters to come from the same Unicode script
- Confusable detection: Reject usernames that are confusable with existing registered usernames
    import re
    import unicodedata

    def is_safe_username(username: str) -> bool:
        """
        Basic safe username validation:
        - Allow only ASCII letters, digits, underscores, hyphens
        - Reject any characters outside that set
        """
        # fullmatch avoids the '$' pitfall of accepting a trailing newline
        return bool(re.fullmatch(r'[a-zA-Z0-9_-]+', username))

    def normalize_username(username: str) -> str:
        """
        NFKC normalization + lowercase for comparison purposes.
        Does NOT collapse all confusables.
        """
        return unicodedata.normalize('NFKC', username).lower()
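To catch the confusables that NFKC leaves alone, a registration check can compare "skeletons": each name is reduced to a canonical lookalike form, and two names collide if their skeletons match. The sketch below is a simplified variant of that idea, not the exact UTS #39 skeleton algorithm. The `PROTOTYPES` map is a tiny hand-written assumption; a real system would load the full confusables data.

```python
import unicodedata

# Illustrative prototype map; a real system would load the full UTS #39 confusables data.
PROTOTYPES = {
    "\u041e": "O", "\u039f": "O",                  # Cyrillic O, Greek Omicron
    "\u043e": "o", "\u03bf": "o",                  # Cyrillic o, Greek omicron
    "\u0430": "a", "\u0435": "e", "\u0440": "p",   # Cyrillic a, ie, er
}

def skeleton(text: str) -> str:
    """Simplified skeleton: NFKC, map lookalikes to a prototype, then case-fold."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(PROTOTYPES.get(ch, ch) for ch in normalized).casefold()

def collides(candidate: str, registered: set) -> bool:
    """True if the candidate's skeleton matches any registered name's skeleton."""
    return skeleton(candidate) in {skeleton(name) for name in registered}

registered = {"paypal", "bob"}
print(collides("p\u0430yp\u0430l", registered))  # Cyrillic a in place of Latin a
print(collides("b\uff4fb", registered))          # fullwidth o, already handled by NFKC
print(collides("carol", registered))             # unrelated name
```

The first two candidates collide with existing names and the third does not. Skeletons are meant for comparison only: store and display the name the user actually chose.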
Font Design and the Zero Problem
The visual similarity of 0 and O is not accidental — it reflects a fundamental tension in Latin typography:
- The uppercase letter O is designed to be a near-perfect oval
- The digit zero is also a near-perfect oval
- Historically, some metal typefaces literally used the same physical type for both
The distinction became critically important with computers, where misreading O for 0 (or vice versa) causes program errors. This drove the adoption of slashed zeros and other disambiguation strategies in monospace fonts.
For web projects where code will be displayed — documentation sites, code playgrounds, terminal emulators — using a coding font with a clearly distinct zero is a concrete UX improvement:
    /* For code blocks and monospace content */
    code, pre, .terminal {
      font-family: 'JetBrains Mono', 'Fira Code', 'Cascadia Code',
                   'Source Code Pro', Consolas, monospace;
    }
Use our Character Analyzer to paste any suspicious character and confirm its exact Unicode code point and script — this is the most reliable way to identify whether a character that looks like O is actually the Latin letter, Cyrillic letter, or digit zero.
HTML Encoding Pitfalls
When displaying user-submitted content containing characters from mixed scripts, ensure proper encoding to prevent unexpected rendering:

    <!-- Properly encoded: renders correctly regardless of script -->
    <span>&#x41E;</span> <!-- Cyrillic О (U+041E) -->
    <span>O</span> <!-- Latin O (U+004F), or use &#x4F; -->
    <span>0</span> <!-- Digit Zero (U+0030), or use &#x30; -->
When rendering any user-provided text in HTML, always escape it properly to prevent injection, but be aware that proper escaping does not address the visual confusability problem — a properly escaped Cyrillic О still looks like a Latin O to human readers.
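The gap between escaping and confusability is easy to demonstrate. In the sketch below, `html.escape` from Python's standard library neutralizes the markup but passes the Cyrillic О through unchanged. The helper `render_with_code_points` is a hypothetical mitigation for admin or moderation views, where each non-ASCII character is wrapped in a span whose title shows its code point.

```python
import html

def render_with_code_points(text: str) -> str:
    """Escape text for HTML and tag each non-ASCII character with its code point."""
    parts = []
    for ch in text:
        escaped = html.escape(ch)
        if ord(ch) > 0x7F:
            # Hypothetical markup: the title attribute exposes the code point on hover.
            parts.append(f'<span title="U+{ord(ch):04X}">{escaped}</span>')
        else:
            parts.append(escaped)
    return "".join(parts)

user_input = "<b>\u041eK</b>"  # starts with Cyrillic O (U+041E), not Latin O
print(html.escape(user_input))
print(render_with_code_points(user_input))
```

The first line of output is safe from injection yet still reads as "OK" to a human. The second makes the foreign character inspectable without changing how it renders.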
Quick Reference: Common Confusable Sets
The zero-vs-O family is only one part of a much larger confusable landscape:
| Confusable Group | Characters Involved | Security Risk |
|---|---|---|
| 0 / O / О / Ο | Digit, Latin, Cyrillic, Greek O | High (IDN, phishing) |
| l / 1 / I / ӏ | Lowercase L, digit 1, capital I, Cyrillic palochka | High (phishing, user IDs) |
| rn / m | Two characters "rn" vs letter "m" | Medium (domain spoofing) |
| C / С / Ϲ | Latin, Cyrillic, Greek C/Sigma | Medium (IDN) |
| a / а / α | Latin, Cyrillic, Greek alpha | High (IDN) |
| e / е / ε | Latin, Cyrillic, Greek epsilon | High (IDN) |
| p / р | Latin, Cyrillic | High (IDN) |
| H / Н | Latin, Cyrillic | Medium (IDN) |