Zero vs Letter O: Unicode Confusables and Homograph Attacks

Reference Symbol Showdown October 10, 2023

Can you tell the difference between these three characters?

0   O   О

The first is the digit zero. The second is the Latin capital letter O. The third is the Cyrillic capital letter O (used in Russian, Bulgarian, Ukrainian, and other Slavic languages). In most common fonts at body text size, the two letters are indistinguishable and the digit is close enough to fool a quick glance. In security research, this trio is notorious. In everyday web development, confusing them causes subtle bugs that are infuriating to debug.

This article explores where the confusion comes from, how it's exploited, and how to detect and defend against confusable character mix-ups.

The Characters

Character	Name	Unicode	Script	Category
0	Digit Zero	U+0030	Common	Decimal Number (Nd)
O	Latin Capital Letter O	U+004F	Latin	Uppercase Letter (Lu)
o	Latin Small Letter O	U+006F	Latin	Lowercase Letter (Ll)
О	Cyrillic Capital Letter O	U+041E	Cyrillic	Uppercase Letter (Lu)
о	Cyrillic Small Letter O	U+043E	Cyrillic	Lowercase Letter (Ll)
ο	Greek Small Letter Omicron	U+03BF	Greek	Lowercase Letter (Ll)
Ο	Greek Capital Letter Omicron	U+039F	Greek	Uppercase Letter (Lu)
０	Fullwidth Digit Zero	U+FF10	Common	Decimal Number (Nd)
Ｏ	Fullwidth Latin Capital Letter O	U+FF2F	Latin	Uppercase Letter (Lu)

The confusable set is larger than most people realize. Latin O, Cyrillic О, and Greek Ο (omicron) are visually indistinguishable in most sans-serif fonts. Add the fullwidth variants and you have at least nine characters that render as the same round shape.
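Since the glyphs cannot be told apart by eye, the reliable way to identify one is to ask for its code point. The following is a minimal sketch using only Python's standard unicodedata module; the helper name describe and the constant LOOKALIKES are illustrative, not part of any library.

```python
import unicodedata

# The nine look-alikes from the table above, written as escapes so the
# source code itself cannot be misread.
LOOKALIKES = "\u0030\u004f\u006f\u041e\u043e\u03bf\u039f\uff10\uff2f"

def describe(char: str) -> str:
    """Return 'U+XXXX  Category  NAME' for a single character."""
    return f"U+{ord(char):04X}  {unicodedata.category(char)}  {unicodedata.name(char)}"

for ch in LOOKALIKES:
    print(describe(ch))
```

Each line of output names a different character even though the glyphs would look the same on screen.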

Why Fonts (Sometimes) Distinguish Zero from O

The Slashed Zero

The most common typographic solution for distinguishing the digit zero from the letter O is the slashed zero (0̸): a zero with a diagonal slash through it. This convention originated with handwriting standards and was adopted in contexts where ambiguity is costly:

Aviation and military communication (where misreading a code can be catastrophic)
Technical drawings and engineering documents
Early computer fonts designed for programmers

Many monospace programming fonts include a slashed or dotted zero:

JetBrains Mono: Slashed zero variant available
Fira Code: Slashed or dotted zero
Cascadia Code: Dotted zero
Consolas: Slashed zero
Courier New: Traditional monospace — zero and O are visually similar

The Dotted Zero

An alternative to the slash: a dot inside the zero ⊙. Fonts that use this approach include Input Mono and several terminal-specific typefaces.

Variable Proportions

In proportional (non-monospace) fonts, zero is typically slightly narrower than O. This is the primary visual distinction in body text fonts — but it's subtle enough that most readers will not consciously notice it.

The IDN Homograph Attack

This is where the zero-vs-O confusion becomes a serious security vulnerability. The Internationalized Domain Name (IDN) homograph attack exploits the fact that Unicode allows domain names to contain characters from any script — and that characters from different scripts can look identical.

How It Works

An attacker registers a domain where one or more Latin characters are replaced by visually identical characters from another script (most commonly Cyrillic):

Legitimate: apple.com  (all Latin characters)
Malicious:  аpple.com  (Cyrillic 'а' U+0430 instead of Latin 'a')

Most users see the malicious domain rendered as apple.com, because the Cyrillic а is visually identical to the Latin a in most fonts.
Users click a link to what they believe is a trusted site, enter credentials, and are phished.
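The substitution is easy to see once the name is converted to the ASCII form that DNS uses. The sketch below relies on Python's built-in idna codec, which implements the older IDNA 2003 rules; it is meant as an illustration of the encoding step, not as a validator.

```python
# Cyrillic а (U+0430) followed by Latin "pple", written with an escape so
# the substitution is visible in the source.
spoofed = "\u0430pple.com"
genuine = "apple.com"

print(spoofed == genuine)             # False: the first code point differs

ascii_form = spoofed.encode("idna")   # the xn-- label that DNS actually sees
print(ascii_form)
print(genuine.encode("idna"))         # a pure-ASCII name passes through unchanged
```

Comparing the ASCII forms, rather than the rendered text, is what exposes the spoof.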

Zero Specifically in Homograph Attacks

The digit 0 is used in homograph attacks in a slightly different way — replacing the letter O in non-IDN contexts:

Legitimate: https://accounts.google.com
Malicious:  https://acc0unts.g00gle.com   (zeros replacing letters)

This is less about IDN and more about visual confusion in phishing emails and SMS messages, where character rendering may be poor or the user is reading quickly.

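The Cyrillic О is the more typical choice for IDN attacks because it is a genuine letter from a legitimate script and is identical to the Latin O in many fonts. The digit zero, by contrast, is plain ASCII: a name like g00gle.com needs no IDN support at all, and the substitution is easier to spot because zero and O differ slightly in most fonts.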
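Digit-for-letter substitution can be caught with a simple fold-and-compare check. This is a rough sketch under stated assumptions: the mapping DIGIT_TO_LETTER, the set PROTECTED, and the function impersonated_name are all hypothetical names invented for illustration, and a real deployment would need a much larger list of protected names.

```python
from typing import Optional

# Hypothetical table of digits commonly swapped in for letters.
DIGIT_TO_LETTER = str.maketrans({"0": "o", "1": "l"})

# Hypothetical set of hostnames worth protecting.
PROTECTED = {"google.com", "accounts.google.com"}

def impersonated_name(hostname: str) -> Optional[str]:
    """Return the protected name this hostname imitates, or None."""
    host = hostname.lower()
    if host in PROTECTED:
        return None  # this is the genuine name, not an imitation
    folded = host.translate(DIGIT_TO_LETTER)
    return folded if folded in PROTECTED else None

print(impersonated_name("acc0unts.g00gle.com"))  # accounts.google.com
print(impersonated_name("accounts.google.com"))  # None
```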

Real-World Example

In 2017, security researcher Xudong Zheng demonstrated a practical IDN homograph attack against Chrome and Firefox. The proof-of-concept used a domain made entirely of Cyrillic characters that rendered visually as apple.com in the browser's address bar:

xn--80ak6aa92e.com

This Punycode-encoded domain decodes to a string that looks exactly like apple.com but uses a Cyrillic character for every letter. Because the name stays within a single script, a mixed-script check alone does not catch it. Chrome has since tightened its IDN display rules, but the underlying problem of visually identical characters from different scripts remains unsolved at the Unicode level.
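The label can be decoded with Python's standard punycode codec to see which characters it really contains. This is a small sketch; it strips the xn-- prefix by hand and does no IDNA validation.

```python
import unicodedata

label = "xn--80ak6aa92e"

# Drop the "xn--" ACE prefix and decode the remainder as raw Punycode.
decoded = label[len("xn--"):].encode("ascii").decode("punycode")

for ch in decoded:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The five code points are U+0430, U+0440, U+0440, U+04CF, and U+0435: every one is Cyrillic, including the palochka (U+04CF) standing in for the letter l.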

Unicode's Official Confusable Data

The Unicode Consortium maintains an official confusables dataset — a machine-readable list of character pairs that are visually similar. It is part of the Unicode Security Mechanisms (Unicode Technical Standard #39).

The confusables data lists, for example:

0 (U+0030)  →  confusable with:  O (U+004F), О (U+041E), Ο (U+039F)
O (U+004F)  →  confusable with:  0 (U+0030), О (U+041E), Ο (U+039F)

This dataset is used by domain registrars, browser address bars, and security tools to detect potential homograph attacks.

The data can be queried programmatically. Python's standard unicodedata module does not include the confusables table, but the third-party confusable-homoglyphs package does:

from confusable_homoglyphs import confusables

# is_confusable() scans a string and returns False, or a list describing the
# first character that has known look-alikes (pass greedy=True to report all)
confusables.is_confusable('0')

# preferred_aliases names the scripts you trust; characters from those scripts
# are skipped, so this call reports the Cyrillic О (U+041E) hiding in Latin text
confusables.is_confusable('G\u041eOGLE', preferred_aliases=['latin'])
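UTS #39 compares strings by reducing each one to a "skeleton": normalize to NFD, replace every character with its prototype from the confusables data, and normalize again. The sketch below follows that shape with a deliberately tiny, hand-written table; PROTOTYPES is an assumption made for this example and is not the real confusables.txt, which has thousands of entries and its own choice of prototypes.

```python
import unicodedata

# Hand-written stand-in for confusables.txt (illustrative only).
PROTOTYPES = {
    "\u041e": "O",  # Cyrillic capital O
    "\u039f": "O",  # Greek capital omicron
    "\u043e": "o",  # Cyrillic small o
    "\u03bf": "o",  # Greek small omicron
    "0": "O",       # digit zero, folded to the letter for this demo
}

def skeleton(text: str) -> str:
    """NFD-normalize, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    folded = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", folded)

def looks_alike(a: str, b: str) -> bool:
    """Treat two strings as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)

print(looks_alike("g\u043e\u043egle", "google"))  # True
print(looks_alike("google", "goggle"))            # False
```

A skeleton is only useful for comparison; it should never be stored or displayed in place of the original string.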

Detecting Mixed-Script Text

Browser and OS-Level Detection

Modern browsers flag IDN domains that mix scripts (e.g., Latin + Cyrillic) in the address bar. Chrome, Firefox, and Safari will show the punycode form (xn--...) rather than the rendered Unicode domain when mixed scripts are detected.

However, this protection applies only to domain names — not to email addresses, link text, passwords, or body content.

Application-Level Detection

If you're building any application that processes user-submitted text (URLs, usernames, email addresses), mixed-script detection is an important security layer:

import unicodedata

def get_script(char: str) -> str:
    r"""
    Approximate script detection via Unicode character name.
    For production use, consider the 'regex' package with \p{Script=Latin} etc.
    (Raw docstring, so the backslash in \p is not treated as an escape.)
    """
    # unicodedata.name() returns '' here for characters without a name
    name = unicodedata.name(char, '').upper()
    if 'LATIN' in name:
        return 'Latin'
    if 'CYRILLIC' in name:
        return 'Cyrillic'
    if 'GREEK' in name:
        return 'Greek'
    if 'ARABIC' in name:
        return 'Arabic'
    if 'DIGIT' in name or 'NUMBER' in name:
        return 'Common'
    return 'Other'  # punctuation, symbols, unnamed characters

def is_mixed_script(text: str) -> bool:
    """
    Returns True if the text contains characters from more than one
    script other than Common/Other: a potential homograph attack indicator.
    """
    # Classify each character once, then discard the neutral buckets
    scripts = {get_script(c) for c in text} - {'Common', 'Other'}
    return len(scripts) > 1

# Test
print(is_mixed_script('google.com'))             # False: all Latin (the dot is ignored)
print(is_mixed_script('g\u043e\u043egle.com'))   # True: Cyrillic о (U+043E) mixed with Latin

Username Validation

For username fields, common defenses include:

Whitelist allowed character sets: Only allow [a-zA-Z0-9_-] for ASCII-only systems
Normalize to NFKC: Unicode normalization can collapse some confusables but not all (Cyrillic О does not normalize to Latin O)
Script restriction: Require all non-digit characters to come from the same Unicode script
Confusable detection: Reject usernames that are confusable with existing registered usernames

import re
import unicodedata

def is_safe_username(username: str) -> bool:
    """
    Basic safe username validation:
    - Allow only ASCII letters, digits, underscores, hyphens
    - Reject any characters outside that set
    """
    # fullmatch() is used because '$' in re.match() also accepts a trailing newline
    return re.fullmatch(r'[a-zA-Z0-9_-]+', username) is not None

def normalize_username(username: str) -> str:
    """
    NFKC normalization + lowercase for comparison purposes.
    Does NOT collapse all confusables.
    """
    return unicodedata.normalize('NFKC', username).lower()
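The limits of NFKC are easy to demonstrate. In this small sketch (the helper nfkc and the sample strings are made up for illustration), the fullwidth variant collapses to plain ASCII while the Cyrillic look-alike survives normalization untouched.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)

fullwidth = "R\uff2f\uff2fT"  # two fullwidth Latin capital O (U+FF2F)
cyrillic = "R\u041e\u041eT"   # two Cyrillic capital O (U+041E)

print(nfkc(fullwidth) == "ROOT")  # True: fullwidth forms fold to ASCII
print(nfkc(cyrillic) == "ROOT")   # False: Cyrillic О is a distinct letter
```

This is why normalization needs to be paired with script restriction or confusable detection rather than used on its own.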

Font Design and the Zero Problem

The visual similarity of 0 and O is not accidental — it reflects a fundamental tension in Latin typography:

The uppercase letter O is designed to be a near-perfect oval
The digit zero is also a near-perfect oval
Historically, some metal typefaces literally used the same physical type for both

The distinction became critically important with computers, where misreading O for 0 (or vice versa) causes program errors. This drove the adoption of slashed zeros and other disambiguation strategies in monospace fonts.

For web projects where code will be displayed — documentation sites, code playgrounds, terminal emulators — using a coding font with a clearly distinct zero is a concrete UX improvement:

/* For code blocks and monospace content */
code, pre, .terminal {
  font-family: 'JetBrains Mono', 'Fira Code', 'Cascadia Code',
               'Source Code Pro', Consolas, monospace;
}

Use our Character Analyzer to paste any suspicious character and confirm its exact Unicode code point and script — this is the most reliable way to identify whether a character that looks like O is actually the Latin letter, Cyrillic letter, or digit zero.

HTML Encoding Pitfalls

When displaying user-submitted content containing characters from mixed scripts, ensure proper encoding to prevent unexpected rendering:

<!-- Properly encoded — renders correctly regardless of script -->
<span>&#1054;</span>  <!-- Cyrillic О (U+041E) -->
<span>O</span>        <!-- Latin O (U+004F) — or use &#79; -->
<span>0</span>        <!-- Digit Zero (U+0030) — or use &#48; -->

When rendering any user-provided text in HTML, always escape it properly to prevent injection, but be aware that proper escaping does not address the visual confusability problem — a properly escaped Cyrillic О still looks like a Latin O to human readers.
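The point can be checked with Python's standard html module. In this sketch, the sample string and the helper char_ref are illustrative: escaping neutralizes the markup, but the Cyrillic letters pass through unchanged and still look like Latin O.

```python
import html

# User input containing markup plus two Cyrillic capital O (U+041E).
user_text = "<b>G\u041e\u041eGLE</b>"

escaped = html.escape(user_text)
print(escaped)  # the tags are escaped; the Cyrillic letters are untouched

def char_ref(char: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(char)};"

print(char_ref("\u041e"))  # &#1054;
print(char_ref("O"))       # &#79;
```

Escaping and confusable detection solve different problems, so a pipeline that handles untrusted names needs both.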

Quick Reference: Common Confusable Sets

The zero-vs-O family is only one part of a much larger confusable landscape:

Confusable Group	Characters Involved	Security Risk
0 / O / О / Ο	Digit, Latin, Cyrillic, Greek O	High (IDN, phishing)
l / 1 / I / ӏ	Lowercase L, digit 1, capital I, Cyrillic palochka	High (phishing, user IDs)
rn / m	Two characters "rn" vs letter "m"	Medium (domain spoofing)
C / С / Ϲ	Latin, Cyrillic, Greek C/Sigma	Medium (IDN)
a / а / α	Latin, Cyrillic, Greek alpha	High (IDN)
e / е / ε	Latin, Cyrillic, Greek epsilon	High (IDN)
p / р	Latin, Cyrillic	High (IDN)
H / Н	Latin, Cyrillic	Medium (IDN)

Next in Series: Unicode has more than 20 different space characters, and they all behave differently. See Space Characters in Unicode: 20+ Invisible Characters Compared.