
Character Encoding Detection: How Browsers and Tools Guess Your Encoding

When a browser opens a file with no encoding declaration, it doesn't give up — it guesses. When the Python library chardet reports "ISO-8859-1 with 73% confidence," that confidence number comes from a real algorithm, not a random estimate. Understanding how encoding detection works helps you understand why it sometimes fails, when you can trust it, and why the answer is almost always to declare your encoding explicitly rather than rely on detection.

Why Detection Is Fundamentally Hard

Encoding detection is an underdetermined problem. Given a sequence of bytes, multiple encodings can produce valid, plausible text. The byte sequence 0xE9 0x74 0xE9 is valid Latin-1 (it decodes to été), valid Windows-1252 (same result), and valid KOI8-R (decoding to entirely different Cyrillic text) — but it is not valid UTF-8, because the lead byte 0xE9 must be followed by two continuation bytes in the 0x80–0xBF range, and 0x74 is not one.
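The ambiguity is easy to demonstrate with Python's built-in codecs (a quick sketch; the names are the encodings as Python knows them):

```python
data = b"\xe9\x74\xe9"

print(data.decode("latin-1"))   # été
print(data.decode("cp1252"))    # été (identical for these bytes)
print(data.decode("koi8_r"))    # ИtИ — entirely different, but still "valid"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xE9 is a 3-byte lead byte and must be followed by continuation bytes
    print(exc)
```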

For purely ASCII content — emails with English text only, HTML with no accented characters, most source code — detection effectively always succeeds, because every ASCII-compatible encoding maps 0x00–0x7F to the same characters. The problem space only exists for bytes above 0x7F.

Short texts are particularly problematic. A 10-byte sequence provides little statistical information, and chardet's confidence for short inputs can drop as low as 5%. A single 0xE9 byte could be Latin-1 é, Windows-1252 é, or the beginning of a multi-byte UTF-8 sequence that was truncated.

Some encodings are ambiguous by design. Latin-1 assigns a character to every possible byte value (0x00–0xFF), so any sequence of bytes is "valid" Latin-1. There's no structural way to prove a sequence isn't Latin-1 — you can only say it's more or less likely to be Latin-1 based on which characters appear.
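This property is easy to verify: Latin-1 decodes every possible byte value without error.

```python
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")  # never raises — every byte maps to a character
assert len(text) == 256
```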

The Browser Encoding Detection Algorithm

The HTML specification defines a precise encoding detection algorithm that browsers must implement. It proceeds through these steps in order, stopping at the first definitive result:

Step 1: BOM Sniffing

Check the first 2–4 bytes for a Byte Order Mark:

BOM bytes      Encoding
EF BB BF       UTF-8
FF FE          UTF-16 LE
FE FF          UTF-16 BE
00 00 FE FF    UTF-32 BE
FF FE 00 00    UTF-32 LE

A UTF-8 BOM is definitive. If found, the browser uses UTF-8 and discards the BOM bytes. UTF-16 BOMs are equally definitive. This step always wins if a BOM is present.

Step 2: HTTP Content-Type Header

The HTTP response header takes priority over everything in the document itself (except the BOM). If the server sends:

Content-Type: text/html; charset=windows-1251

The browser uses Windows-1251, even if the HTML contains <meta charset="utf-8">. This is the correct behavior — the server should know what encoding it's serving.

The charset parameter is matched case-insensitively and handles some alternate names:

charset=utf-8          → UTF-8
charset=UTF-8          → UTF-8
charset=iso-8859-1     → Windows-1252 (spec mandates this substitution)
charset=latin1         → Windows-1252 (same)
charset=x-sjis         → Shift-JIS

Note that iso-8859-1 is silently remapped to Windows-1252. The spec does this because Latin-1 and Windows-1252 are identical in the 0x00–0x7F range, and Windows-1252 is strictly a superset of Latin-1 in the 0x80–0x9F range (Latin-1 has control characters there; Windows-1252 has useful characters). Almost no content on the web actually uses those Latin-1 control characters.
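The practical difference shows up with Windows "smart quotes", which live exactly in that 0x80–0x9F range:

```python
smart = b"\x93quoted\x94"               # 0x93/0x94: curly quotes in Windows-1252
print(smart.decode("cp1252"))           # “quoted”
print(repr(smart.decode("latin-1")))    # '\x93quoted\x94' — invisible C1 controls
```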

Step 3: Meta Tag Prescan

Before the document is fully parsed, the browser scans the first 1024 bytes looking for a <meta charset> or <meta http-equiv="Content-Type"> declaration:

<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

The prescan is intentionally simplified — it doesn't build a full DOM tree. It looks for the byte sequence <meta followed by attributes containing charset. This is why the charset declaration should appear as early in the <head> as possible, before any non-ASCII characters.

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">  <!-- This must come first -->
  <title>Page with accented chars: café</title>

If café appeared before the charset declaration, the browser would have already started parsing with its guessed encoding.

Step 4: Statistical Analysis (Encoding Prescan)

If no charset has been determined, browsers fall back to content-based statistical detection. Firefox originally used nsUniversalDetector (since replaced by the Rust-based chardetng), and Chromium ships its own detector; both trace back to the statistical approach developed at Mozilla.

The algorithm analyzes:

  • Byte frequency distributions: Japanese text in UTF-8 has characteristic 3-byte sequences in the E3–E9 range. Chinese in GB2312 has characteristic high-byte pairs. Russian in KOI8-R has a different high-byte distribution than Windows-1251.
  • N-gram models: Common 2-byte and 3-byte sequences for each language/encoding are pre-computed. The input is scored against each model.
  • Structural patterns: UTF-8 requires alternating lead and continuation bytes. Shift-JIS has characteristic 2-byte sequences that don't overlap with valid UTF-8.
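A toy version of the structural check — not a real detector's model, just an illustration of scoring candidate encodings:

```python
def score(data: bytes, encoding: str) -> float:
    """Toy score: 0.0 for structurally invalid decodes, then penalize
    C1 control characters (0x80-0x9F), which almost never appear in
    real text. Real detectors use n-gram frequency models on top."""
    try:
        text = data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return 0.0
    if not text:
        return 0.0
    suspicious = sum(1 for ch in text if 0x80 <= ord(ch) <= 0x9F)
    return 1.0 - suspicious / len(text)

sample = b"\x93quoted\x94"
print(score(sample, "utf-8"))    # 0.0 — structurally invalid UTF-8
print(score(sample, "cp1252"))   # 1.0
print(score(sample, "latin-1"))  # 0.75 — two C1 control characters
```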

Step 5: Default Encoding

If statistical analysis fails or produces no clear winner, the browser falls back to a locale-based default — typically Windows-1252 for Western locales, Shift-JIS for Japanese locales, GB2312 for Chinese. This is the "last resort" that explains why the same file can display differently in browsers with different locale settings.

The chardet Library

chardet is a Python port of Mozilla's encoding detection algorithm. It's the most widely used encoding detector in Python and produces a confidence score along with the encoding name:

import chardet

with open('mystery.txt', 'rb') as f:
    raw_bytes = f.read()

result = chardet.detect(raw_bytes)
print(result)
# {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
# {'encoding': 'UTF-8', 'confidence': 0.99, 'language': ''}
# {'encoding': 'SHIFT_JIS', 'confidence': 0.87, 'language': 'Japanese'}

The confidence score is not a probability — it's a normalized score from the statistical model. A score of 0.99 means the byte patterns are overwhelmingly consistent with UTF-8 (which is usually definitive — if a sequence is valid UTF-8 and uses multi-byte characters, it's almost certainly UTF-8). A score of 0.50 means two encodings are equally plausible.

def safe_decode(raw_bytes: bytes, fallback: str = 'utf-8') -> tuple[str, str]:
    """Decode bytes with chardet, returning (text, encoding_used)."""
    # Try UTF-8 first — it's unambiguous when valid
    try:
        return raw_bytes.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass

    # Fall back to chardet
    detected = chardet.detect(raw_bytes)
    encoding = detected.get('encoding') or fallback
    confidence = detected.get('confidence', 0)

    if confidence < 0.5:
        # Low confidence — use fallback
        encoding = fallback

    return raw_bytes.decode(encoding, errors='replace'), encoding

charset-normalizer: The Modern Alternative

charset-normalizer is a newer library that claims better accuracy than chardet for edge cases, particularly for languages with limited statistical data. It uses a different approach based on Unicode properties rather than byte-level n-gram models:

from charset_normalizer import from_bytes, from_path

# Detect from bytes
result = from_bytes(raw_bytes).best()
print(result.encoding)      # 'utf-8'
print(result.encoding_aliases)  # ['utf8', 'u8', ...]

# Detect from file
result = from_path('mystery.txt').best()
print(result)

charset-normalizer is now the default detector in the requests library, replacing chardet. For new projects, prefer charset-normalizer.

Confidence Scores in Practice

How much should you trust a 73% confidence score? The answer depends on context:

Confidence    Interpretation     Action
> 0.99        Near-certain       Decode directly
0.90–0.99     High confidence    Decode, log for review
0.70–0.90     Moderate           Decode but verify critical data
0.50–0.70     Low                Consider multiple candidates, manual review
< 0.50        Very low           Probably ASCII-only or insufficient data

The 0.99 case for UTF-8 is special. If a byte sequence is valid UTF-8 and contains multi-byte sequences, it's almost certainly UTF-8 — no other common encoding produces the same byte patterns in the 0x80–0xBF range.

def is_likely_utf8(data: bytes) -> bool:
    """Fast UTF-8 check — more reliable than statistical detection."""
    try:
        data.decode('utf-8')
        # Valid UTF-8, but might be ASCII-only (need to check for multi-byte)
        return any(b > 0x7F for b in data)
    except UnicodeDecodeError:
        return False

Language-Specific Detection Challenges

Japanese is the hardest case. Modern Japanese text may be in UTF-8, Shift-JIS, EUC-JP, or ISO-2022-JP. A short text in Hiragana/Katakana is ambiguous between UTF-8 and Shift-JIS. Longer texts with Kanji become more distinctive. Japanese detection has around 85–90% accuracy in practice for texts over 1KB.
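The ambiguity is visible in the raw bytes — the same three characters produce four entirely different byte sequences:

```python
s = "日本語"
for enc in ("utf-8", "shift_jis", "euc_jp", "iso2022_jp"):
    # .hex(" ") shows the bytes space-separated for comparison
    print(f"{enc:12} {s.encode(enc).hex(' ')}")
```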

Chinese has similar challenges between UTF-8, GB2312/GBK (Simplified), and Big5 (Traditional). The script (Simplified vs Traditional characters) helps distinguish GB2312 from Big5, but UTF-8 uses distinct byte patterns that make it unambiguous if multi-byte sequences are present.

Arabic/Hebrew RTL scripts in older content may be Windows-1256 or Windows-1255 respectively. Detection accuracy is moderate — these encodings cover a much smaller byte range than CJK encodings, providing less statistical signal.

Central/Eastern European languages (Polish, Czech, Hungarian) are frequently in Windows-1250, ISO-8859-2, or UTF-8. The distributions are similar enough that confidence scores rarely exceed 0.80.

The X-Content-Type-Options Problem

Browsers sniff encoding but also sniff MIME types, and the two interact. The HTTP header X-Content-Type-Options: nosniff prevents MIME-type sniffing but does not prevent encoding detection — those are separate mechanisms. There's no equivalent header to force encoding detection off; the only reliable approach is to declare encoding in the Content-Type header.

# Django: Always set charset in responses
from django.http import HttpResponse

def my_view(request):
    return HttpResponse(
        content,
        content_type='text/html; charset=utf-8'
    )

Detecting Encoding in Your Workflow

When you receive files from external sources without declared encodings, build detection into your ingestion pipeline:

import chardet
from pathlib import Path

def ingest_text_file(path: Path) -> str:
    raw = path.read_bytes()

    # Try UTF-8 first (fast, unambiguous)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        pass

    # Use chardet for other encodings
    detected = chardet.detect(raw)
    encoding = detected['encoding']
    confidence = detected['confidence']

    if not encoding:
        raise ValueError(f"Could not detect encoding for {path}")

    if confidence < 0.7:
        import warnings
        warnings.warn(
            f"Low confidence ({confidence:.0%}) encoding detection: "
            f"{encoding} for {path}"
        )

    return raw.decode(encoding, errors='replace')

For batch processing, use charset-normalizer's CLI tool:

# Install
pip install charset-normalizer

# Detect encoding of a file
normalizer mystery.txt

# Normalize to UTF-8 in-place
normalizer --normalize mystery.txt

Paste any text with unknown encoding into our Encoding Converter to inspect the raw bytes and see how the text decodes under different encodings side by side.

Best Practices

The fundamental lesson from encoding detection is that you should never need it in systems you control. Detection is for legacy data ingestion and external files — not for your own application stack.

Always declare encoding. Every file you write, every HTTP response you send, every database connection you open should have an explicit encoding declaration. UTF-8 everywhere eliminates the need for detection.

Validate at input boundaries. When accepting file uploads or external data, detect or validate encoding at ingestion, convert to UTF-8, and store UTF-8 throughout. Never propagate unknown encodings into your system.

Don't trust detection for short texts. Statistical detectors need at least 200–500 bytes of non-ASCII content to produce reliable results. For short inputs, use explicit declaration or ask the sender.

Monitor low-confidence detections. Log any file where detection confidence is below 0.8. These are your edge cases, and accumulating them will help you identify patterns in upstream systems that are sending incorrect encodings.


Next in Series: UTF-16 and Surrogate Pairs: Why JavaScript Strings Are Complicated — how JavaScript's internal UTF-16 encoding affects string length, emoji handling, and regex matching.
