SymbolFYI

Character Encoding Detection: How to Identify Unknown Text Encoding

Encoding detection is the art of determining how bytes map to characters when no one told you. It arises in every project that processes files or data from the real world: CSV uploads from Excel, legacy database dumps, scraped web content, email attachments, EDI feeds. The bad news: it is fundamentally an unsolvable problem in general. The good news: a hierarchy of reliable signals handles most real-world cases correctly.

Why Encoding Detection Is Hard

Consider the byte sequence 0xC0 0xA9. In ISO-8859-1 and Windows-1252, these decode to two characters: À and ©. In MacRoman, they decode to ¿ and ©. In UTF-8, 0xC0 is an invalid start byte, so the sequence cannot be UTF-8 at all. Any heuristic that picks among the remaining candidates is making a probabilistic guess.

The problem is that many encodings share overlapping byte values, and short samples do not contain enough statistical signal to distinguish them reliably. A 10-character English filename is byte-for-byte identical in ISO-8859-1 and Windows-1252. Shift-JIS and EUC-JP use overlapping byte ranges, so a short Japanese document can decode plausibly under either. True encoding detection is approximate, not exact.
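The ambiguity is easy to demonstrate with Python's standard codecs: the same two bytes decode "successfully" under several single-byte encodings, each producing different characters, and only UTF-8 rejects them outright.

```python
# Decode the same byte pair under each candidate encoding.
data = b'\xc0\xa9'

for enc in ('iso-8859-1', 'windows-1252', 'mac_roman', 'utf-8'):
    try:
        print(f'{enc:13} -> {data.decode(enc)}')
    except UnicodeDecodeError:
        print(f'{enc:13} -> (invalid byte sequence)')
```

Three of the four decodes succeed without error, which is exactly why "just try decoding" cannot settle the question on its own.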

The Detection Hierarchy

Apply signals in this order, stopping as soon as you get a definitive answer:

  1. Explicit declaration from a trusted source
  2. Byte Order Mark (BOM)
  3. Protocol/metadata encoding declaration
  4. Statistical heuristics (chardet, ICU, uchardet)
  5. Contextual hints (file extension, source locale, user input)
  6. Fallback default (UTF-8 for new systems, or ask the user)

1. Explicit declaration

If the source of the data explicitly declares an encoding through a reliable channel, trust it:

# Database column with declared encoding
cursor.execute("SHOW CREATE TABLE users")  # MySQL: charset in schema

# Python source file: PEP 263 encoding declaration
# -*- coding: utf-8 -*-

# HTTP response: Content-Type header
response.headers.get('Content-Type')
# 'text/html; charset=windows-1252'

The operative word is "trusted." A file that declares itself UTF-8 in its HTML meta tag but was actually saved as ISO-8859-1 is lying. Declarations from sources you control are reliable; declarations embedded in user-uploaded content are a hint, not a guarantee.

2. BOM sniffing

A BOM (Byte Order Mark) is a specific byte sequence at the start of a file that unambiguously identifies both the encoding and, for UTF-16/UTF-32, the byte order:

BOM bytes      Encoding
EF BB BF       UTF-8
FF FE          UTF-16 LE
FE FF          UTF-16 BE
FF FE 00 00    UTF-32 LE
00 00 FE FF    UTF-32 BE

def detect_bom(data: bytes) -> tuple[str | None, int]:
    """
    Returns (encoding, bom_length) or (None, 0) if no BOM found.
    """
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig', 3
    if data.startswith(b'\xff\xfe\x00\x00'):
        return 'utf-32-le', 4
    if data.startswith(b'\x00\x00\xfe\xff'):
        return 'utf-32-be', 4
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le', 2
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be', 2
    return None, 0

with open('file.txt', 'rb') as f:
    data = f.read()

encoding, bom_length = detect_bom(data)
if encoding:
    text = data[bom_length:].decode(encoding.replace('-sig', ''))

In Python, the utf-8-sig codec automatically handles the UTF-8 BOM:

# utf-8-sig strips the BOM on read, adds it on write:
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()  # BOM automatically stripped

UTF-8 BOMs are optional and controversial. Linux and web standards discourage them; Windows tools (Excel, Notepad) add them by default. Expect them in any file produced by Windows software.

3. Protocol declarations

HTTP uses Content-Type headers; HTML has <meta charset> and <meta http-equiv="Content-Type">; XML has the <?xml?> declaration:

import re
from urllib.request import urlopen

def get_encoding_from_http(url: str) -> str | None:
    response = urlopen(url)
    content_type = response.headers.get('Content-Type', '')
    # 'text/html; charset=utf-8'
    match = re.search(r'charset=([^\s;]+)', content_type, re.IGNORECASE)
    return match.group(1) if match else None

def get_encoding_from_html_meta(html_bytes: bytes) -> str | None:
    """
    Scan the first 1024 bytes for an HTML encoding declaration.
    Must be done before full decoding (chicken-and-egg problem).
    """
    head = html_bytes[:1024]
    # <meta charset="utf-8">
    match = re.search(rb'<meta[^>]+charset=["\']?([^"\'\s;>]+)', head, re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii', errors='ignore')
    return None

The HTML meta charset declaration is a bootstrap mechanism: the parser reads ASCII-compatible bytes until it finds the charset, then restarts with the declared encoding. This is why the <meta charset> must appear in the first 1024 bytes of the document.

4. Statistical heuristics

When no explicit declaration is available, use statistical analysis:

pip install chardet charset-normalizer

import chardet

with open('unknown.txt', 'rb') as f:
    data = f.read()

result = chardet.detect(data)
print(result)
# {'encoding': 'EUC-JP', 'confidence': 0.99, 'language': 'Japanese'}
# {'encoding': 'windows-1252', 'confidence': 0.73, 'language': ''}

charset-normalizer is a more modern alternative with better accuracy:

from charset_normalizer import from_bytes, from_path

# From bytes
result = from_bytes(data).best()
print(result.encoding)    # 'utf-8'
print(result.chaos)       # 0.0 — lower is better
print(result.coherence)   # 0.99 — higher is better

# From file
result = from_path('unknown.txt').best()
if result:
    text = str(result)

Interpreting confidence scores

def decode_unknown(data: bytes, fallback: str = 'utf-8') -> tuple[str, str]:
    """
    Returns (decoded_text, encoding_used).
    Raises ValueError if detection confidence is too low.
    """
    # Always try UTF-8 first — it's self-validating
    try:
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass

    result = chardet.detect(data)
    encoding = result.get('encoding')
    confidence = result.get('confidence', 0)

    if not encoding:
        return data.decode(fallback, errors='replace'), fallback

    if confidence < 0.7:
        # Low confidence: warn and use replacement characters
        import warnings
        warnings.warn(
            f"Low encoding confidence ({confidence:.0%}) for '{encoding}'. "
            f"Using replacement characters for undecodable bytes.",
            UnicodeWarning,
            stacklevel=2,
        )
        return data.decode(encoding, errors='replace'), encoding

    return data.decode(encoding), encoding

The chardet limitations

chardet originated from Mozilla's Universal Charset Detector. Its limitations:

  • Short samples: needs at least a few hundred bytes for reliable detection; fewer than 100 bytes is unreliable for most encodings
  • ASCII-compatible encodings: cannot distinguish ISO-8859-1 from Windows-1252 from ISO-8859-15 on purely ASCII content
  • East Asian ambiguity: Shift-JIS vs. EUC-JP vs. GB2312 detection requires sufficient non-ASCII content
  • Single-byte encodings: dozens of ISO-8859-x and Windows-125x variants look statistically similar
  • UTF-16 without BOM: frequently misdetected or mistaken for binary data; always add a BOM when writing UTF-16 so readers do not have to guess the byte order

5. Contextual hints

When statistical detection gives low confidence, use context:

import chardet

LOCALE_ENCODING_HINTS = {
    'ja': ['shift_jis', 'euc-jp', 'iso-2022-jp'],
    'zh': ['gb2312', 'gb18030', 'big5'],
    'ko': ['euc-kr', 'iso-2022-kr'],
    'ru': ['windows-1251', 'koi8-r', 'iso-8859-5'],
    'ar': ['windows-1256', 'iso-8859-6'],
    'th': ['tis-620', 'windows-874'],
    'tr': ['windows-1254', 'iso-8859-9'],
    'el': ['windows-1253', 'iso-8859-7'],
}

def detect_with_locale_hint(data: bytes, locale_hint: str | None = None) -> str:
    """
    Try locale-specific encodings before falling back to chardet.
    """
    # 1. Try UTF-8 (always)
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass

    # 2. Try locale-hinted encodings
    if locale_hint:
        lang = locale_hint.split('-')[0].lower()
        for enc in LOCALE_ENCODING_HINTS.get(lang, []):
            try:
                data.decode(enc)
                return enc
            except UnicodeDecodeError:
                continue

    # 3. Fall back to chardet
    result = chardet.detect(data)
    return result.get('encoding', 'utf-8')

Detecting Mojibake

Mojibake (文字化け — "garbled text") is the result of decoding text with the wrong encoding. It looks like garbage characters:

Correct:  "Héllo"
Mojibake: "HÃ©llo"  (UTF-8 bytes decoded as ISO-8859-1)
Mojibake: "H鑒llo"   (completely wrong encoding)
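The classic UTF-8-as-Latin-1 case is reproducible in two lines: every accented character becomes a two-character sequence starting with Ã, because UTF-8 encodes U+0080 through U+07FF with a lead byte in the 0xC2–0xDF range.

```python
good = 'Héllo'
bad = good.encode('utf-8').decode('iso-8859-1')   # wrong decoder
print(bad)   # HÃ©llo -- é (0xC3 0xA9 in UTF-8) read as two Latin-1 characters
```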

Detecting mojibake in text you have already decoded:

import unicodedata

def has_likely_mojibake(text: str) -> bool:
    """
    Heuristic: high proportion of replacement characters or
    unexpected combining sequences suggests mojibake.
    """
    if not text:
        return False

    replacement_count = text.count('\ufffd')
    if replacement_count / len(text) > 0.05:
        return True  # > 5% replacement characters

    # Check for common UTF-8-as-Latin-1 mojibake patterns: the UTF-8
    # bytes of é, è, à, â, î, ô, û, ü decoded as Latin-1 all begin with Ã
    # (the à pattern ends in a non-breaking space, written \xa0 here)
    mojibake_patterns = ['Ã©', 'Ã¨', 'Ã\xa0', 'Ã¢', 'Ã®', 'Ã´', 'Ã»', 'Ã¼']
    if any(pattern in text for pattern in mojibake_patterns):
        return True

    return False

def fix_utf8_as_latin1(text: str) -> str:
    """
    Fix text that was UTF-8 decoded as Latin-1 (common mojibake).
    'HÃ©llo' → 'Héllo'
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        return text  # not this type of mojibake

CSV and Spreadsheet Files

Excel is the most common source of encoding surprises. Its CSV export behavior:

  • Windows Excel: UTF-8 with BOM, or regional encoding (e.g., Windows-1252 for Western European)
  • Mac Excel: UTF-8 (newer versions), MacRoman (older versions)
  • LibreOffice: usually UTF-8 with BOM

import csv
from charset_normalizer import from_path

def read_csv_unknown_encoding(filepath: str) -> list[dict]:
    """
    Read a CSV file with automatic encoding detection.
    """
    # Detect encoding first
    result = from_path(filepath).best()
    detected_encoding = str(result.encoding) if result else 'utf-8-sig'

    with open(filepath, 'r', encoding=detected_encoding, errors='replace') as f:
        reader = csv.DictReader(f)
        return list(reader)

HTTP and HTML Encoding Detection

The correct algorithm for HTML (as specified in the HTML standard):

import re
import chardet

def determine_html_encoding(
    content_type_header: str | None,
    html_bytes: bytes,
) -> str:
    # 1. BOM
    if html_bytes.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if html_bytes.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    if html_bytes.startswith(b'\xff\xfe'):
        return 'utf-16-le'

    # 2. HTTP Content-Type header
    if content_type_header:
        match = re.search(r'charset=([^\s;]+)', content_type_header, re.IGNORECASE)
        if match:
            return match.group(1)

    # 3. HTML meta charset (prescan first 1024 bytes)
    head = html_bytes[:1024]
    match = re.search(rb'charset=["\']?([^"\'\s;>]+)', head, re.IGNORECASE)
    if match:
        declared = match.group(1).decode('ascii', errors='ignore')
        # Windows-1252 is the correct interpretation when latin-1 is declared
        if declared.lower() in ('iso-8859-1', 'latin-1', 'iso8859-1'):
            return 'windows-1252'
        return declared

    # 4. Statistical detection
    result = chardet.detect(html_bytes)
    if result['confidence'] > 0.8:
        return result['encoding']

    # 5. Default: windows-1252 for western web content (per HTML spec)
    return 'windows-1252'
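The latin-1-to-windows-1252 aliasing in step 3 matters in practice because real "Latin-1" web content routinely contains bytes in the 0x80–0x9F range, which only Windows-1252 maps to useful characters:

```python
data = b'smart \x93quotes\x94 and a dash \x97'

print(data.decode('windows-1252'))   # smart quotes and an em dash appear
print(data.decode('iso-8859-1'))     # the same bytes become C1 control characters
```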

Use the SymbolFYI Encoding Converter to convert between encodings and visualize byte sequences, and the Character Counter to identify replacement characters and unexpected code points in text that may have been decoded with the wrong encoding.


Next in Series: Unicode Collation: How to Sort Text Correctly Across Languages — why lexicographic byte order produces wrong sort results for most languages, and how to use the Unicode Collation Algorithm in JavaScript, Python, and PostgreSQL.
