Encoding Detection

Techniques for detecting the character encoding of text files, including BOM sniffing, heuristics, and chardet libraries.

Encoding detection (also called charset detection) is the process of inferring the character encoding of a byte sequence when that information is not explicitly provided. It is needed when processing legacy files, scraped web content, email attachments, or any data source that lacks a reliable encoding declaration.

Why Encoding Detection Is Needed

In an ideal world, every text file declares its encoding. In practice:

  • Legacy files were created before UTF-8 became standard
  • HTTP responses sometimes carry an incorrect or missing charset parameter in the Content-Type header
  • User-uploaded files carry no metadata
  • Database exports from old systems may lack encoding information

Without detection, attempting to read a Windows-1252 file as UTF-8 produces garbled output (mojibake).
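The effect is easy to reproduce in a few lines of Python (a minimal demonstration; the sample string is arbitrary):

```python
# Simulate mojibake: UTF-8 bytes mistakenly decoded as Windows-1252.
text = "café"
raw = text.encode("utf-8")            # 'é' becomes the two bytes 0xC3 0xA9
garbled = raw.decode("windows-1252")  # each byte is read as its own character
print(garbled)  # cafÃ©
```

Each multi-byte UTF-8 sequence turns into two or more single-byte characters, which is the classic mojibake signature.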

Detection Approaches

BOM Sniffing

The Byte Order Mark (BOM) is a specific byte sequence at the start of a file that identifies both the encoding and byte order:

  BOM bytes      Encoding
  EF BB BF       UTF-8
  FF FE          UTF-16 LE
  FE FF          UTF-16 BE
  FF FE 00 00    UTF-32 LE
  00 00 FE FF    UTF-32 BE

BOM sniffing is reliable when a BOM is present, but many files (especially UTF-8) omit it.

Statistical Analysis

Libraries like chardet and charset-normalizer analyze byte frequency patterns and multi-byte sequences to probabilistically identify the encoding:

import chardet

with open('mystery_file.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)
print(result)
# {'encoding': 'EUC-JP', 'confidence': 0.99, 'language': 'Japanese'}

# Decode using the detected encoding; detect() returns
# {'encoding': None, ...} when it cannot make a guess
if result['encoding'] is not None:
    text = raw.decode(result['encoding'])

charset-normalizer (Modern Alternative)

charset-normalizer is a faster, actively maintained pure-Python alternative to chardet, and the detector used by the requests library:

from charset_normalizer import from_bytes

with open('mystery_file.txt', 'rb') as f:
    raw = f.read()

result = from_bytes(raw).best()  # best() returns None if nothing matched
if result:
    print(result.encoding)     # e.g. 'utf_8' (a Python codec name)
    print(str(result))         # decoded text

ICU (International Components for Unicode)

ICU provides CharsetDetector, a high-quality detection engine used in browsers, mobile operating systems, and Java applications. It is accessible from Python via PyICU.

HTML and XML Meta Declarations

For HTML and XML, encoding information may appear in the content itself:

<!-- HTML5 -->
<meta charset="UTF-8">

<!-- HTML4 -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<?xml version="1.0" encoding="windows-1252"?>

Parsers like BeautifulSoup use a tiered detection strategy: HTTP header → BOM → meta tag → statistical analysis.

Limitations

Encoding detection is fundamentally heuristic — it can be wrong, especially for:

  • Short texts (insufficient statistical data)
  • ASCII-only content (valid in any ASCII-compatible encoding)
  • Ambiguous byte sequences (every byte sequence is valid Latin-1, so single-byte encodings can never be ruled out)

Best practice: Always declare encoding explicitly. Use UTF-8 for all new files and systems. Reserve detection for legacy data ingestion pipelines.
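In Python, following that advice means passing the encoding explicitly whenever a file is opened, rather than relying on the platform-dependent locale default (the filename here is hypothetical):

```python
# Write and read with an explicit encoding; locale defaults differ
# across platforms, so never omit the encoding argument for text I/O.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("résumé / 履歴書")

with open("notes.txt", "r", encoding="utf-8") as f:
    assert f.read() == "résumé / 履歴書"
```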
