SymbolFYI

Byte Order Mark (BOM)

Encoding
Definition

A special Unicode character (U+FEFF) at the start of a file indicating its byte order and encoding format.

The Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the very beginning of a text stream or file to signal the encoding and byte order to readers. It is most commonly associated with UTF-16 and UTF-32, where byte order is ambiguous, but also appears in UTF-8 files -- particularly those generated by Windows tools.

Purpose and Function

The BOM serves two roles:

  1. Byte order detection: In UTF-16 and UTF-32, a 16-bit or 32-bit integer can be stored big-endian or little-endian. By reading the first bytes and comparing them to the known BOM patterns, a decoder can determine which byte order is in use.
  2. Encoding signature: A decoder that encounters one of the known BOM byte patterns at the start of a stream can identify the encoding, even without external metadata.
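
To make the byte-order ambiguity concrete, here is a minimal sketch that serializes one character in both UTF-16 byte orders using Python's built-in codecs. The sample character is an arbitrary choice for illustration.

```python
# The same code point serialized in both UTF-16 byte orders.
ch = "\u20ac"  # EURO SIGN, an arbitrary sample character

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

print(big.hex(" "))     # 20 ac
print(little.hex(" "))  # ac 20

# Decoding with the wrong byte order yields a different character, not an error.
wrong = big.decode("utf-16-le")
print(hex(ord(wrong)))  # 0xac20

# The generic "utf-16" codec reads the BOM to choose the byte order, then drops it.
with_bom = b"\xfe\xff" + big
print(with_bom.decode("utf-16") == ch)  # True
```

Because the wrong guess decodes silently to a different character, a reader without a BOM or external metadata has no reliable way to notice the mistake.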

BOM Byte Sequences

Encoding      BOM bytes (hex)
UTF-8         EF BB BF
UTF-16 BE     FE FF
UTF-16 LE     FF FE
UTF-32 BE     00 00 FE FF
UTF-32 LE     FF FE 00 00

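A reader that sees FE FF knows the stream is UTF-16 big-endian. A reader that sees FF FE can assume UTF-16 little-endian only if the next two bytes are not 00 00, because FF FE 00 00 is the UTF-32 little-endian BOM. A detector should therefore test the four-byte patterns before the two-byte ones.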
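
The byte sequences above can be turned into a small detection routine. The sketch below is one possible approach, not a definitive implementation: `detect_bom` and `_BOMS` are hypothetical names introduced here, while the `codecs.BOM_*` constants come from Python's standard library.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


print(detect_bom(b"\xff\xfeH\x00i\x00"))                # ('utf-16-le', 2)
print(detect_bom(b"\xff\xfe\x00\x00H\x00\x00\x00"))    # ('utf-32-le', 4)
print(detect_bom(b"Hello"))                             # (None, 0)
```

One caveat of this ordering: a UTF-16 LE stream whose first character after the BOM is U+0000 would be reported as UTF-32 LE. That case is rare in text files, but the result is best treated as a strong hint rather than proof.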

The UTF-8 BOM Controversy

In UTF-8, byte order is irrelevant: the code unit is a single byte, so each character has exactly one byte sequence. The UTF-8 BOM (EF BB BF) therefore serves only as an encoding signature, not a byte order indicator. Its use in UTF-8 files is optional and contentious:

  • Windows tools (Notepad, many Microsoft editors) historically added and expected the UTF-8 BOM
  • Many Unix/Linux tools do not recognize the BOM and treat it as three extra bytes of content at the start of the file, which can break shebang lines in scripts, CSV parsers, HTTP responses, and other formats sensitive to leading content
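  • The Unicode Standard neither requires nor recommends a BOM for UTF-8; on the web, browsers recognize a UTF-8 BOM, but the usual way to declare the encoding is the charset parameter in the HTTP Content-Type header or a <meta charset> tag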
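
As an illustration of how a stray BOM can break a parser, the following sketch feeds a BOM-prefixed CSV payload to Python's `csv` module. The payload and column names are made up for the example.

```python
import csv
import io

# A CSV payload as a Windows tool might save it: UTF-8 with a leading BOM.
raw = b"\xef\xbb\xbfname,symbol\nEuro,\xe2\x82\xac\n"

# Decoded as plain UTF-8, the BOM survives as U+FEFF glued to the first header.
rows_plain = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
print(list(rows_plain[0].keys()))  # ['\ufeffname', 'symbol']
print("name" in rows_plain[0])     # False

# Decoded as utf-8-sig, the BOM is dropped and the header matches.
rows_sig = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(rows_sig[0]["name"])         # Euro
```

The failure is easy to miss because U+FEFF is invisible in most editors and terminals; the lookup for "name" simply fails.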

Detecting and Removing the BOM

# Writing with a BOM: the utf-8-sig codec prepends EF BB BF to the output
with open('/tmp/bom_test.txt', 'w', encoding='utf-8-sig') as f:
    f.write('Hello')

# Reading: utf-8-sig strips the BOM automatically
with open('/tmp/bom_test.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())  # 'Hello' (no BOM visible)

# Reading raw: utf-8 preserves BOM as U+FEFF
with open('/tmp/bom_test.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    print(repr(text[:1]))  # '\ufeff'

# Stripping BOM manually
if text.startswith('\ufeff'):
    text = text[1:]

// Node.js: check for UTF-8 BOM in a Buffer
const fs = require('fs');
const buf = fs.readFileSync('/tmp/bom_test.txt');
const hasBom = buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF;
console.log(hasBom); // true

// Strip BOM from string
function stripBom(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

Best Practices

For UTF-8 files: avoid the BOM unless you must produce files for legacy Windows tools that require it. If you receive files that may or may not have a BOM, decode with utf-8-sig in Python or strip a leading U+FEFF defensively. For UTF-16 and UTF-32 files that carry no external encoding label, include a BOM, as it is the most reliable way for readers to determine byte order. When the encoding is declared explicitly as UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE, the byte order is already known and the BOM should be omitted.
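
These recommendations map onto Python's built-in codecs roughly as follows. This is a minimal sketch using only standard codec names; the sample strings are arbitrary.

```python
# Python's generic "utf-16" codec writes a BOM; the explicit "-le"/"-be"
# variants write the data only.
text = "Hi"

generic = text.encode("utf-16")      # BOM + data in the platform's native byte order
explicit = text.encode("utf-16-le")  # data only, no BOM

print(generic[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True
print(explicit.hex(" "))                           # 48 00 69 00

# A reader given BOM-prefixed data needs no out-of-band byte-order information.
print(generic.decode("utf-16"))  # Hi

# For UTF-8 input that may or may not carry a BOM, utf-8-sig accepts both.
for payload in (b"\xef\xbb\xbfHi", b"Hi"):
    print(payload.decode("utf-8-sig"))  # Hi (both times)
```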
