SymbolFYI

Byte Order Mark (BOM)

Encoding
Definition

A special Unicode character (U+FEFF) at the start of a file indicating its byte order and encoding format.

The Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the very beginning of a text stream or file to signal the encoding and byte order to readers. It is most commonly associated with UTF-16 and UTF-32, where byte order is ambiguous, but also appears in UTF-8 files -- particularly those generated by Windows tools.

Purpose and Function

The BOM serves two roles:

  1. Byte order detection: In UTF-16 and UTF-32, each 16-bit or 32-bit code unit can be stored big-endian or little-endian. By comparing the first bytes of a stream to the known BOM patterns, a decoder can determine which byte order is in use. The byte-swapped value U+FFFE is a noncharacter, so a BOM read in the wrong order cannot be mistaken for valid text.
  2. Encoding signature: A decoder that encounters certain BOM byte patterns can identify the encoding unambiguously, even without external metadata.
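
As a small illustration of both roles, the following Python sketch serializes the same character in each UTF-16 byte order, prefixes the matching BOM, and lets Python's generic utf-16 codec pick the byte order from that BOM. The sample character and the variable names are arbitrary choices for the example.

```python
import codecs

text = "A"  # U+0041, an arbitrary sample character

# The same 16-bit code unit serialized in each byte order.
big = text.encode("utf-16-be")     # bytes 00 41
little = text.encode("utf-16-le")  # bytes 41 00

# Prefix each stream with the BOM for its byte order.
big_stream = codecs.BOM_UTF16_BE + big
little_stream = codecs.BOM_UTF16_LE + little

# The generic "utf-16" codec reads the BOM, selects the byte order, and drops it.
decoded_big = big_stream.decode("utf-16")
decoded_little = little_stream.decode("utf-16")
print(decoded_big, decoded_little)
```

Both streams decode to the same text even though their payload bytes are in opposite order, because the decoder takes the order from the BOM.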

BOM Byte Sequences

Encoding     BOM Bytes (hex)
UTF-8        EF BB BF
UTF-16 BE    FE FF
UTF-16 LE    FF FE
UTF-32 BE    00 00 FE FF
UTF-32 LE    FF FE 00 00

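A reader that sees FE FF at the start of a stream knows it is UTF-16 big-endian. A reader that sees FF FE should also check the next two bytes: FF FE 00 00 is the UTF-32 little-endian BOM, while FF FE followed by other bytes indicates UTF-16 little-endian.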
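
The byte patterns above can be turned into a simple sniffing routine. The sketch below is one possible approach, not a standard API: `detect_bom` and `BOM_TABLE` are hypothetical names, while the `codecs.BOM_*` constants come from Python's standard library. The UTF-32 signatures are checked first because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# "Hi" in UTF-16 LE with a BOM in front.
encoding, skip = detect_bom(b"\xff\xfeH\x00i\x00")
print(encoding, skip)
```

One limitation of this kind of sniffing: a UTF-16 LE stream whose first character after the BOM is U+0000 produces the bytes FF FE 00 00 and would be reported as UTF-32 LE. Such input is rare in practice, but the sketch does not try to resolve it.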

The UTF-8 BOM Controversy

In UTF-8, byte order is irrelevant -- the code units are single bytes, so there is only one possible serialization. The UTF-8 BOM (EF BB BF) serves only as an encoding signature, not a byte order indicator. Its use in UTF-8 files is optional and contentious:

  • Windows tools (Notepad, many Microsoft editors) historically added and expected the UTF-8 BOM
  • Unix/Linux tools treat the BOM as three extra bytes at the start of the file, which can break scripts, CSV parsers, HTTP responses, and other formats sensitive to leading content
  • On the web, a UTF-8 BOM is unnecessary; the charset declaration in HTTP headers or a <meta> tag is the usual mechanism, and the Unicode Standard neither requires nor recommends a BOM for UTF-8
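
To make the CSV problem concrete, here is a minimal Python sketch, assuming a small made-up CSV payload (`name,qty`). It parses the same BOM-prefixed bytes twice: once decoded as plain utf-8, where the BOM leaks into the first column name, and once decoded as utf-8-sig, where it is removed.

```python
import csv
import io

# Hypothetical CSV payload, encoded with a leading UTF-8 BOM.
raw = "name,qty\nwidget,3\n".encode("utf-8-sig")

# Decoding as plain UTF-8 keeps the BOM as U+FEFF in the first header cell.
naive_rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
naive_keys = list(naive_rows[0].keys())

# Decoding as utf-8-sig removes the BOM before the parser sees the text.
aware_rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
aware_keys = list(aware_rows[0].keys())

print(naive_keys)
print(aware_keys)
```

In the first case a lookup such as `row["name"]` fails with a `KeyError`, because the actual key is `"\ufeffname"`. The two keys look identical when printed to most terminals, which is why this bug is easy to miss.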

Detecting and Removing the BOM

# Python: write a file with a UTF-8 BOM (the utf-8-sig codec adds it on write)
with open('/tmp/bom_test.txt', 'w', encoding='utf-8-sig') as f:
    f.write('Hello')

# Reading: utf-8-sig strips the BOM automatically
with open('/tmp/bom_test.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())  # 'Hello' (no BOM visible)

# Reading raw: utf-8 preserves BOM as U+FEFF
with open('/tmp/bom_test.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    print(repr(text[:1]))  # '\ufeff'

# Stripping BOM manually
if text.startswith('\ufeff'):
    text = text[1:]

// Node.js: check for UTF-8 BOM in a Buffer
const fs = require('fs');
const buf = fs.readFileSync('/tmp/bom_test.txt');
const hasBom = buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF;
console.log(hasBom); // true

// Strip BOM from string
function stripBom(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

Best Practices

For UTF-8 files: avoid the BOM unless you must produce files for legacy Windows tools that require it. If you receive files that may or may not have a BOM, use encoding utf-8-sig in Python or strip U+FEFF defensively. For UTF-16 and UTF-32, always include a BOM, as it is the most reliable way for readers to determine byte order without relying on external metadata.
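
These recommendations can be sketched in Python as follows. The helper name `read_utf8` and the temporary file names are hypothetical. The example relies on two standard-library behaviors: the utf-8-sig codec accepts input with or without a BOM, and the generic utf-16 codec writes a BOM when encoding.

```python
import codecs
import os
import tempfile


def read_utf8(path):
    """Read UTF-8 text; utf-8-sig drops a leading BOM if one is present."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.txt")
    without_bom = os.path.join(tmp, "without_bom.txt")
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"Hello")
    with open(without_bom, "wb") as f:
        f.write(b"Hello")
    # Both files yield the same text, whether or not a BOM was present.
    results = (read_utf8(with_bom), read_utf8(without_bom))

# The generic "utf-16" codec prepends a BOM in the platform's native byte order.
utf16_bytes = "Hello".encode("utf-16")
starts_with_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(results, starts_with_bom)
```

Note that the explicit `utf-16-le` and `utf-16-be` codecs do not write a BOM, so code that must emit one should use the generic `utf-16` codec or prepend the BOM itself.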
