The Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the very beginning of a text stream or file to signal the encoding and byte order to readers. It is most commonly associated with UTF-16 and UTF-32, where byte order is ambiguous, but also appears in UTF-8 files -- particularly those generated by Windows tools.
Purpose and Function
The BOM serves two roles:
- Byte order detection: In UTF-16 and UTF-32, a 16-bit or 32-bit integer can be stored big-endian or little-endian. By reading the first bytes and comparing them to the known BOM patterns, a decoder can determine which byte order is in use.
- Encoding signature: A decoder that encounters certain BOM byte patterns can identify the encoding unambiguously, even without external metadata.
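To make the byte-order role concrete, here is a minimal sketch using Python's built-in codecs. It shows how U+FEFF and an ordinary character are laid out in each byte order. The helper name `hex_bytes` is purely illustrative and not part of any library.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'FE FF'."""
    return ' '.join(f'{b:02X}' for b in data)

# U+FEFF serialized in each byte order yields the two UTF-16 BOM patterns.
bom_be = hex_bytes('\ufeff'.encode('utf-16-be'))
bom_le = hex_bytes('\ufeff'.encode('utf-16-le'))
print(bom_be)  # FE FF
print(bom_le)  # FF FE

# Ordinary text is affected the same way: 'A' is U+0041.
a_be = hex_bytes('A'.encode('utf-16-be'))
a_le = hex_bytes('A'.encode('utf-16-le'))
print(a_be)  # 00 41
print(a_le)  # 41 00

# The generic 'utf-16' decoder reads the BOM, picks the byte order,
# and removes the BOM from the decoded text.
decoded_be = b'\xfe\xff\x00A'.decode('utf-16')
decoded_le = b'\xff\xfeA\x00'.decode('utf-16')
print(decoded_be, decoded_le)  # A A
```

The byte-swapped value U+FFFE is a noncharacter, so a decoder that reads FFFE where it expects FEFF can infer that it is reading the bytes in the wrong order.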
BOM Byte Sequences
| Encoding | BOM Bytes (hex) |
|---|---|
| UTF-8 | EF BB BF |
| UTF-16 BE | FE FF |
| UTF-16 LE | FF FE |
| UTF-32 BE | 00 00 FE FF |
| UTF-32 LE | FF FE 00 00 |
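A reader that sees FE FF knows the stream is UTF-16 big-endian. FF FE indicates UTF-16 little-endian only if the next two bytes are not 00 00; otherwise the four bytes match the UTF-32 little-endian BOM. A detector should therefore test the 4-byte UTF-32 patterns before the 2-byte UTF-16 ones.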
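The table lookup can be sketched as a small function. This is one possible approach, not a definitive detector: the names `detect_bom` and `_BOMS` are hypothetical, while the `codecs.BOM_*` constants come from Python's standard library.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the 4-byte patterns are tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(detect_bom(b'\xff\xfe\x00\x00'))   # ('utf-32-le', 4)
print(detect_bom(b'\xff\xfeA\x00'))      # ('utf-16-le', 2)
print(detect_bom(b'plain ascii'))        # (None, 0)
```

Note that FF FE 00 00 could in principle also be UTF-16 LE text that starts with U+0000, so a BOM-based guess is a strong hint rather than proof.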
The UTF-8 BOM Controversy
In UTF-8, byte order is irrelevant: the encoding is defined as a sequence of single bytes, so there is no multi-byte unit whose bytes could be stored in a different order. The UTF-8 BOM (EF BB BF) serves only as an encoding signature, not a byte order indicator. Its use in UTF-8 files is optional and contentious:
- Windows tools (Notepad, many Microsoft editors) historically added and expected the UTF-8 BOM
- Many Unix/Linux tools do not recognize the BOM and treat it as three extra bytes of content at the start of the file, which can break scripts (the interpreter line no longer starts at the first byte), CSV parsers, HTTP responses, and other formats sensitive to leading content
- HTML and web standards discourage the UTF-8 BOM; the charset declaration in HTTP headers or <meta> tags is the correct mechanism
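The CSV breakage mentioned above is easy to reproduce. The following sketch assumes a small in-memory CSV payload (the sample data and the helper name `first_row` are made up for illustration) and compares decoding with `utf-8` against `utf-8-sig`.

```python
import csv
import io

# UTF-8 BOM followed by a two-column CSV file (sample data for illustration).
raw = b'\xef\xbb\xbfname,age\nAda,36\n'

def first_row(encoding: str) -> dict:
    """Decode the payload with the given codec and parse the first CSV record."""
    text = raw.decode(encoding)
    return next(csv.DictReader(io.StringIO(text)))

plain = first_row('utf-8')       # BOM survives as U+FEFF in the first header
signed = first_row('utf-8-sig')  # BOM is removed during decoding

print(list(plain))    # ['\ufeffname', 'age']
print(list(signed))   # ['name', 'age']
print('name' in plain)  # False: the key is '\ufeffname'
```

Because U+FEFF is invisible in most editors and terminals, a failed lookup like this is often hard to diagnose by looking at the file alone.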
Detecting and Removing the BOM
In Python, the utf-8-sig codec writes the BOM on output and strips it on input:

    # Writing with BOM: utf-8-sig prepends EF BB BF
    with open('/tmp/bom_test.txt', 'w', encoding='utf-8-sig') as f:
        f.write('Hello')

    # Reading: utf-8-sig strips the BOM automatically
    with open('/tmp/bom_test.txt', 'r', encoding='utf-8-sig') as f:
        print(f.read())  # 'Hello' (no BOM visible)

    # Reading raw: plain utf-8 preserves the BOM as U+FEFF
    with open('/tmp/bom_test.txt', 'r', encoding='utf-8') as f:
        text = f.read()
    print(repr(text[:1]))  # '\ufeff'

    # Stripping the BOM manually
    if text.startswith('\ufeff'):
        text = text[1:]
The equivalent checks in Node.js:

    // Node.js: check for UTF-8 BOM in a Buffer
    const fs = require('fs');
    const buf = fs.readFileSync('/tmp/bom_test.txt');
    const hasBom = buf.length >= 3 &&
      buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF;
    console.log(hasBom); // true for the file written with a BOM above

    // Strip BOM from an already-decoded string
    function stripBom(str) {
      return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
    }
Best Practices
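For UTF-8 files, avoid the BOM unless you must produce files for legacy Windows tools that require it. If you receive files that may or may not have a BOM, decode them with utf-8-sig in Python or strip a leading U+FEFF defensively. For UTF-16 and UTF-32, include a BOM whenever the byte order is not declared elsewhere, as it is the most reliable way for readers to determine byte order without relying on external metadata. The exception is data explicitly labeled UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE: the label already fixes the byte order, and the Unicode Standard says a BOM should not be used there.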
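These recommendations can be sketched in a few lines of Python. The function name `decode_utf8_lenient` is hypothetical; the codec names and `codecs.BOM_*` constants are from the standard library.

```python
import codecs

def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if it is present."""
    return data.decode('utf-8-sig')

# utf-8-sig accepts input with or without a BOM.
with_bom = decode_utf8_lenient(codecs.BOM_UTF8 + b'hello')
without_bom = decode_utf8_lenient(b'hello')
print(with_bom == without_bom)  # True

# The generic 'utf-16' codec writes a BOM in the platform's native byte order;
# the explicit 'utf-16-le' codec writes no BOM at all.
utf16_with_bom = 'hi'.encode('utf-16')
utf16_no_bom = 'hi'.encode('utf-16-le')
has_bom = utf16_with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)  # True
```

Choosing `utf-16` for output and `utf-8-sig` for input lets the codec handle the BOM, instead of hand-written byte checks.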