SymbolFYI

UTF-16

Encoding
Definition

A character encoding that uses 2 or 4 bytes per character. Used internally by JavaScript and Java.

UTF-16 is a variable-width Unicode encoding that uses 2 or 4 bytes to represent every Unicode character. It is the native string encoding in JavaScript, Java, C#, and the Windows API (Win32/WinRT). Understanding UTF-16 is essential for working correctly with emoji, historical scripts, and any characters outside the Basic Multilingual Plane.

Basic Structure

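UTF-16 encodes characters from the Basic Multilingual Plane (U+0000-U+FFFF) as a single 16-bit code unit (2 bytes). Characters above U+FFFF -- including most emoji and many rare scripts -- require two 16-bit code units called a surrogate pair: a high surrogate (U+D800-U+DBFF) followed by a low surrogate (U+DC00-U+DFFF). The range U+D800-U+DFFF is reserved for this purpose and is never assigned to characters, which is why it does not appear in the table below.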

Code Point Range       Encoding            Bytes
U+0000 - U+D7FF        Single code unit    2
U+E000 - U+FFFF        Single code unit    2
U+10000 - U+10FFFF     Surrogate pair      4
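
The arithmetic behind a surrogate pair can be sketched in a few lines of Python. This is an illustrative sketch rather than a library API: the helper names to_surrogate_pair and from_surrogate_pair are made up for this example. The idea is to subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```

The result matches the 55357 (0xD83D) and 56832 (0xDE00) values that charCodeAt returns for the same emoji in JavaScript.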

Byte Order

Because UTF-16 uses 2-byte units, the byte order of the system matters. UTF-16 comes in two variants:

  • UTF-16 BE (Big-Endian): most significant byte first
  • UTF-16 LE (Little-Endian): least significant byte first

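A Byte Order Mark (U+FEFF) at the start of a stream indicates which variant is in use: the bytes FE FF signal big-endian, and FF FE signal little-endian. When data is labeled simply "UTF-16" and has neither a BOM nor a higher-level protocol specifying the order, the Unicode Standard says to interpret it as big-endian. In practice, many tools, particularly on Windows, assume little-endian instead, so an explicit BOM or an explicit UTF-16 LE / UTF-16 BE label is the safer choice.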
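
As a rough illustration, the Python sketch below shows the same character in both byte orders and checks a byte string for a leading BOM. The function name sniff_utf16_byte_order is hypothetical. codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are constants from the standard library.

```python
import codecs


def sniff_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"


text = "A"  # U+0041
print(text.encode("utf-16-be").hex())  # 0041
print(text.encode("utf-16-le").hex())  # 4100

with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(sniff_utf16_byte_order(with_bom))                   # little
print(sniff_utf16_byte_order(text.encode("utf-16-be")))   # unknown
```

This check is deliberately minimal: the bytes FF FE 00 00 are also the UTF-32 LE BOM, so a more careful detector would test for UTF-32 first.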

JavaScript Strings Are UTF-16

All JavaScript strings are sequences of UTF-16 code units. This means .length counts code units, not characters:

const emoji = '\uD83D\uDE00'; // U+1F600, encoded as surrogate pair
console.log(emoji.length);         // 2, not 1
console.log(emoji.codePointAt(0)); // 128512 (correct code point)
console.log(emoji.charCodeAt(0));  // 55357 (high surrogate)
console.log(emoji.charCodeAt(1));  // 56832 (low surrogate)

// Iterating correctly over code points
for (const char of emoji) {
  console.log(char); // logs the emoji once
}

// Spread also handles surrogates correctly
console.log([...emoji].length); // 1

Python and UTF-16

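text = '\U0001F600'  # U+1F600, outside the BMP
bytes_utf16 = text.encode('utf-16')  # generic codec: prepends a BOM, uses native byte order
print(bytes_utf16)  # b'\xff\xfe=\xd8\x00\xde' on a little-endian machine

bytes_utf16le = text.encode('utf-16-le')  # explicit byte order, no BOM
print(len(bytes_utf16le))  # 4 bytes: one surrogate pair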
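
Decoding has the mirror-image behavior, which the following sketch tries to show. The payload here is made up for illustration. The generic 'utf-16' codec consumes a leading BOM and uses it to choose the byte order, whereas the explicit 'utf-16-le' codec treats those two bytes as an ordinary U+FEFF character.

```python
import codecs

# Hypothetical payload: a little-endian BOM followed by UTF-16 LE text.
payload = codecs.BOM_UTF16_LE + "hi \U0001F600".encode("utf-16-le")

with_generic = payload.decode("utf-16")      # BOM consumed, selects byte order
with_explicit = payload.decode("utf-16-le")  # BOM kept as a leading U+FEFF

print(len(with_generic))              # 4 code points: 'h', 'i', ' ', emoji
print(len(with_explicit))             # 5: the extra one is U+FEFF
print(with_explicit[0] == "\ufeff")   # True

# If the explicit codec is required, the stray U+FEFF can be stripped by hand.
cleaned = with_explicit.lstrip("\ufeff")
print(cleaned == with_generic)        # True
```

A stray U+FEFF at the start of a decoded string is usually a sign that a BOM-prefixed file was read with an explicit-endianness codec.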

When to Use UTF-16

UTF-16 is rarely chosen for new file formats or protocols today -- UTF-8 is preferred for web and storage. However, you encounter UTF-16 when working with:

  • Windows file paths and APIs: the Win32 W functions (CreateFileW, etc.) use UTF-16 LE
  • Java String and char: internally UTF-16; char holds one code unit
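  • JavaScript engine internals: V8 and SpiderMonkey expose strings as UTF-16 code units, although both can store strings that contain only Latin-1 characters in a compact one-byte form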
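  • Microsoft Office formats: the XML parts inside .docx may be encoded as UTF-8 or UTF-16 (UTF-8 is typical), and the legacy binary formats such as .doc store much of their text as UTF-16 LE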

Common Pitfall: Length Counting

The most frequent UTF-16 bug is treating .length as a character count. For any application that handles emoji or non-BMP text, use [...str].length or Array.from(str).length in JavaScript, or iterate with for...of, to correctly count Unicode code points rather than code units.
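
The same mismatch can surface on the other side of an API. Python's len() counts code points, so a Python service validating a length limit will disagree with a JavaScript client's .length on non-BMP text. The sketch below assumes you want the JavaScript-style count from Python. The helper name utf16_length is hypothetical.

```python
def utf16_length(s: str) -> int:
    """Count UTF-16 code units, i.e. what JavaScript's .length would report."""
    # Code points above U+FFFF need a surrogate pair (2 units); all others need 1.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


s = "a\U0001F600"       # 'a' followed by U+1F600
print(len(s))           # 2 (code points)
print(utf16_length(s))  # 3 (code units: 1 for 'a', 2 for the emoji)
```

Neither number is a count of user-perceived characters. Sequences such as flags or emoji with skin-tone modifiers span several code points, so both counts exceed one for what looks like a single symbol.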
