UTF-16 is a variable-width Unicode encoding that uses either 2 or 4 bytes (one or two 16-bit code units) to represent each Unicode code point. It is the native string encoding in JavaScript, Java, C#, and the Windows API (Win32/WinRT). Understanding UTF-16 is essential for working correctly with emoji, historical scripts, and any other characters outside the Basic Multilingual Plane.
Basic Structure
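UTF-16 encodes code points in the Basic Multilingual Plane (U+0000-U+FFFF, excluding the surrogate range U+D800-U+DFFF) as a single 16-bit code unit (2 bytes). Code points above U+FFFF -- including most emoji and many rare scripts -- require two 16-bit code units, called a surrogate pair. The range U+D800-U+DFFF is reserved for these surrogates, which is why it is absent from the table below.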
| Code Point Range | Encoding | Bytes |
|---|---|---|
| U+0000 - U+D7FF | Single code unit | 2 |
| U+E000 - U+FFFF | Single code unit | 2 |
| U+10000 - U+10FFFF | Surrogate pair | 4 |
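The arithmetic behind the surrogate-pair row can be sketched in Python. This is a minimal illustration rather than a production encoder; the function names `to_utf16_code_units` and `from_surrogate_pair` are hypothetical and exist only for this example.

```python
def to_utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded on their own")
    if code_point <= 0xFFFF:
        return [code_point]            # BMP: a single 16-bit code unit
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


units = to_utf16_code_units(0x1F600)
print([hex(u) for u in units])            # ['0xd83d', '0xde00']
print(hex(from_surrogate_pair(*units)))   # 0x1f600
```

The two surrogate values, 0xD83D and 0xDE00, are the same code units (55357 and 56832 in decimal) that JavaScript reports through charCodeAt for this emoji.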
Byte Order
Because UTF-16 uses 2-byte code units, byte order matters whenever the text is serialized to bytes. UTF-16 comes in two variants:
- UTF-16 BE (Big-Endian): most significant byte first
- UTF-16 LE (Little-Endian): least significant byte first
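A Byte Order Mark (U+FEFF) at the start of a stream indicates which variant is in use: the bytes FE FF signal big-endian and FF FE signal little-endian. Without a BOM, UTF-16 BE is the assumed default per the Unicode standard, although in practice much BOM-less UTF-16 data, particularly from Windows, is little-endian.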
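To make the two byte orders concrete, the following Python sketch prints the same text in both variants and inspects a leading BOM. The helper `detect_utf16_byte_order` is a hypothetical name for this example, and its fallback to big-endian is an assumption that simply follows the default described above.

```python
import codecs

text = "A\U0001F600"

be = text.encode("utf-16-be")
le = text.encode("utf-16-le")
print(be.hex(" "))  # 00 41 d8 3d de 00
print(le.hex(" "))  # 41 00 3d d8 00 de


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'be' or 'le' from a leading BOM; fall back to 'be' if none."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "le"
    return "be"  # assumption: no BOM means big-endian


print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + le))  # le
```

Each 16-bit code unit keeps its value in both outputs; only the order of the two bytes inside each unit is swapped.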
JavaScript Strings Are UTF-16
All JavaScript strings are sequences of UTF-16 code units. This means .length counts code units, not characters:
const emoji = '\uD83D\uDE00'; // U+1F600, encoded as surrogate pair
console.log(emoji.length); // 2, not 1
console.log(emoji.codePointAt(0)); // 128512 (correct code point)
console.log(emoji.charCodeAt(0)); // 55357 (high surrogate)
console.log(emoji.charCodeAt(1)); // 56832 (low surrogate)
// Iterating correctly over code points
for (const char of emoji) {
  console.log(char); // logs the emoji once
}
// Spread also handles surrogates correctly
console.log([...emoji].length); // 1
Python and UTF-16
text = '\U0001F600' # emoji
bytes_utf16 = text.encode('utf-16')
print(bytes_utf16) # includes BOM, then data in native byte order (b'\xff\xfe=\xd8\x00\xde' on little-endian machines)
bytes_utf16le = text.encode('utf-16-le') # no BOM
print(len(bytes_utf16le)) # 4 bytes for one emoji
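One practical detail of these codecs is how they treat the BOM when decoding. The sketch below, which builds its input explicitly so the result does not depend on the machine's byte order, suggests that the generic utf-16 codec consumes the BOM while the explicit utf-16-le codec leaves it in the string as U+FEFF. The variable names are illustrative only.

```python
import codecs

text = "\U0001F600"

# BOM-prefixed little-endian data, built by hand for portability.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

decoded_generic = data.decode("utf-16")      # BOM selects the byte order and is dropped
decoded_explicit = data.decode("utf-16-le")  # BOM is decoded as an ordinary U+FEFF

print(decoded_generic == text)               # True
print(decoded_explicit == "\ufeff" + text)   # True
```

If the byte order is already known and the data may still carry a BOM, stripping a leading U+FEFF after decoding is one way to handle it.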
When to Use UTF-16
UTF-16 is rarely chosen for new file formats or protocols today -- UTF-8 is preferred for web and storage. However, you encounter UTF-16 when working with:
- Windows file paths and APIs: the Win32 W functions (CreateFileW, etc.) use UTF-16 LE
- Java String and char: UTF-16 at the API level; char holds one code unit
- JavaScript engine internals: V8 and SpiderMonkey store strings as UTF-16, with a more compact one-byte representation for strings that allow it
- Microsoft Office formats: .docx XML content can be UTF-16 in some streams, although UTF-8 is more common
Common Pitfall: Length Counting
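The most frequent UTF-16 bug is treating .length as a character count. For any application that handles emoji or other non-BMP text, use [...str].length or Array.from(str).length in JavaScript, or iterate with for...of, to count Unicode code points rather than code units. Keep in mind that a code point is still not always a user-perceived character: sequences such as flags or emoji with skin-tone modifiers span several code points.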
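The same distinction can be reproduced outside JavaScript. Python's len counts code points, so the sketch below derives the UTF-16 code unit count separately; `utf16_code_unit_count` is a hypothetical helper, and it assumes well-formed input (a string containing a lone surrogate would raise UnicodeEncodeError).

```python
def utf16_code_unit_count(s: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length would report."""
    # Every code unit is exactly 2 bytes; utf-16-le adds no BOM.
    return len(s.encode("utf-16-le")) // 2


s = "hi\U0001F600"
print(len(s))                    # 3 (code points)
print(utf16_code_unit_count(s))  # 4 (code units)
```

A check like this can be handy when a Python service has to enforce a length limit that a JavaScript client measures with .length.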