SymbolFYI

UTF-16

Encoding
Definition

A variable-width Unicode encoding that uses 2 or 4 bytes per code point. It is the string model exposed by JavaScript and Java.

UTF-16 is a variable-width Unicode encoding that uses 2 or 4 bytes to represent every Unicode character. It is the native string encoding in JavaScript, Java, C#, and the Windows API (Win32/WinRT). Understanding UTF-16 is essential for working correctly with emoji, historical scripts, and any characters outside the Basic Multilingual Plane.

Basic Structure

UTF-16 encodes characters from the Basic Multilingual Plane (U+0000-U+FFFF) as a single 16-bit code unit (2 bytes). Characters above U+FFFF -- including most emoji and many rare scripts -- require two 16-bit code units called a surrogate pair.

Code Point Range       Encoding                          Bytes
U+0000 - U+D7FF        Single code unit                  2
U+D800 - U+DFFF        Reserved for surrogates           n/a (not characters)
U+E000 - U+FFFF        Single code unit                  2
U+10000 - U+10FFFF     Surrogate pair                    4
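
The surrogate-pair row can be made concrete with a little arithmetic: subtract 0x10000 from the code point, split the remaining 20 bits into two 10-bit halves, and add them to 0xD800 and 0xDC00. The sketch below is illustrative only; the helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and do not come from any library.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000      # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse mapping: combine a high and a low surrogate into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                   # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))   # 0x1f600
```

The result for U+1F600 matches the `\uD83D\uDE00` escape used in the JavaScript example later in this entry.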

Byte Order

Because UTF-16 uses 2-byte units, the byte order of the system matters. UTF-16 comes in two variants:

  • UTF-16 BE (Big-Endian): most significant byte first
  • UTF-16 LE (Little-Endian): least significant byte first

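A Byte Order Mark (U+FEFF) at the start of a stream indicates which variant is in use: the bytes FE FF signal big-endian and FF FE signal little-endian. When there is no BOM and no higher-level protocol declares the order, the Unicode standard specifies big-endian for the UTF-16 encoding scheme, although much software in practice (Windows in particular) assumes little-endian.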
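
To see the two byte orders and the BOM side by side, here is a minimal sketch using Python's built-in codecs. The variable names are illustrative; the codec names `utf-16-be`, `utf-16-le`, and `utf-16` are standard in Python.

```python
text = "A"  # U+0041

big = text.encode("utf-16-be")     # explicit big-endian, no BOM written
little = text.encode("utf-16-le")  # explicit little-endian, no BOM written
print(big.hex())     # 0041
print(little.hex())  # 4100

# The BOM is simply U+FEFF encoded in the stream's own byte order.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")
print(bom_be.hex())  # feff
print(bom_le.hex())  # fffe

# The generic "utf-16" decoder reads the BOM, picks the byte order, and strips it.
print((bom_be + big).decode("utf-16"))     # A
print((bom_le + little).decode("utf-16"))  # A
```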

JavaScript Strings Are UTF-16

All JavaScript strings are sequences of UTF-16 code units. This means .length counts code units, not characters:

const emoji = '\uD83D\uDE00'; // U+1F600, encoded as surrogate pair
console.log(emoji.length);         // 2, not 1
console.log(emoji.codePointAt(0)); // 128512 (correct code point)
console.log(emoji.charCodeAt(0));  // 55357 (high surrogate)
console.log(emoji.charCodeAt(1));  // 56832 (low surrogate)

// Iterating correctly over code points
for (const char of emoji) {
  console.log(char); // logs the emoji once
}

// Spread also handles surrogates correctly
console.log([...emoji].length); // 1

Python and UTF-16

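A Python str is a sequence of code points, so UTF-16 only comes into play when text is encoded to bytes:

text = '\U0001F600'  # emoji (U+1F600)
bytes_utf16 = text.encode('utf-16')
print(bytes_utf16)       # BOM first, then the surrogate pair in native byte order
print(len(bytes_utf16))  # 6: 2-byte BOM + 4 bytes for the pair

bytes_utf16le = text.encode('utf-16-le')  # explicit byte order, no BOM
print(len(bytes_utf16le))  # 4 bytes for one emoji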

When to Use UTF-16

UTF-16 is rarely chosen for new file formats or protocols today -- UTF-8 is preferred for web and storage. However, you encounter UTF-16 when working with:

  • Windows file paths and APIs: the Win32 W functions (CreateFileW, etc.) use UTF-16 LE
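  • Java String and char: the API is defined in terms of UTF-16 code units, and char holds one code unit (since JDK 9 the JVM may store Latin-1-only strings more compactly)
  • JavaScript engines: strings are exposed as UTF-16 code units; V8 and SpiderMonkey may store Latin-1-only strings in a one-byte form internally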
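  • Microsoft Office formats: the legacy binary formats store much of their text as UTF-16 LE, and the XML parts inside .docx packages may be UTF-8 or UTF-16 (UTF-8 is typical)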

Common Pitfall: Length Counting

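The most frequent UTF-16 bug is treating .length as a character count. For any application that handles emoji or non-BMP text, use [...str].length or Array.from(str).length in JavaScript, or iterate with for...of, to count Unicode code points rather than code units. Keep in mind that a code point count is still not a count of user-perceived characters: an emoji with a skin-tone modifier, for example, is a single visible symbol made of two code points.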
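
The same mismatch shows up when Python code has to agree with a UTF-16 based system, for example when enforcing a length limit that a JavaScript front end measures with .length. The following is a sketch under that assumption; `utf16_length` and `truncate_utf16` are hypothetical helper names, not library functions.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, i.e. the value JavaScript's .length would report."""
    # Code points above U+FFFF need a surrogate pair (2 units); all others need 1.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def truncate_utf16(text: str, max_units: int) -> str:
    """Keep as many whole code points as fit in max_units UTF-16 code units."""
    used = 0
    kept = []
    for ch in text:
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > max_units:
            break  # stop before splitting a surrogate pair
        kept.append(ch)
        used += units
    return "".join(kept)


sample = "a\U0001F600b"
print(len(sample))            # 3 code points
print(utf16_length(sample))   # 4 code units
print(truncate_utf16(sample, 2) == "a")  # True: the emoji does not fit in the 1 unit left
```

Truncating by whole code points avoids emitting a lone surrogate, which is what a naive slice on code units can produce.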
