SymbolFYI

UTF-16 and Surrogate Pairs: Why JavaScript Strings Are Complicated

Open a JavaScript console and type "👍".length. The result is 2, not 1. This surprises most developers. The thumbs-up emoji is one character, but JavaScript reports it as two. A string containing only "𝕳" has length 2. The regex /^.$/.test("👍") returns false. These behaviors stem from a single architectural decision made in the 1990s: JavaScript strings are sequences of UTF-16 code units, not Unicode characters.

Understanding why this is true — and how to work around it — requires understanding UTF-16, surrogate pairs, and the historical context that made this a reasonable choice at the time.

The Unicode Size Problem

In the early 1990s, Unicode was designed with the assumption that 65,536 code points would be enough to encode every character in every writing system. This is the Basic Multilingual Plane (BMP): U+0000 to U+FFFF. A 16-bit code unit can represent any BMP character directly, making UTF-16 appealing: fixed-width, simple indexing, efficient for East Asian scripts.

Java adopted 16-bit chars in 1995. JavaScript, created in a 10-day sprint that same year, took the same approach. Windows NT used 16-bit "wide characters" (wchar_t). The assumption was baked deeply into these platforms.

Then Unicode grew. Unicode 2.0 (1996) extended the range to U+10FFFF — 1,114,112 code points — adding 16 "supplementary planes" beyond the BMP. Even Han unification, which merged overlapping Chinese, Japanese, and Korean ideographs into shared code points, couldn't keep everything within 16 bits; historic scripts, musical notation, mathematical symbols, and eventually emoji filled the supplementary planes.

UTF-16 needed a way to encode code points above U+FFFF without breaking all the existing code built on 16-bit strings. The solution was surrogate pairs.

How Surrogate Pairs Work

Unicode reserved a block of 2,048 code points in the BMP — U+D800 to U+DFFF — specifically for use as surrogates. These code points are never assigned to characters; in well-formed UTF-16 they appear only as part of a pair.

  • High surrogates: U+D800–U+DBFF (1,024 values)
  • Low surrogates: U+DC00–U+DFFF (1,024 values)

To encode a supplementary character (code point U in the range U+10000 to U+10FFFF):

  1. Subtract 0x10000 from U, giving a 20-bit value (range 0x00000 to 0xFFFFF)
  2. The high 10 bits become the high surrogate: add 0xD800
  3. The low 10 bits become the low surrogate: add 0xDC00

In Python:
def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Encode a supplementary code point as a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    u_prime = code_point - 0x10000
    high = 0xD800 + (u_prime >> 10)      # High 10 bits
    low  = 0xDC00 + (u_prime & 0x3FF)    # Low 10 bits
    return high, low

def decode_surrogate_pair(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair back to a code point."""
    assert 0xD800 <= high <= 0xDBFF
    assert 0xDC00 <= low  <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Example: Thumbs up emoji 👍 = U+1F44D
high, low = encode_surrogate_pair(0x1F44D)
print(f"High: U+{high:04X}")  # U+D83D
print(f"Low:  U+{low:04X}")   # U+DC4D
print(decode_surrogate_pair(high, low) == 0x1F44D)  # True

The pair U+D83D U+DC4D represents 👍. In JavaScript, these are two separate UTF-16 code units stored in the string, which is why .length returns 2.
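You can verify this directly in a JavaScript console; a quick sketch:

```javascript
// Writing the two code units as escapes yields the same string as the emoji literal
const pair = "\uD83D\uDC4D";
console.log(pair === "👍");        // true
console.log(pair.length);          // 2

// JavaScript can reassemble the pair into the original code point
console.log(pair.codePointAt(0).toString(16));  // "1f44d"
```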

Why JavaScript .length Lies

JavaScript's String.prototype.length returns the number of UTF-16 code units, not the number of Unicode characters (code points). For BMP characters, these are the same. For supplementary characters, each character contributes 2 to the length.

// BMP characters — length matches
"Hello".length;        // 5
"Héllo".length;        // 5 (é is U+00E9, BMP)
"日本語".length;        // 3 (all BMP)

// Supplementary characters — length doesn't match
"👍".length;           // 2 (U+1F44D, surrogate pair)
"𝕳".length;           // 2 (U+1D573, Mathematical Bold Fraktur capital H)
"🏳️‍🌈".length;        // 6 (rainbow flag: a four-code-point ZWJ sequence)

// The code units, exposed
"👍".charCodeAt(0).toString(16);  // "d83d" — high surrogate
"👍".charCodeAt(1).toString(16);  // "dc4d" — low surrogate

The .charCodeAt() method returns the UTF-16 code unit at a given position — not the code point. For supplementary characters, calling .charCodeAt(0) returns only the high surrogate, which is not a meaningful character.

codePointAt vs charCodeAt

ES2015 introduced String.prototype.codePointAt() which correctly handles surrogate pairs:

const thumbsUp = "👍";

// Old way — broken for supplementary characters
thumbsUp.charCodeAt(0);     // 55357 (0xD83D, high surrogate)
thumbsUp.charCodeAt(1);     // 56397 (0xDC4D, low surrogate)

// New way — returns the actual code point
thumbsUp.codePointAt(0);    // 128077 (0x1F44D, correct!)
thumbsUp.codePointAt(1);    // 56397 (still the low surrogate — quirk!)

// String.fromCodePoint vs String.fromCharCode
String.fromCodePoint(0x1F44D);   // "👍"
String.fromCharCode(0x1F44D);    // garbage (truncates to 0xF44D → wrong char)

The quirk in codePointAt(1) — returning the low surrogate when called at position 1 of a surrogate pair — is by specification. The index is still a UTF-16 code unit index.

Iterating Over Characters

The for...of loop and the spread operator in ES2015+ iterate over Unicode code points (not UTF-16 code units), correctly handling surrogate pairs:

// for...of correctly iterates code points
for (const char of "Hello 👍") {
    console.log(char);  // H, e, l, l, o, ' ', 👍  — 7 iterations
}

// Spread — same behavior
[..."Hello 👍"].length;   // 7

// Array.from — same
Array.from("Hello 👍").length;   // 7

// Old-style for loop — iterates code units
const s = "Hello 👍";
for (let i = 0; i < s.length; i++) {
    console.log(s[i]);  // H, e, l, l, o, ' ', '\uD83D', '\uDC4D' — 8 iterations!
}

For counting "characters" as a user perceives them, even code point counting isn't sufficient. Many characters are composed of multiple code points: é can be either U+00E9 (precomposed) or U+0065 U+0301 (e + combining acute accent). Flag emojis are sequences of regional indicator symbols. The rainbow flag 🏳️‍🌈 is four code points: white flag (U+1F3F3), variation selector-16 (U+FE0F), zero-width joiner (U+200D), and rainbow (U+1F308).
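The two forms of é can be compared and unified with String.prototype.normalize; a quick sketch:

```javascript
const precomposed = "\u00E9";   // é as a single code point
const decomposed = "e\u0301";   // e + combining acute accent

console.log(precomposed === decomposed);   // false (different code point sequences)
console.log(precomposed.length);           // 1
console.log(decomposed.length);            // 2

// Normalizing to NFC composes the pair into the single code point
console.log(decomposed.normalize("NFC") === precomposed);  // true
```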

For true user-visible character counts (grapheme clusters), use Intl.Segmenter:

const segmenter = new Intl.Segmenter();

function graphemeCount(str) {
    return [...segmenter.segment(str)].length;
}

graphemeCount("Hello");       // 5
graphemeCount("Hello 👍");    // 7
graphemeCount("🏳️‍🌈");       // 1 (the whole flag is one grapheme cluster)
graphemeCount("é");           // 1 (even if it's two code points: e + combining)

Intl.Segmenter is available in all modern browsers (Chrome 87+, Firefox 125+, Safari 14.1+) and Node.js 16+.

Regex and Supplementary Characters

Regular expressions in JavaScript operate on UTF-16 code units by default. The . metacharacter matches one code unit, not one character:

/^.$/.test("a");    // true
/^.$/.test("é");    // true (BMP)
/^.$/.test("👍");   // false — 👍 is two code units

// The u flag enables Unicode mode — patterns match code points
/^.$/u.test("👍");  // true
/^.$/u.test("𝕳");   // true

// Character classes also change behavior
/[\u{1F300}-\u{1F9FF}]/u.test("👍");  // true (emoji range)

The u flag in regex enables Unicode mode, which:

  • Makes . match any Unicode code point, including supplementary characters
  • Enables the \u{NNNNN} syntax for code points above U+FFFF
  • Enables \p{…} Unicode property escapes (note that \w and \d remain ASCII-only even with u)
  • Makes quantifiers apply to whole code points, so /👍+/u repeats the full character rather than only its trailing surrogate

For production code handling user-generated content, always use the u flag in regex patterns that may encounter emoji or other supplementary characters.
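Unicode property escapes (added in ES2018, and requiring the u flag) are often more robust than hand-maintained code point ranges; a sketch:

```javascript
// \p{...} matches by Unicode property rather than by explicit range
console.log(/\p{Emoji_Presentation}/u.test("👍"));   // true
console.log(/^\p{Script=Greek}+$/u.test("πλάτων"));  // true
console.log(/^\p{L}+$/u.test("日本語"));              // true (Unicode letters, unlike \w)
```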

The .NET and Java Situation

Java uses UTF-16 internally with the same char type representing a single UTF-16 code unit. The pattern is identical to JavaScript:

String s = "👍";
s.length();              // 2
s.charAt(0);             // '\uD83D' (high surrogate, not a real character)
s.codePointAt(0);        // 128077 (correct)
s.codePointCount(0, s.length());  // 1 (correct count of code points)

// Correct iteration
s.codePoints().forEach(cp -> System.out.println(Character.toString(cp)));

.NET is similar. string.Length returns the UTF-16 code unit count. Use EnumerateRunes() for code-point-correct iteration, or StringInfo.GetTextElementEnumerator for user-perceived text elements (grapheme clusters):

var s = "👍";
s.Length;                           // 2
s.EnumerateRunes().Count();         // 1 (correct)

foreach (var rune in s.EnumerateRunes()) {
    Console.WriteLine(rune);        // 👍
}

UTF-16 in Files: When Byte Order Matters

When UTF-16 is stored in files, byte order becomes relevant. A UTF-16 file must declare whether it's big-endian (UTF-16 BE) or little-endian (UTF-16 LE), typically via a BOM:

  • UTF-16 BE BOM: FE FF
  • UTF-16 LE BOM: FF FE

Without a BOM, the receiver must either be told the byte order or guess. This is one of UTF-16's practical drawbacks compared to UTF-8 — there's no ambiguity in UTF-8 because it has no byte order.

# Python handles both UTF-16 variants
with open('file.txt', encoding='utf-16') as f:  # Auto-detects LE/BE from BOM
    content = f.read()

with open('file.txt', encoding='utf-16-le') as f:  # Explicit little-endian
    content = f.read()

# Encoding detection: check the leading bytes
with open('file.txt', 'rb') as f:
    bom = f.read(3)                 # UTF-8 BOM is 3 bytes; UTF-16 BOMs are 2
    if bom[:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif bom[:2] == b'\xfe\xff':
        print("UTF-16 BE")
    elif bom == b'\xef\xbb\xbf':
        print("UTF-8 with BOM")

When Does UTF-16 Actually Matter?

For most web developers, UTF-16's main relevance is JavaScript string behavior. But UTF-16 files appear in specific contexts:

Windows APIs: the wide-character ("W") Win32 APIs take UTF-16 LE strings. Files produced by Windows tools (Notepad saved as "Unicode", some XML/XSLT processors) may be UTF-16.

Microsoft Office documents: the legacy binary formats (.doc, .xls) store text as UTF-16; the XML parts inside .docx and .xlsx declare their own encoding and are usually UTF-8.

Java .class files and .jar files: Java's Constant Pool uses Modified UTF-8 (a variant that encodes null as 0xC0 0x80 and uses CESU-8 for supplementary characters), but the runtime uses UTF-16 internally.

Database exports: Some database export tools on Windows default to UTF-16. If you open a CSV in Excel from a UTF-16 source, Excel handles it correctly — but Python's csv module needs explicit encoding.

import csv

# Reading a UTF-16 CSV
with open('windows_export.csv', encoding='utf-16') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

Practical Summary for JavaScript Developers

Operation → use this:

  • String length in code units: str.length
  • String length in code points: [...str].length or Array.from(str).length
  • Grapheme cluster count (user-visible): [...new Intl.Segmenter().segment(str)].length
  • Iterate characters correctly: for...of or Array.from()
  • Get code point at index: str.codePointAt(i)
  • Get character from code point: String.fromCodePoint(cp)
  • Regex matching emoji/supplementary: always add the u flag
  • Slice without cutting surrogates: [...str].slice(start, end).join('') or Intl.Segmenter

You can inspect any character's UTF-16 representation — code units, surrogate pairs, and byte sequences — with our Encoding Converter and Character Counter tools.

The underlying lesson: JavaScript (and Java, and .NET) strings are not sequences of characters in the linguistic sense. They're sequences of UTF-16 code units, and the difference only matters when your text contains code points above U+FFFF. In practice, that means: emoji, mathematical symbols, some historic scripts, and CJK extension characters. If your application handles user-generated text in 2024, that's no longer an edge case.


Next in Series: Legacy Encodings: Latin-1, Windows-1252, Shift-JIS, and When You Still Need Them — navigating the encodings that predate Unicode and still show up in production data.
