A surrogate pair is a mechanism in UTF-16 encoding that allows characters outside the Basic Multilingual Plane (BMP) -- those with code points above U+FFFF -- to be represented using two 16-bit code units. Since UTF-16 was originally conceived as a fixed-width 16-bit encoding, surrogates were a later addition to handle the full Unicode range without breaking existing systems.
Why Surrogates Exist
UTF-16 was designed when Unicode was expected to fit within 65,536 code points (the BMP). When Unicode expanded to 1,114,112 code points (U+000000-U+10FFFF), a way was needed to encode the additional code points using the existing 16-bit infrastructure. The Unicode consortium reserved two ranges of 2,048 code points each:
- High surrogates: U+D800-U+DBFF (1,024 values)
- Low surrogates: U+DC00-U+DFFF (1,024 values)
These ranges are permanently reserved and are never assigned to actual characters -- they exist solely as surrogate code units. A high surrogate followed by a low surrogate together encode one supplementary character.
Decoding a Surrogate Pair
To decode a surrogate pair to a Unicode code point:
- Take the high surrogate
H(U+D800-U+DBFF) - Take the low surrogate
L(U+DC00-U+DFFF) - Code point =
(H - 0xD800) x 0x400 + (L - 0xDC00) + 0x10000
Example: emoji U+1F600
| Step | Value |
|---|---|
| Code point | 0x1F600 |
| High surrogate | 0xD83D |
| Low surrogate | 0xDE00 |
| Encoded in UTF-16 LE | D8 3D DE 00 |
# Python: manually computing surrogates
code_point = 0x1F600
high = ((code_point - 0x10000) >> 10) + 0xD800
low = ((code_point - 0x10000) & 0x3FF) | 0xDC00
print(hex(high), hex(low)) # 0xd83d 0xde00
# Python handles this automatically
text = '\U0001F600'
print(text.encode('utf-16-le').hex()) # 3dd800de
// JavaScript exposes surrogate pairs directly
const emoji = '\uD83D\uDE00';
console.log(emoji.charCodeAt(0).toString(16)); // 'd83d' (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // 'de00' (low surrogate)
console.log(emoji.codePointAt(0).toString(16)); // '1f600' (full code point)
// Safe character counting (handles surrogates)
console.log([...'Hello \uD83D\uDE00'].length); // 7
console.log('Hello \uD83D\uDE00'.length); // 8 (counts code units)
Lone Surrogates and WTF-8
A lone surrogate -- a high surrogate not followed by a low surrogate, or vice versa -- is technically invalid in well-formed Unicode. However, JavaScript strings can contain lone surrogates (since they are sequences of arbitrary UTF-16 code units). Encoding a JavaScript string with a lone surrogate to UTF-8 via TextEncoder will replace it with U+FFFD.
WTF-8 (Wobbly Transformation Format) is an informal encoding variant that extends UTF-8 to losslessly round-trip lone surrogates, used internally by some systems (notably Rust's OsString on Windows).
Practical Impact
Surrogate pairs affect any code that processes string length, slicing, or iteration. In JavaScript, slicing a string at an index that falls between a surrogate pair produces a lone (invalid) surrogate. Always use Array.from() or for...of when iterating over strings that may contain emoji or other supplementary characters.