SymbolFYI

Surrogate Pair

Encoding
التعريف

A pair of 16-bit code units in UTF-16 that together represent a single character outside the Basic Multilingual Plane (above U+FFFF).

A surrogate pair is a mechanism in UTF-16 encoding that allows characters outside the Basic Multilingual Plane (BMP) -- those with code points above U+FFFF -- to be represented using two 16-bit code units. Since UTF-16 was originally conceived as a fixed-width 16-bit encoding, surrogates were a later addition to handle the full Unicode range without breaking existing systems.

Why Surrogates Exist

UTF-16 was designed when Unicode was expected to fit within 65,536 code points (the BMP). When Unicode expanded to 1,114,112 code points (U+000000-U+10FFFF), a way was needed to encode the additional code points using the existing 16-bit infrastructure. The Unicode consortium reserved two ranges of 2,048 code points each:

  • High surrogates: U+D800-U+DBFF (1,024 values)
  • Low surrogates: U+DC00-U+DFFF (1,024 values)

These ranges are permanently reserved and are never assigned to actual characters -- they exist solely as surrogate code units. A high surrogate followed by a low surrogate together encode one supplementary character.

Decoding a Surrogate Pair

To decode a surrogate pair to a Unicode code point:

  1. Take the high surrogate H (U+D800-U+DBFF)
  2. Take the low surrogate L (U+DC00-U+DFFF)
  3. Code point = (H - 0xD800) x 0x400 + (L - 0xDC00) + 0x10000

Example: emoji U+1F600

Step Value
Code point 0x1F600
High surrogate 0xD83D
Low surrogate 0xDE00
Encoded in UTF-16 LE D8 3D DE 00
# Python: manually computing surrogates
code_point = 0x1F600
high = ((code_point - 0x10000) >> 10) + 0xD800
low  = ((code_point - 0x10000) & 0x3FF) | 0xDC00
print(hex(high), hex(low))  # 0xd83d 0xde00

# Python handles this automatically
text = '\U0001F600'
print(text.encode('utf-16-le').hex())  # 3dd800de
// JavaScript exposes surrogate pairs directly
const emoji = '\uD83D\uDE00';
console.log(emoji.charCodeAt(0).toString(16)); // 'd83d' (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // 'de00' (low surrogate)
console.log(emoji.codePointAt(0).toString(16)); // '1f600' (full code point)

// Safe character counting (handles surrogates)
console.log([...'Hello \uD83D\uDE00'].length); // 7
console.log('Hello \uD83D\uDE00'.length);      // 8 (counts code units)

Lone Surrogates and WTF-8

A lone surrogate -- a high surrogate not followed by a low surrogate, or vice versa -- is technically invalid in well-formed Unicode. However, JavaScript strings can contain lone surrogates (since they are sequences of arbitrary UTF-16 code units). Encoding a JavaScript string with a lone surrogate to UTF-8 via TextEncoder will replace it with U+FFFD.

WTF-8 (Wobbly Transformation Format) is an informal encoding variant that extends UTF-8 to losslessly round-trip lone surrogates, used internally by some systems (notably Rust's OsString on Windows).

Practical Impact

Surrogate pairs affect any code that processes string length, slicing, or iteration. In JavaScript, slicing a string at an index that falls between a surrogate pair produces a lone (invalid) surrogate. Always use Array.from() or for...of when iterating over strings that may contain emoji or other supplementary characters.

الرموز ذات الصلة

المصطلحات ذات الصلة

الأدوات ذات الصلة

الأدلة ذات الصلة