SymbolFYI

Surrogate Pair

Encoding
परिभाषा

A pair of 16-bit code units in UTF-16 that together represent a single character outside the Basic Multilingual Plane (above U+FFFF).

A surrogate pair is a mechanism in UTF-16 encoding that allows characters outside the Basic Multilingual Plane (BMP) -- those with code points above U+FFFF -- to be represented using two 16-bit code units. Since UTF-16 was originally conceived as a fixed-width 16-bit encoding, surrogates were a later addition to handle the full Unicode range without breaking existing systems.

Why Surrogates Exist

UTF-16 was designed when Unicode was expected to fit within 65,536 code points (the BMP). When Unicode expanded to 1,114,112 code points (U+000000-U+10FFFF), a way was needed to encode the additional code points using the existing 16-bit infrastructure. The Unicode consortium reserved two ranges of 2,048 code points each:

  • High surrogates: U+D800-U+DBFF (1,024 values)
  • Low surrogates: U+DC00-U+DFFF (1,024 values)

These ranges are permanently reserved and are never assigned to actual characters -- they exist solely as surrogate code units. A high surrogate followed by a low surrogate together encode one supplementary character.

Decoding a Surrogate Pair

To decode a surrogate pair to a Unicode code point:

  1. Take the high surrogate H (U+D800-U+DBFF)
  2. Take the low surrogate L (U+DC00-U+DFFF)
  3. Code point = (H - 0xD800) x 0x400 + (L - 0xDC00) + 0x10000

Example: emoji U+1F600

Step Value
Code point 0x1F600
High surrogate 0xD83D
Low surrogate 0xDE00
Encoded in UTF-16 LE D8 3D DE 00
# Python: manually computing surrogates
code_point = 0x1F600
high = ((code_point - 0x10000) >> 10) + 0xD800
low  = ((code_point - 0x10000) & 0x3FF) | 0xDC00
print(hex(high), hex(low))  # 0xd83d 0xde00

# Python handles this automatically
text = '\U0001F600'
print(text.encode('utf-16-le').hex())  # 3dd800de
// JavaScript exposes surrogate pairs directly
const emoji = '\uD83D\uDE00';
console.log(emoji.charCodeAt(0).toString(16)); // 'd83d' (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // 'de00' (low surrogate)
console.log(emoji.codePointAt(0).toString(16)); // '1f600' (full code point)

// Safe character counting (handles surrogates)
console.log([...'Hello \uD83D\uDE00'].length); // 7
console.log('Hello \uD83D\uDE00'.length);      // 8 (counts code units)

Lone Surrogates and WTF-8

A lone surrogate -- a high surrogate not followed by a low surrogate, or vice versa -- is technically invalid in well-formed Unicode. However, JavaScript strings can contain lone surrogates (since they are sequences of arbitrary UTF-16 code units). Encoding a JavaScript string with a lone surrogate to UTF-8 via TextEncoder will replace it with U+FFFD.

WTF-8 (Wobbly Transformation Format) is an informal encoding variant that extends UTF-8 to losslessly round-trip lone surrogates, used internally by some systems (notably Rust's OsString on Windows).

Practical Impact

Surrogate pairs affect any code that processes string length, slicing, or iteration. In JavaScript, slicing a string at an index that falls between a surrogate pair produces a lone (invalid) surrogate. Always use Array.from() or for...of when iterating over strings that may contain emoji or other supplementary characters.

संबंधित प्रतीक

संबंधित शब्द

संबंधित टूल

संबंधित गाइड