Unicode in JavaScript
JavaScript's string type uses UTF-16 internally. This has practical consequences for characters outside the Basic Multilingual Plane (BMP): any code point above U+FFFF, including most emoji, takes two code units instead of one. ES2015 and later added several APIs that handle these characters correctly.
JavaScript String Encoding
JavaScript strings are sequences of UTF-16 code units. Characters in the BMP (U+0000–U+FFFF) occupy one code unit; supplementary characters (U+10000–U+10FFFF) are encoded as a surrogate pair, a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). The length property counts code units, not characters:
'A'.length // 1 (U+0041, single code unit)
'é'.length // 1 (U+00E9, single code unit)
'☃'.length // 1 (U+2603, single code unit)
'😀'.length // 2 (U+1F600, surrogate pair: 0xD83D 0xDE00)
'👨\u200D👩\u200D👧\u200D👦'.length // 11 (ZWJ sequence: 4 emoji × 2 code units + 3 × U+200D)
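The surrogate pair for a supplementary code point follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal sketch of that calculation, where toSurrogatePair is a hypothetical helper name and not a built-in:

```javascript
// Hypothetical helper (not a built-in): computes the UTF-16 surrogate pair
// for a supplementary code point (U+10000 to U+10FFFF).
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}'); // true
```

In practice String.fromCodePoint does this conversion for you; the sketch only shows where the two code units come from.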
Unicode Escape Sequences
// 4-digit hex escape (BMP only)
'\u2603' // '☃' (U+2603)
// ES2015 brace notation (any code point)
'\u{2603}' // '☃' (U+2603)
'\u{1F600}' // '😀' (U+1F600)
// Code point to character
String.fromCodePoint(0x2603) // '☃'
String.fromCodePoint(0x1F600) // '😀'
String.fromCodePoint(65, 66) // 'AB'
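Going the other way, a string can be turned back into brace-notation escapes, which is handy for logging or for source files that must stay ASCII. The following is a sketch only; escapeNonAscii is a hypothetical helper name, and the choice to leave ASCII untouched is an assumption made for illustration:

```javascript
// Hypothetical helper: rewrites every non-ASCII code point as a \u{...} escape.
function escapeNonAscii(str) {
  let out = '';
  for (const ch of str) {                 // for...of yields whole code points
    const cp = ch.codePointAt(0);
    out += cp < 0x80 ? ch : `\\u{${cp.toString(16).toUpperCase()}}`;
  }
  return out;
}

console.log(escapeNonAscii('A\u2603\u{1F600}')); // A\u{2603}\u{1F600}
```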
Code Points vs. Code Units
// Old API: works with code units (UTF-16)
'😀'.charCodeAt(0) // 55357 (0xD83D, high surrogate)
'😀'.charCodeAt(1) // 56832 (0xDE00, low surrogate)
// ES2015 API: works with code points
'😀'.codePointAt(0) // 128512 (0x1F600, correct)
'😀'.codePointAt(1) // 56832 (0xDE00: index 1 lands on the low surrogate, so only that code unit is returned)
// Convert code point to character
String.fromCharCode(0xD83D, 0xDE00) // '😀' (manual surrogate pair)
String.fromCodePoint(0x1F600) // '😀' (clean code point API)
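Because codePointAt takes a code unit index, a manual loop has to advance by two after reading a supplementary character. A minimal sketch of that pattern (codePointsOf is a hypothetical helper name):

```javascript
// Hypothetical helper: lists the code points of a string using an index loop.
function codePointsOf(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(cp);
    i += cp > 0xFFFF ? 2 : 1;  // skip the low surrogate of a pair
  }
  return result;
}

console.log(codePointsOf('\u2603\u{1F600}A').map(cp => cp.toString(16)));
// [ '2603', '1f600', '41' ]
```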
Iterating Over Characters
Do not use numeric index iteration for strings that may contain supplementary characters:
const text = '☃😀A';
// WRONG: iterates code units, breaks surrogate pairs
for (let i = 0; i < text.length; i++) {
  console.log(text[i]); // '☃', '\uD83D', '\uDE00', 'A'
}
// CORRECT: for...of iterates code points
for (const char of text) {
  console.log(char); // '☃', '😀', 'A'
}
// Spread also uses the iterator
[...text] // ['☃', '😀', 'A']
Array.from(text) // ['☃', '😀', 'A']
// Code point count (a ZWJ emoji sequence or a base letter plus combining mark still counts as several)
Array.from(text).length // 3
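The same iterator is useful for truncation, where slicing by code unit can cut a surrogate pair in half. A small sketch, with truncateCodePoints as a hypothetical helper name:

```javascript
// Hypothetical helper: keeps at most maxCodePoints code points of a string.
function truncateCodePoints(str, maxCodePoints) {
  return Array.from(str).slice(0, maxCodePoints).join('');
}

const sample = '\u2603\u{1F600}A'; // '☃😀A'

// slice() counts code units and leaves a lone high surrogate behind
console.log(sample.slice(0, 2) === '\u2603\uD83D');                 // true
// the code point version keeps the emoji intact
console.log(truncateCodePoints(sample, 2) === '\u2603\u{1F600}');   // true
```

This still works at the code point level, so it can split a grapheme cluster such as a ZWJ emoji sequence.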
Regular Expressions and the u Flag
The u flag makes regex operate on code points rather than code units:
// WITHOUT u flag: . matches code units, not code points
/^.$/.test('😀') // false (emoji is 2 code units)
/^..$/.test('😀') // true (matches both surrogate code units)
// WITH u flag: . matches full code points
/^.$/u.test('😀') // true
// Unicode property escapes (ES2018, requires u flag)
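/\p{Emoji}/u.test('😀') // true (note: \p{Emoji} also matches the digits 0-9, '#' and '*'; \p{Extended_Pictographic} excludes them)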
/\p{Script=Latin}/u.test('A') // true
/\p{Script=Cyrillic}/u.test('А') // true
/\p{Number}/u.test('²') // true
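Property escapes make script-independent matching straightforward. As an illustration, the sketch below extracts runs of letters, combining marks and numbers; findWords is a hypothetical helper name, and treating exactly these three categories as word characters is an assumption, not a full word-breaking algorithm:

```javascript
// Hypothetical helper: extracts runs of letters, marks and numbers in any script.
function findWords(text) {
  return text.match(/[\p{L}\p{M}\p{N}]+/gu) || [];
}

const input = 'na\u00EFve \u041C\u043E\u0441\u043A\u0432\u0430 2024!'; // 'naïve Москва 2024!'

console.log(findWords(input));    // [ 'naïve', 'Москва', '2024' ]
// \w only covers [A-Za-z0-9_], so it splits on 'ï' and skips Cyrillic entirely
console.log(input.match(/\w+/g)); // [ 'na', 've', '2024' ]
```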
Normalization
// é can be represented two ways:
const nfc = '\u00E9'; // Precomposed: é (1 code point)
const nfd = 'e\u0301'; // Decomposed: e + combining acute (2 code points)
nfc === nfd // false (different code points)
nfc.normalize() === nfd.normalize() // true (both normalize to NFC)
nfc.normalize('NFD') === nfd.normalize('NFD') // true
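When comparing user input, it usually helps to normalize both sides to the same form first. A minimal sketch, where equalsNormalized is a hypothetical helper name and NFC as the default form is an assumption:

```javascript
// Hypothetical helper: compares two strings after normalizing both to the same form.
function equalsNormalized(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

console.log(equalsNormalized('\u00E9', 'e\u0301'));     // true
console.log('e\u0301'.normalize('NFC').length);         // 1
console.log('\u00E9'.normalize('NFD').length);          // 2

// Compatibility forms (NFKC/NFKD) also fold characters such as ligatures
console.log(equalsNormalized('\uFB01', 'fi'));          // false ('ﬁ' ligature vs 'fi')
console.log(equalsNormalized('\uFB01', 'fi', 'NFKC'));  // true
```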
Grapheme Clusters with Intl.Segmenter
To count or split user-perceived characters (grapheme clusters), for example a ZWJ emoji sequence that renders as a single glyph, use Intl.Segmenter:
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...segmenter.segment('👨\u200D👩\u200D👧\u200D👦')];
segments.length // 1 (one family emoji, one grapheme cluster)
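To put the three ways of counting side by side, the sketch below wraps the segmenter in a small function. countGraphemes is a hypothetical helper name, and the code assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js releases do):

```javascript
// Hypothetical helper: counts grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count++;
  return count;
}

// Family emoji written with explicit ZWJ (U+200D) joiners
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(family.length);              // 11 (code units)
console.log(Array.from(family).length);  // 7  (code points)
console.log(countGraphemes(family));     // 1  (grapheme cluster)
console.log(countGraphemes('e\u0301'));  // 1  (base letter + combining mark)
```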