JavaScript and Unicode: Strings, Code Points, and Grapheme Clusters
JavaScript strings are sequences of UTF-16 code units. This was a reasonable design choice in the mid-1990s, when Unicode was expected to fit in 65,536 code points, but the Unicode code space has since grown to 1,114,112 code points (U+0000 to U+10FFFF). As a result, JavaScript inherited a string model that requires special handling for emoji, many mathematical symbols, and other characters in common use today.
Understanding this model is not optional for internationalized applications. It determines how string length is calculated, how for loops iterate, how substrings are split, and whether your UI components display characters correctly.
UTF-16 and Surrogate Pairs
JavaScript stores strings as sequences of 16-bit code units. Unicode code points from U+0000 to U+FFFF (the Basic Multilingual Plane, or BMP) each occupy exactly one code unit. Code points from U+10000 to U+10FFFF — which include most emoji, many mathematical symbols, and some historic scripts — require two code units called a surrogate pair.
// BMP character: one code unit
"A".length // 1 (U+0041)
"字".length // 1 (U+5B57)
"€".length // 1 (U+20AC)
// Supplementary character: two code units (surrogate pair)
"😀".length // 2 (U+1F600, encoded as U+D83D U+DE00)
"𝄞".length // 2 (musical symbol G clef, U+1D11E)
"𝕳".length // 2 (mathematical bold fraktur capital H, U+1D573)
// String with both BMP and supplementary:
"Hi 😀".length // 5 — H(1) + i(1) + space(1) + emoji(2)
A surrogate pair consists of a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). These code points are reserved exclusively for this encoding mechanism — they never represent actual characters on their own.
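The mapping between a supplementary code point and its surrogate pair is plain arithmetic. The following is a minimal sketch of that calculation; `toSurrogatePair` and `fromSurrogatePair` are hypothetical helper names used for illustration, not built-in functions.

```javascript
// Hypothetical helpers illustrating the UTF-16 surrogate arithmetic.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a supplementary code point');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

toSurrogatePair(0x1F600);           // [0xD83D, 0xDE00]
fromSurrogatePair(0xD83D, 0xDE00);  // 0x1F600 (128512)
```

In everyday code, `String.fromCodePoint()` and `codePointAt()` perform this conversion for you; the sketch only makes the encoding visible.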
The Right Way to Iterate Over Characters
The old for loop iterates over code units, breaking surrogate pairs apart:
const str = "Hi 😀";
// Wrong: iterates code units, splits emoji
for (let i = 0; i < str.length; i++) {
  console.log(str[i]); // H, i, ' ', '\uD83D', '\uDE00'
  console.log(str.charCodeAt(i)); // 72, 105, 32, 55357, 56832
}
// Correct: for...of iterates code points
for (const char of str) {
  console.log(char); // H, i, ' ', 😀
  console.log(char.codePointAt(0)); // 72, 105, 32, 128512
}
// Spread operator also iterates code points
const chars = [...str]; // ['H', 'i', ' ', '😀']
chars.length // 4 — code point count
The for...of loop and spread operator use the string's iterator protocol, which correctly handles surrogate pairs and yields complete code points.
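The same iterator can be used directly, and it makes code-point-based slicing straightforward. This is a small sketch; `codePointSlice` is a hypothetical helper name, and it still splits grapheme clusters such as combining sequences.

```javascript
// Pull values from the string iterator by hand.
const iterator = 'Hi 😀'[Symbol.iterator]();
iterator.next(); // { value: 'H', done: false }

// Hypothetical helper: slice by code point positions instead of code units.
function codePointSlice(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

codePointSlice('Hi 😀!', 3, 4); // '😀'
'Hi 😀!'.slice(3, 4);           // '\uD83D' (a lone high surrogate)
```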
codePointAt vs. charCodeAt
const emoji = "😀";
// charCodeAt: returns the code unit value (wrong for surrogates)
emoji.charCodeAt(0) // 55357 — high surrogate U+D83D
emoji.charCodeAt(1) // 56832 — low surrogate U+DE00
// codePointAt: returns the full code point
emoji.codePointAt(0) // 128512 — U+1F600
emoji.codePointAt(1) // 56832 — low surrogate (if accessed directly, still wrong)
// Safe code point extraction:
function getCodePoints(str) {
  return [...str].map(ch => ch.codePointAt(0));
}
getCodePoints("Hi 😀") // [72, 105, 32, 128512]
Similarly, use String.fromCodePoint() rather than String.fromCharCode() when constructing strings from code points:
// fromCharCode breaks for supplementary code points:
String.fromCharCode(128512) // '\uF600' (wrong: 128512 is truncated to 16 bits, giving 0xF600)
String.fromCharCode(0xD83D, 0xDE00) // '😀' — you'd need to compute surrogates manually
// fromCodePoint handles everything correctly:
String.fromCodePoint(128512) // '😀'
String.fromCodePoint(65, 66, 67) // 'ABC'
String.fromCodePoint(0x1F600) // '😀'
Grapheme Clusters: What Users See
Code points are still not the same as what users perceive as "one character." A grapheme cluster is the smallest unit of text that is meaningful to a user — what they see when they press the arrow key once in a text field.
Examples of multi-code-point grapheme clusters:
// Combining mark: base character + combining accent
const e_acute = "e\u0301"; // e + combining acute accent (two code points)
[...e_acute].length // 2 — but it looks like one character: é
// Skin tone modifier: emoji + modifier
const waving_hand = "👋🏽"; // U+1F44B + U+1F3FD
[...waving_hand].length // 2 — one "character" visually
// ZWJ sequence: multiple emoji joined by zero-width joiner
const family = "👨\u200D👩\u200D👧"; // man + ZWJ + woman + ZWJ + girl (ZWJ is U+200D, written as an escape so it stays visible)
[...family].length // 5 — but users see one "character"
// Country flag: two regional indicator symbols
const us_flag = "🇺🇸"; // U+1F1FA + U+1F1F8
[...us_flag].length // 2 — renders as one flag
For user-facing string operations — character counting, text truncation, cursor positioning — you need grapheme cluster segmentation.
Intl.Segmenter: The Modern Solution
Intl.Segmenter (part of the ECMAScript Internationalization API, ECMA-402; available in current versions of all major browsers and in Node.js 16+) provides locale-aware text segmentation:
// Grapheme segmentation
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
function countGraphemes(str) {
  return [...segmenter.segment(str)].length;
}
countGraphemes("café") // 4 — even if 'é' is decomposed
countGraphemes("👨\u200D💻") // 1 (a ZWJ sequence is one grapheme)
countGraphemes("🇺🇸") // 1 — flag is one grapheme
countGraphemes("Hi 😀") // 4 — H, i, space, emoji
// Truncate to N graphemes without breaking sequences
function truncateGraphemes(str, maxLength, suffix = '…') {
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxLength) return str;
  return segments
    .slice(0, maxLength)
    .map(s => s.segment)
    .join('') + suffix;
}
truncateGraphemes("Hello 👨\u200D💻 World", 8) // "Hello 👨‍💻 …" (the ZWJ emoji counts as one grapheme)
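Each segment object produced by `segment()` also carries an `index` property: the code unit offset at which that segment starts in the input. The sketch below uses it to list the positions where a text cursor could stop; `graphemeBoundaries` is a hypothetical helper name, and the locale default is an assumption.

```javascript
// Hypothetical helper: code unit offsets at which a cursor may stop.
function graphemeBoundaries(str, locale = 'en') {
  const seg = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  const offsets = [];
  for (const { index } of seg.segment(str)) {
    offsets.push(index);
  }
  offsets.push(str.length); // position after the last grapheme
  return offsets;
}

graphemeBoundaries('a😀e\u0301'); // [0, 1, 3, 5]
```

Because the offsets are in code units, they can be passed straight to `slice()` or `substring()` without cutting a surrogate pair or a combining sequence in half.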
For word and sentence segmentation:
const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });
const sentenceSegmenter = new Intl.Segmenter('ja', { granularity: 'sentence' });
// Word segmentation (useful for word count, search highlighting)
const words = [...wordSegmenter.segment("Hello, world! How are you?")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ["Hello", "world", "How", "are", "you"]
// Japanese sentence segmentation (no spaces between words in Japanese)
const sentences = [...sentenceSegmenter.segment("日本語のテキスト。次の文。")]
  .map(s => s.segment);
// ["日本語のテキスト。", "次の文。"]
String Normalization
The same visual character can be encoded multiple ways. Unicode normalization resolves this before comparison or storage:
// Precomposed (NFC): single code point é = U+00E9
const nfc = "\u00E9";
// Decomposed (NFD): e + combining accent = U+0065 + U+0301
const nfd = "e\u0301";
nfc === nfd // false (different code unit sequences)
nfc.length // 1
nfd.length // 2
nfc.normalize('NFC') === nfd.normalize('NFC') // true
// NFC: Canonical Decomposition, followed by Canonical Composition (most common for storage)
// NFD: Canonical Decomposition (good for processing accents)
// NFKC: Compatibility Decomposition + Composition (for search/comparison)
// NFKD: Compatibility Decomposition
// NFKC normalizes compatibility equivalents:
"\uFB01".normalize('NFKC') // 'fi' (the fi ligature U+FB01 becomes two letters)
"\uFF12".normalize('NFKC') // '2' (fullwidth digit two becomes the ASCII digit)
Always normalize before:
- Comparing strings that may come from different sources
- Storing strings in a database (choose NFC for web content)
- Measuring string length for display purposes
// Production pattern: normalize on input
function normalizeInput(str) {
  return str.normalize('NFC').trim();
}
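The normalization forms above can be combined for comparison and accent handling. The following is a sketch under stated assumptions: `canonicallyEqual` and `stripCombiningMarks` are hypothetical helper names, and removing marks is lossy, so it suits loose search matching rather than storage (in many scripts the marks carry meaning).

```javascript
// Hypothetical helper: compare two strings after canonical normalization.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// Hypothetical helper: decompose, drop combining marks, then recompose.
function stripCombiningMarks(str) {
  return str
    .normalize('NFD')         // é becomes e + U+0301
    .replace(/\p{M}/gu, '')   // remove characters in the Mark category
    .normalize('NFC');
}

canonicallyEqual('\u00E9', 'e\u0301'); // true
stripCombiningMarks('café');           // 'cafe'
```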
Practical: Safe String Operations
Safe string length for UI
function graphemeLength(str) {
  return [...new Intl.Segmenter().segment(str)].length;
}
// For older environments without Intl.Segmenter:
function codePointLength(str) {
  return [...str].length; // better than .length but ignores combining marks
}
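The two approaches can be combined behind a feature check. This is a minimal sketch; `displayLength` and `hasSegmenter` are hypothetical names, and the fallback is only an approximation.

```javascript
// Detect Intl.Segmenter once, at module load.
const hasSegmenter =
  typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';

// Hypothetical helper: count graphemes when possible, code points otherwise.
function displayLength(str) {
  if (hasSegmenter) {
    const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    let count = 0;
    for (const _ of seg.segment(str)) count++;
    return count;
  }
  return [...str].length; // approximation: combining marks count separately
}

displayLength('Hi 😀'); // 4 in either branch
```

Counting in a loop avoids building an intermediate array, which may matter for long inputs.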
Safe substring/truncation
function safeSubstring(str, start, end) {
  const segments = [...new Intl.Segmenter().segment(str)];
  return segments
    .slice(start, end)
    .map(s => s.segment)
    .join('');
}
safeSubstring("Hi 😀 there", 0, 5) // "Hi 😀 "
Safe includes/indexOf for non-BMP
// String.prototype.includes works correctly:
"Hi 😀 there".includes("😀") // true
// indexOf returns a code unit index, so use it with care:
"Hi 😀 there".indexOf("😀") // 3 (equals the code point index only because every preceding character is in the BMP)
"😀 Hi".indexOf("H") // 3 — because emoji took indices 0 and 1
Reversing a string
// Wrong — breaks surrogate pairs and combining marks:
"café 😀".split('').reverse().join('')
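// → '\uDE00\uD83D éfac' (the emoji's surrogates end up in the wrong order, leaving two lone surrogates that render as replacement characters)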
// Correct — reverse grapheme clusters:
function reverseString(str) {
  return [...new Intl.Segmenter().segment(str)]
    .map(s => s.segment)
    .reverse()
    .join('');
}
reverseString("café 😀") // "😀 éfac" — é preserved as single grapheme
Template Literals and Unicode
Template literals handle Unicode correctly since they use the same JS string model. But be careful with tagged templates processing raw strings:
// Tagged template raw strings see \uXXXX escapes literally:
String.raw`\u2014` // "\\u2014" — backslash + u + 2014
// Untagged template: escape is interpreted
`\u2014` // "—" — em dash
// Unicode code point escape in templates:
`\u{1F600}` // "😀" (the \u{...} code point escape works in template literals, as it does in ordinary string literals)
Encoding and Decoding
// TextEncoder / TextDecoder (Web API + Node.js)
const encoder = new TextEncoder(); // always UTF-8
const decoder = new TextDecoder(); // UTF-8 by default
const bytes = encoder.encode("Hello 😀");
// Uint8Array: [72, 101, 108, 108, 111, 32, 240, 159, 152, 128]
// Note: emoji is 4 bytes in UTF-8
const back = decoder.decode(bytes); // "Hello 😀"
// For other encodings:
const win1252 = new TextDecoder('windows-1252');
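Two details of these APIs are easy to miss: `TextEncoder` replaces lone surrogates with U+FFFD, and `TextDecoder` only reports invalid bytes when constructed with `fatal: true`. The sketch below illustrates both; `utf8ByteLength` and `decodeUtf8Strict` are hypothetical helper names, and returning `null` on failure is an arbitrary choice for the example.

```javascript
// Hypothetical helper: size of a string once encoded as UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

utf8ByteLength('Hello 😀'); // 10

// A lone surrogate cannot be encoded; TextEncoder substitutes U+FFFD.
new TextEncoder().encode('\uD83D'); // Uint8Array [239, 191, 189]

// Hypothetical helper: strict decoding that rejects invalid UTF-8
// instead of silently inserting U+FFFD.
function decodeUtf8Strict(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    return null; // invalid UTF-8
  }
}

decodeUtf8Strict(new Uint8Array([0xE2, 0x82, 0xAC])); // '€'
decodeUtf8Strict(new Uint8Array([0xFF]));             // null
```

Byte length, rather than `.length`, is what matters when a database column or network protocol imposes a size limit in bytes.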
Use the SymbolFYI Character Counter to see the code point, UTF-16 encoding, and grapheme cluster breakdown of any character or string. The Encoding Converter shows byte representations across UTF-8, UTF-16, and UTF-32.
Quick Reference
| Task | Method | Notes |
|---|---|---|
| Code point count | [...str].length | Better than .length |
| Grapheme count | Intl.Segmenter | Most accurate for UI |
| Iterate characters | for...of | Handles surrogates |
| Get code point | codePointAt(0) | Use instead of charCodeAt |
| Build from code point | String.fromCodePoint() | Use instead of fromCharCode |
| Normalize | str.normalize('NFC') | Always before comparison |
| Truncate safely | Custom with Segmenter | Avoid splitting sequences |