JavaScript and Unicode: Strings, Code Points, and Grapheme Clusters

Web Development Symbols for Developers Mar 19, 2024


JavaScript strings are sequences of UTF-16 code units. This was a reasonable design choice in the mid-1990s, when Unicode was expected to fit in 65,536 code points, but the Unicode codespace has since been extended to 1,114,112 code points (U+0000 to U+10FFFF). The result: JavaScript inherited a string model that requires special handling for emoji, mathematical symbols, and many other characters in common use today.

Understanding this model is not optional for internationalized applications. It determines how string length is calculated, how for loops iterate, how substrings are split, and whether your UI components display characters correctly.

UTF-16 and Surrogate Pairs

JavaScript stores strings as sequences of 16-bit code units. Unicode code points from U+0000 to U+FFFF (the Basic Multilingual Plane, or BMP) each occupy exactly one code unit. Code points from U+10000 to U+10FFFF — which include most emoji, many mathematical symbols, and some historic scripts — require two code units called a surrogate pair.

// BMP character: one code unit
"A".length            // 1  (U+0041)
"字".length            // 1  (U+5B57)
"€".length            // 1  (U+20AC)

// Supplementary character: two code units (surrogate pair)
"😀".length           // 2  (U+1F600, encoded as U+D83D U+DE00)
"𝄞".length            // 2  (musical symbol G clef, U+1D11E)
"𝕳".length            // 2  (mathematical bold Fraktur capital H, U+1D573)

// String with both BMP and supplementary:
"Hi 😀".length        // 5  — H(1) + i(1) + space(1) + emoji(2)

A surrogate pair consists of a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). These code points are reserved exclusively for this encoding mechanism — they never represent actual characters on their own.
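
The arithmetic behind a surrogate pair is short enough to sketch by hand. The helper names below (`toSurrogatePair`, `fromSurrogatePair`) are illustrative, not built-ins; in real code `String.fromCodePoint()` and `codePointAt()` do this work for you.

```javascript
// Sketch: split a supplementary code point into its two UTF-16 code units.
// toSurrogatePair and fromSurrogatePair are hypothetical helper names.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a supplementary code point");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits go to the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits go to the low surrogate
  return [high, low];
}

// Inverse: recombine a high/low surrogate into the original code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));       // d83d de00
console.log(fromSurrogatePair(high, low) === 0x1F600);  // true
```

The result matches the `U+D83D U+DE00` encoding shown for 😀 above.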

The Right Way to Iterate Over Characters

An index-based for loop iterates over code units, breaking surrogate pairs apart:

const str = "Hi 😀";

// Wrong: iterates code units, splits emoji
for (let i = 0; i < str.length; i++) {
  console.log(str[i]);       // H, i, ' ', '\uD83D', '\uDE00'
  console.log(str.charCodeAt(i));  // 72, 105, 32, 55357, 56832
}

// Correct: for...of iterates code points
for (const char of str) {
  console.log(char);          // H, i, ' ', 😀
  console.log(char.codePointAt(0)); // 72, 105, 32, 128512
}

// Spread operator also iterates code points
const chars = [...str];       // ['H', 'i', ' ', '😀']
chars.length                  // 4 — code point count

The for...of loop and spread operator use the string's iterator protocol, which correctly handles surrogate pairs and yields complete code points.
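
To make the iterator protocol concrete, here is a small sketch that drives the string iterator by hand. It only illustrates what `for...of` and spread do internally; the variable names are arbitrary.

```javascript
const text = "Hi 😀";

// Strings expose an iterator through Symbol.iterator.
const iterator = text[Symbol.iterator]();

// Each next() call yields one complete code point as a string,
// even when that code point occupies two code units.
const collected = [];
let step = iterator.next();
while (!step.done) {
  collected.push(step.value);
  step = iterator.next();
}

console.log(collected);         // ['H', 'i', ' ', '😀']
console.log(collected.length);  // 4
console.log(text.length);       // 5 (code units)

// Array.from uses the same iterator, so it also counts code points.
console.log(Array.from(text).length);  // 4
```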

`codePointAt` vs. `charCodeAt`

const emoji = "😀";

// charCodeAt: returns the code unit value (wrong for surrogates)
emoji.charCodeAt(0)   // 55357 — high surrogate U+D83D
emoji.charCodeAt(1)   // 56832 — low surrogate U+DE00

// codePointAt: returns the full code point
emoji.codePointAt(0)  // 128512 — U+1F600
emoji.codePointAt(1)  // 56832 (the index is still a code unit index, so this lands on the low surrogate)

// Safe code point extraction:
function getCodePoints(str) {
  return [...str].map(ch => ch.codePointAt(0));
}

getCodePoints("Hi 😀")  // [72, 105, 32, 128512]

Similarly, use String.fromCodePoint() rather than String.fromCharCode() when constructing strings from code points:

// fromCharCode breaks for supplementary code points:
String.fromCharCode(128512)        // '\uF600' (wrong: 128512 > 65535, so the value is truncated to 16 bits)
String.fromCharCode(0xD83D, 0xDE00) // '😀' — you'd need to compute surrogates manually

// fromCodePoint handles everything correctly:
String.fromCodePoint(128512)       // '😀'
String.fromCodePoint(65, 66, 67)   // 'ABC'
String.fromCodePoint(0x1F600)      // '😀'

Grapheme Clusters: What Users See

Code points are still not the same as what users perceive as "one character." A grapheme cluster is the smallest unit of text that a user perceives as a single character; in a text field, it is roughly what the cursor moves past when they press an arrow key once.

Examples of multi-code-point grapheme clusters:

// Combining mark: base character + combining accent
const e_acute = "e\u0301";  // e + combining acute accent (two code points)
[...e_acute].length          // 2 — but it looks like one character: é

// Skin tone modifier: emoji + modifier
const waving_hand = "👋🏽";  // U+1F44B + U+1F3FD
[...waving_hand].length       // 2 — one "character" visually

// ZWJ sequence: multiple emoji joined by zero-width joiner
const family = "👨‍👩‍👧";  // man + ZWJ + woman + ZWJ + girl
[...family].length             // 5 — but users see one "character"

// Country flag: two regional indicator symbols
const us_flag = "🇺🇸";  // U+1F1FA + U+1F1F8
[...us_flag].length      // 2 — renders as one flag

For user-facing string operations — character counting, text truncation, cursor positioning — you need grapheme cluster segmentation.

`Intl.Segmenter`: The Modern Solution

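Intl.Segmenter is part of the ECMAScript Internationalization API (ECMA-402, 2022 edition) and provides locale-aware text segmentation. It is available in Node.js 16+, Chrome 87+, and Safari 14.1+; Firefox only added it in version 125, so check your support targets before relying on it: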

// Grapheme segmentation
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function countGraphemes(str) {
  return [...segmenter.segment(str)].length;
}

countGraphemes("café")    // 4 — even if 'é' is decomposed
countGraphemes("👨‍💻")   // 1 — ZWJ sequence is one grapheme
countGraphemes("🇺🇸")    // 1 — flag is one grapheme
countGraphemes("Hi 😀")  // 4 — H, i, space, emoji

// Truncate to N graphemes without breaking sequences
function truncateGraphemes(str, maxLength, suffix = '…') {
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxLength) return str;
  return segments
    .slice(0, maxLength)
    .map(s => s.segment)
    .join('') + suffix;
}

truncateGraphemes("Hello 👨‍💻 World", 8)  // "Hello 👨‍💻 …"

For word and sentence segmentation:

const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });
const sentenceSegmenter = new Intl.Segmenter('ja', { granularity: 'sentence' });

// Word segmentation (useful for word count, search highlighting)
const words = [...wordSegmenter.segment("Hello, world! How are you?")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ["Hello", "world", "How", "are", "you"]

// Japanese sentence segmentation (no spaces between words in Japanese)
const sentences = [...sentenceSegmenter.segment("日本語のテキスト。次の文。")]
  .map(s => s.segment);
// ["日本語のテキスト。", "次の文。"]
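
Each segment object also carries an `index` property (the code unit offset where the segment starts), and the object returned by `segment()` has a `containing()` method. A rough sketch of how these could support cursor positioning follows; `graphemeBoundaries` is an illustrative helper name, not part of the API.

```javascript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: list every code unit offset where a cursor may legally sit.
function graphemeBoundaries(str) {
  const boundaries = [];
  for (const { index } of graphemeSegmenter.segment(str)) {
    boundaries.push(index);        // start of each grapheme cluster
  }
  boundaries.push(str.length);     // the end of the string is also a valid position
  return boundaries;
}

console.log(graphemeBoundaries("a😀b"));  // [0, 1, 3, 4]

// containing() finds the segment that covers a given code unit index.
// Index 2 is the low surrogate of the emoji, so the segment starts at 1.
const hit = graphemeSegmenter.segment("a😀b").containing(2);
console.log(hit.segment, hit.index);      // 😀 1
```

Offset 2 is missing from the boundary list because it falls inside the emoji's surrogate pair.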

String Normalization

The same visual character can be encoded multiple ways. Unicode normalization resolves this before comparison or storage:

// Precomposed (NFC): single code point é = U+00E9
const nfc = "\u00E9";

// Decomposed (NFD): e + combining accent = U+0065 + U+0301
const nfd = "e\u0301";

nfc === nfd          // false: different code unit sequences
nfc.length           // 1
nfd.length           // 2
nfc.normalize('NFC') === nfd.normalize('NFC')  // true

// NFC: Canonical Decomposition, followed by Canonical Composition (most common for storage)
// NFD: Canonical Decomposition (good for processing accents)
// NFKC: Compatibility Decomposition + Composition (for search/comparison)
// NFKD: Compatibility Decomposition

// NFKC normalizes compatibility equivalents:
"\uFB01".normalize('NFKC')  // 'fi' — fi ligature → two letters
"２".normalize('NFKC')      // '2' — fullwidth digit → ASCII digit

Always normalize before:

- Comparing strings that may come from different sources
- Storing strings in a database (choose NFC for web content)
- Measuring string length for display purposes

// Production pattern: normalize on input
function normalizeInput(str) {
  return str.normalize('NFC').trim();
}
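
As an illustration of why NFD is handy for processing accents, the sketch below decomposes a string and drops the combining marks. `stripDiacritics` is a hypothetical helper, and the approach is lossy: it suits loose matching (for example search keys), not storage, and it is not linguistically correct for every language.

```javascript
// Hypothetical helper: remove combining marks for accent-insensitive matching.
function stripDiacritics(str) {
  return str
    .normalize("NFD")          // split precomposed letters: é becomes e + U+0301
    .replace(/\p{M}/gu, "")    // drop all combining marks (Unicode category M)
    .normalize("NFC");         // recompose whatever is left
}

console.log(stripDiacritics("caf\u00E9"));   // "cafe"
console.log(stripDiacritics("cafe\u0301"));  // "cafe"

// Precomposed and decomposed inputs now compare equal.
console.log(stripDiacritics("\u00E9") === stripDiacritics("e\u0301"));  // true
```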

Practical: Safe String Operations

Safe string length for UI

function graphemeLength(str) {
  return [...new Intl.Segmenter().segment(str)].length;
}

// For older environments without Intl.Segmenter:
function codePointLength(str) {
  return [...str].length;  // better than .length but ignores combining marks
}
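
The two functions above can be combined behind a feature check. The sketch below assumes you want one shared segmenter (constructing a new one on every call is avoidable work) and a code point fallback when `Intl.Segmenter` is missing; `displayLength` and `sharedSegmenter` are illustrative names.

```javascript
// Create the segmenter once, and only if the runtime supports it.
const sharedSegmenter =
  typeof Intl !== "undefined" && typeof Intl.Segmenter === "function"
    ? new Intl.Segmenter(undefined, { granularity: "grapheme" })
    : null;

// Hypothetical helper: grapheme count where possible, code point count otherwise.
function displayLength(str) {
  if (sharedSegmenter) {
    let count = 0;
    for (const _segment of sharedSegmenter.segment(str)) {
      count++;
    }
    return count;
  }
  // Fallback: counts code points, so combining marks and ZWJ sequences are overcounted.
  return [...str].length;
}

console.log(displayLength("Hi 😀"));  // 4
```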

Safe substring/truncation

function safeSubstring(str, start, end) {
  const segments = [...new Intl.Segmenter().segment(str)];
  return segments
    .slice(start, end)
    .map(s => s.segment)
    .join('');
}

safeSubstring("Hi 😀 there", 0, 5)  // "Hi 😀 "

Safe includes/indexOf for non-BMP

// String.prototype.includes works correctly:
"Hi 😀 there".includes("😀")   // true

// indexOf returns a code unit index, so use it with care:
"Hi 😀 there".indexOf("😀")    // 3 (matches the code point index only because no supplementary character precedes it)
"😀 Hi".indexOf("H")           // 3 — because emoji took indices 0 and 1

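If you need the position in code points rather than code units, one possible approach is to count the code points in the prefix before the match. `codePointIndexOf` is a hypothetical helper, and it assumes the search string does not begin in the middle of a surrogate pair.

```javascript
// Hypothetical helper: like indexOf, but the result counts code points.
function codePointIndexOf(haystack, needle) {
  const unitIndex = haystack.indexOf(needle);  // code unit offset, or -1
  if (unitIndex === -1) return -1;
  // Spread iterates by code point, so the prefix length is a code point count.
  return [...haystack.slice(0, unitIndex)].length;
}

console.log("😀 Hi".indexOf("H"));            // 3 (code units)
console.log(codePointIndexOf("😀 Hi", "H"));  // 2 (code points)
console.log(codePointIndexOf("😀 Hi", "x"));  // -1
```
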
Reversing a string

// Wrong (breaks surrogate pairs and combining marks):
"café 😀".split('').reverse().join('')
// → '\uDE00\uD83D éfac': the two surrogates end up in reverse order, which is invalid and typically renders as '�� éfac'

// Correct — reverse grapheme clusters:
function reverseString(str) {
  return [...new Intl.Segmenter().segment(str)]
    .map(s => s.segment)
    .reverse()
    .join('');
}

reverseString("café 😀")  // "😀 éfac" — é preserved as single grapheme

Template Literals and Unicode

Template literals handle Unicode correctly since they use the same JS string model. But be careful with tagged templates processing raw strings:

// Tagged template raw strings see \uXXXX escapes literally:
String.raw`\u2014`  // "\\u2014" — backslash + u + 2014

// Untagged template: escape is interpreted
`\u2014`  // "—" — em dash

// Unicode code point escape in templates:
`\u{1F600}`  // "😀" (the \u{...} code point escape also works in template literals)

Encoding and Decoding

// TextEncoder / TextDecoder (Web API + Node.js)
const encoder = new TextEncoder();  // always UTF-8
const decoder = new TextDecoder();  // UTF-8 by default

const bytes = encoder.encode("Hello 😀");
// Uint8Array: [72, 101, 108, 108, 111, 32, 240, 159, 152, 128]
// Note: emoji is 4 bytes in UTF-8

const back = decoder.decode(bytes);  // "Hello 😀"

// For other encodings:
const win1252 = new TextDecoder('windows-1252');

Use the SymbolFYI Character Counter to see the code point, UTF-16 encoding, and grapheme cluster breakdown of any character or string. The Encoding Converter shows byte representations across UTF-8, UTF-16, and UTF-32.

Quick Reference

| Task | Method | Notes |
| --- | --- | --- |
| Code point count | `[...str].length` | Better than `.length` |
| Grapheme count | `Intl.Segmenter` | Most accurate for UI |
| Iterate characters | `for...of` | Handles surrogates |
| Get code point | `codePointAt(0)` | Use instead of `charCodeAt` |
| Build from code point | `String.fromCodePoint()` | Use instead of `fromCharCode` |
| Normalize | `str.normalize('NFC')` | Always before comparison |
| Truncate safely | Custom with Segmenter | Avoid splitting sequences |

Next in Series: Python and Unicode: The Complete Developer's Guide — Python's str/bytes model, the unicodedata module, encoding/decoding, and the Unicode sandwich pattern for robust text processing.