Grapheme Clusters: Why String Length Is More Complicated Than You Think

Reference Jui 20, 2023


You probably learned that a string's length is the number of characters it contains. In practice, this assumption breaks in at least three common ways: accented letters that are stored as two code points, emoji that report a length of 2 in JavaScript, and family emoji that span 7 code points and 11 UTF-16 code units. The concept that unifies all of these is the grapheme cluster.

The Problem: String Length Lies

Consider this simple JavaScript experiment:

'café'.length        // 4 or 5? Depends on normalization
'👍'.length          // 2 (not 1!)
'👍🏽'.length         // 4 (thumbs up + medium skin tone)
'👨‍👩‍👧‍👦'.length    // 11 (family emoji)
'e\u0301'.length     // 2 (e + combining acute accent)

The family emoji 👨‍👩‍👧‍👦 is a single visual unit — one thing a user would select, delete, or count. But it reports a .length of 11 in JavaScript because it is composed of 11 UTF-16 code units. In Python 3, len('👨‍👩‍👧‍👦') returns 7 — better (Python counts code points, not code units), but still not 1.

This is the grapheme cluster problem: the user's intuitive notion of "one character" does not align with what programming languages count.
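The three levels of counting can be compared side by side. The sketch below is only illustrative: the helper name `measure` is made up for this example, and the emoji is written with escapes so that its code points stay visible.

```javascript
// Hypothetical helper: report a string's length at three levels.
function measure(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    codeUnits: str.length,                          // UTF-16 code units
    codePoints: [...str].length,                    // the string iterator yields code points
    graphemes: [...segmenter.segment(str)].length,  // user-perceived characters
  };
}

// Family emoji: man, woman, girl, boy joined by U+200D.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(measure(family));     // { codeUnits: 11, codePoints: 7, graphemes: 1 }
console.log(measure('e\u0301'));  // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```

The middle number is what Python's len() reports; the last one is what a user would say.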

What Is a Grapheme Cluster?

A grapheme cluster is a sequence of one or more Unicode code points that should be treated as a single unit of text from the user's perspective. It is defined by Unicode Standard Annex #29, the Unicode Text Segmentation specification.

The simplest grapheme cluster is a single code point with no modifiers. Most Latin letters, digits, and common symbols are single-code-point grapheme clusters. The complexity arises in three categories:

1. Combining character sequences — a base character followed by one or more combining marks:

é = e (U+0065) + ◌́ (U+0301 combining acute accent) = 1 grapheme cluster, 2 code points
ñ = n (U+006E) + ◌̃ (U+0303 combining tilde) = 1 grapheme cluster, 2 code points
ạ̄ = a + combining macron + combining dot below = 1 grapheme cluster, 3 code points

2. Emoji modifier sequences — an emoji base followed by a skin tone modifier:

👍🏽 = 👍 (U+1F44D) + 🏽 (U+1F3FD medium skin tone) = 1 grapheme cluster, 2 code points

3. Emoji ZWJ sequences — multiple emoji joined by U+200D Zero Width Joiner:

👨‍👩‍👧‍👦 = 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦 = 1 grapheme cluster, 7 code points
❤️‍🔥 = ❤ + variation selector-16 + ZWJ + 🔥 = 1 grapheme cluster, 4 code points
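As a rough illustration of the three categories, the sketch below labels a single grapheme cluster by inspecting its code points. The function name `classifyCluster` and its labels are assumptions made for this example, and the logic is a simplification: ZWJ also appears in some non-emoji scripts, and UAX #29 defines more cases than these.

```javascript
// Hypothetical classifier for one grapheme cluster.
// Checks run from most to least specific, because a ZWJ sequence
// can also contain modifiers or variation selectors.
function classifyCluster(cluster) {
  if ([...cluster].length === 1) return 'single code point';
  if (cluster.includes('\u200D')) return 'emoji ZWJ sequence';
  if (/\p{Emoji_Modifier}/u.test(cluster)) return 'emoji modifier sequence';
  if (/\p{M}/u.test(cluster)) return 'combining character sequence';
  return 'other';
}

console.log(classifyCluster('A'));                            // single code point
console.log(classifyCluster('e\u0301'));                      // combining character sequence
console.log(classifyCluster('\u{1F44D}\u{1F3FD}'));           // emoji modifier sequence
console.log(classifyCluster('\u2764\uFE0F\u200D\u{1F525}'));  // emoji ZWJ sequence
```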

Combining Characters in Depth

Unicode supports precomposed and decomposed forms of accented characters. The letter é can be represented two ways:

Form	Code Points	Description
Precomposed	U+00E9	LATIN SMALL LETTER E WITH ACUTE (single code point)
Decomposed	U+0065 + U+0301	e + combining acute accent (two code points)

Both forms look identical when rendered. Whether your string uses one or two code points depends on the normalization form. NFC (Canonical Decomposition followed by Canonical Composition) prefers precomposed forms; NFD (Canonical Decomposition) decomposes them.

The word "café" can therefore be 4 or 5 code points depending on which form is used for the é. Both are valid Unicode; they represent the same abstract text.
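A short sketch of the two forms in JavaScript, using the built-in String.prototype.normalize. The variable names are only for illustration.

```javascript
const precomposed = 'caf\u00E9';  // é as one code point (U+00E9)
const decomposed = 'cafe\u0301';  // e (U+0065) + combining acute accent (U+0301)

console.log(precomposed.length, decomposed.length);  // 4 5
console.log(precomposed === decomposed);             // false

// Normalizing both sides to the same form makes them comparable.
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC'));  // true
console.log(decomposed.normalize('NFC').length);   // 4
console.log(precomposed.normalize('NFD').length);  // 5
```

Normalizing before comparing or storing text is a common way to keep the two spellings from being treated as different strings.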

Combining characters sit in the Combining Diacritical Marks block (U+0300–U+036F) for the most common accents, with additional blocks for specialized combining marks used in phonetic notation, medieval manuscripts, and other scripts.

Emoji Sequences: A Case Study

The family emoji 👨‍👩‍👧‍👦 demonstrates how far grapheme clusters can stretch. It decomposes as:

Code Point	Character	Name
U+1F468	👨	MAN
U+200D	‍	ZERO WIDTH JOINER
U+1F469	👩	WOMAN
U+200D	‍	ZERO WIDTH JOINER
U+1F467	👧	GIRL
U+200D	‍	ZERO WIDTH JOINER
U+1F466	👦	BOY

That is 7 code points, 11 UTF-16 code units (because each emoji above U+FFFF takes 2 code units in UTF-16). A user sees and interacts with this as one character.
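The same breakdown can be produced in code. This is a minimal sketch; `toCodePoints` is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
function toCodePoints(str) {
  return [...str].map(ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(toCodePoints(family).join(' '));
// U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
console.log(family.length);  // 11 (4 emoji x 2 code units + 3 ZWJ x 1 code unit)
```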

Other notable multi-code-point visual units:
- Flag emoji: Two Regional Indicator letters, e.g., 🇺🇸 = U+1F1FA + U+1F1F8
- Keycap emoji: Digit + variation selector + combining enclosing keycap, e.g., 1️⃣ = 3 code points
- Person + profession: e.g., 👩‍💻 = U+1F469 + ZWJ + U+1F4BB (2 emoji + ZWJ)
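The flag case lends itself to a small sketch, since the Regional Indicator letters sit in alphabetical order starting at U+1F1E6. The helper name `flagFromRegionCode` is an assumption for this example, and whether the result renders as a flag depends on the platform's fonts.

```javascript
// Hypothetical helper: build a flag emoji from a two-letter region code.
// REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6; B through Z follow in order.
function flagFromRegionCode(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError('expected a two-letter region code');
  }
  const offset = 0x1F1E6 - 'A'.codePointAt(0);
  return [...code.toUpperCase()]
    .map(ch => String.fromCodePoint(offset + ch.codePointAt(0)))
    .join('');
}

const us = flagFromRegionCode('US');
console.log(us === '\u{1F1FA}\u{1F1F8}');                   // true
console.log([...us].length);                                // 2 code points
console.log(us.length);                                     // 4 UTF-16 code units
console.log([...new Intl.Segmenter().segment(us)].length);  // 1 grapheme cluster
```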

Counting Grapheme Clusters Correctly

JavaScript: Intl.Segmenter

The modern solution in JavaScript is Intl.Segmenter, available in all modern browsers and Node.js 16+:

function countGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}

countGraphemes('café')          // 4 (regardless of normalization)
countGraphemes('👍🏽')           // 1
countGraphemes('👨‍👩‍👧‍👦')   // 1
countGraphemes('Hello')         // 5

// Iterate over grapheme clusters
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
for (const { segment } of segmenter.segment('café 👋🏼')) {
  console.log(JSON.stringify(segment));
}
// "c" "a" "f" "é" " " "👋🏼"
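Each item produced by segment() also carries an index property, the UTF-16 code unit offset where that cluster starts, and the returned Segments object has a containing() method that finds the cluster covering a given offset. A brief sketch, with variable names chosen only for illustration:

```javascript
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = 'cafe\u0301 \u{1F44B}\u{1F3FC}';  // "café 👋🏼" with a decomposed é

// Code unit offset at which each grapheme cluster starts.
const segments = graphemeSegmenter.segment(text);
const offsets = [...segments].map(s => s.index);
console.log(offsets.join(', '));  // 0, 1, 2, 3, 5, 6

// Snap an arbitrary code unit offset to the cluster that contains it.
// Offset 4 points at the combining accent, which belongs to the cluster starting at 3.
const hit = segments.containing(4);
console.log(hit.index);           // 3
console.log(hit.segment.length);  // 2 (e + combining acute accent)
```

This is useful when an offset comes from elsewhere (a regex match, a stored cursor position) and has to be moved to a safe boundary before slicing.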

Python: grapheme library

Python 3 counts code points with len(), which avoids the surrogate-pair problem but still overcounts combining sequences, emoji modifier sequences, and ZWJ sequences. For accurate grapheme counting, use the third-party grapheme library:

import grapheme

# Code point count vs grapheme count
len('👨‍👩‍👧‍👦')              # 7 (code points)
grapheme.length('👨‍👩‍👧‍👦')   # 1

len('café')                     # 4 or 5
grapheme.length('café')          # 4

# Iterate over grapheme clusters
list(grapheme.graphemes('👨‍👩‍👧‍👦 hello'))
# ['👨‍👩‍👧‍👦', ' ', 'h', 'e', 'l', 'l', 'o']

# Safe slice by grapheme
grapheme.slice('café 👍🏽 world', 0, 3)   # 'caf'
grapheme.slice('café 👍🏽 world', 5, 6)   # '👍🏽'

Install with: pip install grapheme

Ruby, Swift, and others

Language	Approach
Ruby	`'café'.grapheme_clusters` returns grapheme clusters in Ruby 2.5+ (`chars` returns code points)
Swift	`"café".count` correctly counts grapheme clusters (built-in)
Go	`utf8.RuneCountInString()` counts code points; the standard library has no grapheme segmenter, so use a third-party package such as `github.com/rivo/uniseg`
Rust	`unicode-segmentation` crate provides `graphemes()` iterator
Java	`BreakIterator.getCharacterInstance()` segments by grapheme cluster (emoji sequences follow UAX #29 from JDK 20 onward)

Swift is notably user-friendly here — its String.count property returns the number of grapheme clusters by default, matching user expectation.

Common String Operations Gone Wrong

Truncation

Naively truncating by index or code point count can split a grapheme cluster:

// WRONG — may split combining character or emoji sequence
function truncateBad(str, maxLength) {
  return str.slice(0, maxLength);
}

// CORRECT — truncate by grapheme cluster
function truncateByGrapheme(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}

truncateBad('👍🏽 hello', 1)         // '\uD83D' (broken surrogate)
truncateByGrapheme('👍🏽 hello', 1)  // '👍🏽'
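Some limits are expressed in bytes rather than characters, for example a fixed-size database column. Under that assumption, one possible approach is to keep only the whole grapheme clusters that fit in the byte budget. The helper name `truncateToBytes` and the choice of UTF-8 are assumptions for this sketch.

```javascript
// Hypothetical helper: keep as many whole grapheme clusters as fit
// in maxBytes of UTF-8, never cutting inside a cluster.
function truncateToBytes(str, maxBytes) {
  const segmenter = new Intl.Segmenter();
  const encoder = new TextEncoder();
  let used = 0;
  let result = '';
  for (const { segment } of segmenter.segment(str)) {
    const size = encoder.encode(segment).length;  // UTF-8 byte length of this cluster
    if (used + size > maxBytes) break;
    used += size;
    result += segment;
  }
  return result;
}

const sample = '\u{1F44D}\u{1F3FD} hello';  // "👍🏽 hello"; the emoji alone is 8 bytes in UTF-8
console.log(truncateToBytes(sample, 7) === '');                        // true: the cluster does not fit
console.log(truncateToBytes(sample, 10) === '\u{1F44D}\u{1F3FD} h');   // true: 8 + 1 + 1 bytes
```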

Reversal

Reversing a string by splitting on code points or code units breaks combining characters and emoji sequences:

// WRONG
'café 👋🏼'.split('').reverse().join('')
// Garbled — breaks combining accents and emoji skin tones

// CORRECT
function reverseGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)]
    .map(s => s.segment)
    .reverse()
    .join('');
}

reverseGraphemes('café 👋🏼')  // '👋🏼 éfac'

Substring operations

When extracting substrings by user-visible position, always convert to grapheme cluster arrays first, operate on the array, then join:

function graphemeSubstring(str, start, end) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)]
    .slice(start, end)
    .map(s => s.segment)
    .join('');
}

Grapheme Clusters and User Interfaces

Most user interface frameworks handle grapheme clusters correctly for display and input:
- Cursor movement in a text field skips over combining characters and emoji sequences as a unit
- Backspace deletes an entire grapheme cluster
- Click-to-select highlights whole grapheme clusters
- Copy/paste preserves grapheme clusters intact

The place where developers most commonly encounter grapheme cluster bugs is in server-side validation (checking string length) and string manipulation (truncation, substring, reversal). These operations happen in code, not in the UI, so the framework's handling does not protect you.
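A server-side length check might look like the sketch below, assuming a Node.js backend. The function name `validateDisplayName`, its option names, and the limits are all hypothetical. It caps raw code units as well, since a single cluster can contain many code points.

```javascript
const validationSegmenter = new Intl.Segmenter();

// Hypothetical validator; the default limits are illustrative only.
function validateDisplayName(name, { maxGraphemes = 30, maxCodeUnits = 300 } = {}) {
  // Cheap guard first: bounds the work done by the segmenter below.
  if (name.length > maxCodeUnits) {
    return { ok: false, reason: 'too many code units' };
  }
  let count = 0;
  for (const _cluster of validationSegmenter.segment(name)) {
    count += 1;
    if (count > maxGraphemes) {
      return { ok: false, reason: 'too many characters' };
    }
  }
  return { ok: true, graphemes: count };
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(validateDisplayName(family, { maxGraphemes: 1 }).ok);  // true: one visible character
console.log(validateDisplayName('ab', { maxGraphemes: 1 }).ok);    // false: two visible characters
```

A check based on name.length alone would have rejected the family emoji at any limit below 11.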

Character counting in forms

If you have a form field with a maximum character count displayed to the user ("280 characters remaining"), count by grapheme clusters so the number shown matches what the user sees:

const textarea = document.querySelector('textarea');
const counter = document.querySelector('.char-count');
const segmenter = new Intl.Segmenter();

textarea.addEventListener('input', () => {
  const count = [...segmenter.segment(textarea.value)].length;
  counter.textContent = `${280 - count} characters remaining`;
});

The SymbolFYI Character Counter tool counts both code points and grapheme clusters so you can see the difference for any input.

Quick Reference

Scenario	Code Points	Grapheme Clusters
`A`	1	1
`é` (precomposed)	1	1
`é` (decomposed)	2	1
`👍`	1	1
`👍🏽` (with skin tone)	2	1
`👨‍👩‍👧‍👦` (family)	7	1
`🇺🇸` (flag)	2	1
`1️⃣` (keycap)	3	1
`café` (NFC)	4	4
`café` (NFD)	5	4

The rule of thumb: when you are measuring, displaying, or manipulating text based on what users see, count grapheme clusters. When you are working with encoding (bytes on the wire, code unit positions, surrogate pairs), count bytes or code units, whichever that layer uses.