SymbolFYI

Grapheme Cluster

Unicode Standard
Definition

A user-perceived character that may consist of multiple code points (e.g., a base character + combining marks, or a flag emoji).

What Is a Grapheme Cluster?

A grapheme cluster represents what a user perceives as a single character — the atomic unit of visible text from a human perspective. A grapheme cluster may consist of a single Unicode code point, or it may require multiple code points combined together. This distinction between code points and grapheme clusters is one of the most practically important concepts for developers working with text.

Unicode defines grapheme clusters precisely in Unicode Standard Annex #29 (Text Segmentation). The algorithm specifies exactly which sequences of code points form a single grapheme cluster.
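To give a feel for what such boundary rules look like, here is a deliberately simplified sketch in Python. It is not the UAX #29 algorithm: it approximates only a few rules (do not break before combining marks, emoji skin tone modifiers, or a ZWJ; do not break after a ZWJ; pair up regional indicators) and ignores Hangul, CR LF, prepended characters, and the Extended_Pictographic checks. The names `rough_clusters`, `is_extender`, and `is_regional_indicator` are hypothetical helpers invented for this illustration.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner


def is_regional_indicator(ch):
    # Regional indicator symbols A..Z occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch):
    # Approximation: combining marks (categories Mn, Mc, Me, which include
    # variation selectors) plus emoji skin tone modifiers U+1F3FB..U+1F3FF.
    return unicodedata.category(ch).startswith("M") or 0x1F3FB <= ord(ch) <= 0x1F3FF


def rough_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules only)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        joins_previous = bool(prev) and (
            is_extender(ch)            # never break before an extender
            or ch == ZWJ               # never break before a ZWJ
            or prev.endswith(ZWJ)      # simplified: never break after a ZWJ
            or (len(prev) == 1         # pair regional indicators two by two
                and is_regional_indicator(prev)
                and is_regional_indicator(ch))
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Waving hand + medium-dark skin tone, followed by " Hello"
print(len(rough_clusters("\U0001F44B\U0001F3FE Hello")))  # 7
```

For production use, a library that implements the full UAX #29 rules and tracks new Unicode versions is the safer choice; this sketch only shows the shape of the rules.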

Why Grapheme Clusters Matter

Consider the string é. Depending on how it was created, it may be:

1. One code point: U+00E9 (precomposed é)
2. Two code points: U+0065 (e) + U+0301 (combining acute accent)

Both display identically, but naive code-point-based length functions return 1 and 2 respectively. This matters for:

- String length displayed to users
- Cursor movement in text editors
- Text selection and highlighting
- Substring operations (slicing mid-grapheme causes rendering artifacts)
- Character count limits in forms
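The difference between the two encodings of é can be checked with Python's standard library alone. The sketch below uses only `len`, slicing, and `unicodedata.normalize`; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(len(precomposed))           # 1
print(len(decomposed))            # 2
print(precomposed == decomposed)  # False, although both render as é

# NFC normalization composes the pair into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Slicing by code point can cut a cluster in half.
word = "cafe\u0301"    # "café" in decomposed form, 5 code points
truncated = word[:4]   # the accent is silently dropped
print(truncated)       # cafe
```

Normalizing to NFC helps with this particular case, but it does not make code points and grapheme clusters equivalent in general: many clusters, such as emoji sequences, have no precomposed form.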

Types of Multi-Code-Point Grapheme Clusters

Base + Combining Marks

ą = a (U+0061) + combining ogonek (U+0328)

Emoji + Skin Tone Modifier

👋🏾 = waving hand (U+1F44B) + medium-dark skin tone (U+1F3FE)

ZWJ Sequences

👨‍💻 = man (U+1F468) + ZWJ (U+200D) + laptop (U+1F4BB)
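When debugging sequences like this, it can help to list the code points behind what looks like one character. The following is a small sketch using the standard `unicodedata` module; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(text):
    """Return a (code point label, Unicode name) pair for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Man + zero width joiner + laptop, written with escapes to avoid
# depending on how the source file itself is encoded.
man_technologist = "\U0001F468\u200D\U0001F4BB"

print(len(man_technologist))  # 3 code points, one perceived character
for label, name in describe(man_technologist):
    print(label, name)
```

The same helper works for the combining-mark and skin-tone examples above.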

Regional Indicator Sequences (Flags)

🇺🇸 = regional indicator U (U+1F1FA) + regional indicator S (U+1F1F8)
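Because regional indicator symbols are laid out in alphabetical order starting at U+1F1E6 (regional indicator A), a flag sequence can be built from a two-letter country code with simple arithmetic. This is a minimal sketch; `flag` is a hypothetical helper, and whether the result renders as a flag depends on the platform's fonts.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # code point of regional indicator symbol A


def flag(country_code):
    """Map a two-letter code such as 'US' to its regional indicator pair."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter ASCII country code")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code
    )


us = flag("US")
print([f"U+{ord(ch):04X}" for ch in us])  # ['U+1F1FA', 'U+1F1F8']
print(len(us))                            # 2 code points, one flag
```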

Hangul Jamo

In Korean, individual consonant and vowel jamo can be combined into a syllable block. For example, the conjoining jamo ᄒ (U+1112) + ᅡ (U+1161) + ᆫ (U+11AB) form 한, which is one grapheme cluster.
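As an illustration, the conjoining jamo sequence U+1112, U+1161, U+11AB (chosen here as an example) is three code points, and NFC normalization composes it into the single precomposed syllable U+D55C. The sketch uses only the standard `unicodedata` module.

```python
import unicodedata

# Leading consonant, vowel, trailing consonant (conjoining jamo)
jamo = "\u1112\u1161\u11AB"

syllable = unicodedata.normalize("NFC", jamo)

print(len(jamo))                 # 3
print(len(syllable))             # 1
print(f"U+{ord(syllable):04X}")  # U+D55C
```

Both forms are a single grapheme cluster; normalization only changes how many code points represent it.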

Grapheme Cluster Iteration in Code

JavaScript:

// Wrong: iterates by code points (breaks emoji sequences)
const text = '👋🏾 Hello';
console.log([...text].length);  // 8 (code points)

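// Better: Intl.Segmenter (ECMAScript Internationalization API, ECMA-402)
// 'grapheme' is the default granularity; it is spelled out here for clarity
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });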
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.length);  // 7 (grapheme clusters)
console.log(graphemes[0].segment);  // '👋🏾' (one cluster)

// Get visible character count
function visibleLength(str) {
  return [...new Intl.Segmenter().segment(str)].length;
}

Python:

# The standard library has no built-in grapheme cluster support
# Use the third-party 'grapheme' package: pip install grapheme
import grapheme

text = '👋🏾 Hello'
print(grapheme.length(text))         # 7
print(list(grapheme.graphemes(text))) # ['👋🏾', ' ', 'H', 'e', 'l', 'l', 'o']

# Or use the third-party 'regex' module (pip install regex);
# \X matches one extended grapheme cluster
import regex
clusters = regex.findall(r'\X', text)
print(len(clusters))   # 7
print(clusters[0])     # '👋🏾'

Grapheme Clusters vs. Characters vs. Bytes

| Concept | What it counts | Example for é (NFD) |
|---|---|---|
| Bytes (UTF-8) | Storage units | 3 bytes |
| Code units (UTF-16) | 16-bit units | 2 code units |
| Code points | Unicode numbers | 2 code points |
| Grapheme clusters | Perceived characters | 1 cluster |
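The first three rows can be reproduced with Python's standard library, as sketched below. The grapheme cluster count is not computed here because the standard library has no segmenter for it; that row needs a UAX #29 implementation.

```python
import unicodedata

# é in decomposed form (NFD): e + combining acute accent
text = unicodedata.normalize("NFD", "\u00e9")

utf8_bytes = len(text.encode("utf-8"))             # e is 1 byte, U+0301 is 2
utf16_units = len(text.encode("utf-16-le")) // 2   # 2 bytes per code unit
code_points = len(text)                            # Python str counts code points

print(utf8_bytes)   # 3
print(utf16_units)  # 2
print(code_points)  # 2
```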
