Unicode Text Segmentation
Unicode text segmentation defines algorithms for breaking text into meaningful units: grapheme clusters (user-perceived characters), words, sentences, and lines. These boundaries cannot be found by looking for spaces or punctuation alone; they depend on Unicode character properties and script-specific rules. Grapheme cluster, word, and sentence boundaries are specified in Unicode Standard Annex #29 (UAX #29), and line breaking is specified in UAX #14.
Grapheme Clusters
A grapheme cluster is the smallest unit of text as perceived by the user. It may consist of multiple Unicode code points:
Simple character: 'A' = [U+0041]
Accented: 'é' = [U+00E9] or [U+0065, U+0301]
Emoji: '😀' = [U+1F600]
Keycap: '1️⃣' = [U+0031, U+FE0F, U+20E3]
Family emoji: '👨👩👧👦' = [U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466]
Flag (regional indicator pair): '🇺🇸' = [U+1F1FA, U+1F1F8]
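The same examples can be inspected at the code point level. The following is a minimal Python sketch using only the standard library; the `samples` dictionary and the `code_points` helper are names made up for this illustration. It shows that `len()` counts code points rather than user-perceived characters, and that the two spellings of 'é' are different strings that become equal after NFC normalization.

```python
import unicodedata

# The example strings from the list above, written with escapes so that
# invisible code points (U+FE0F, U+200D) are explicit.
samples = {
    "simple": "A",
    "precomposed": "\u00e9",
    "decomposed": "e\u0301",
    "emoji": "\U0001F600",
    "keycap": "1\ufe0f\u20e3",
    "family": "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466",
    "flag": "\U0001F1FA\U0001F1F8",
}


def code_points(s):
    """Format each code point of s as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in s]


for label, s in samples.items():
    # len() counts code points, not user-perceived characters
    print(f"{label:12} len={len(s)} {code_points(s)}")

# The two spellings of 'é' differ as strings but are canonically equivalent
same_after_nfc = (
    unicodedata.normalize("NFC", samples["decomposed"]) == samples["precomposed"]
)
print(same_after_nfc)
```

Each of these strings is a single grapheme cluster, yet the reported lengths range from 1 to 7.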
JavaScript: Intl.Segmenter
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
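// Family emoji written with explicit escapes so the zero-width joiners (U+200D) survive copy and paste
const text = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}café';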
const graphemes = [...segmenter.segment(text)].map(s => s.segment);
// ['👨👩👧👦', 'c', 'a', 'f', 'é']
console.log(graphemes.length); // 5
// Grapheme-aware string reversal
function reverseString(str) {
  const seg = new Intl.Segmenter();
  return [...seg.segment(str)].map(s => s.segment).reverse().join('');
}
reverseString('café') // 'éfac'
Python: grapheme package
import grapheme
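# Family emoji written with explicit escapes so the zero-width joiners (U+200D) are preserved
text = '\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466café'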
grapheme.length(text) # 5
list(grapheme.graphemes(text)) # ['👨👩👧👦', 'c', 'a', 'f', 'é']
# Grapheme-safe slicing
grapheme.slice(text, 0, 2) # '👨👩👧👦c'
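For contrast, here is a small standard-library sketch of what goes wrong when the same operations are done by code point instead of by grapheme cluster. The variable names are illustrative only.

```python
decomposed = "cafe\u0301"  # 'café' spelled as e + U+0301 (combining acute)
flag = "\U0001F1FA\U0001F1F8"  # two regional indicators forming one flag

# Reversing by code point moves the accent to the front, detached from its 'e'
naive_reversed = decomposed[::-1]

# Slicing by code point keeps only half of the flag
half_flag = flag[:1]

print([f"U+{ord(c):04X}" for c in naive_reversed])
# ['U+0301', 'U+0065', 'U+0066', 'U+0061', 'U+0063']
print([f"U+{ord(c):04X}" for c in half_flag])
# ['U+1F1FA']
```

The reversed string starts with a stray combining mark, and the sliced flag is a lone regional indicator that no longer renders as a flag.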
Word Segmentation
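Word boundaries are defined by the word boundary rules in UAX #29. By default, an apostrophe between letters stays inside the word (so contractions such as "don't" are kept whole), a hyphen splits a word, and scripts written without spaces rely on dictionary-based tailoring in the implementation: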
const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });
const words = [...wordSegmenter.segment("Don't panic!")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ["Don't", 'panic']
// Japanese word segmentation (no spaces)
const jaSegmenter = new Intl.Segmenter('ja', { granularity: 'word' });
[...jaSegmenter.segment('日本語のテキスト')]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ['日本語', 'の', 'テキスト'] (approximately)
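To see what the word boundary rules are doing for the English example, it can help to compare a few naive tokenizers. The sketch below uses only Python's `re` module. The third pattern loosely mimics the idea that an apostrophe between letters does not break a word; it is an approximation for illustration, not an implementation of UAX #29.

```python
import re

text = "Don't panic!"

# Whitespace splitting leaves punctuation attached to the word
by_space = text.split()

# \w+ drops punctuation but also cuts the contraction at the apostrophe
by_word_chars = re.findall(r"\w+", text)

# Allowing an apostrophe between letters keeps the contraction together
by_mid_letter = re.findall(r"\w+(?:['\u2019]\w+)*", text)

print(by_space)       # ["Don't", 'panic!']
print(by_word_chars)  # ['Don', 't', 'panic']
print(by_mid_letter)  # ["Don't", 'panic']
```

Only the last pattern matches the segmenter's result for this input, and it still does nothing for text without spaces, such as the Japanese example.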
Sentence Segmentation
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
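const paragraph = 'Dr. Smith visited Washington D.C. last Tuesday. It was hot.';
[...sentenceSegmenter.segment(paragraph)].map(s => s.segment);
// Output depends on the engine's ICU data. The default UAX #29 rules do not know
// that "Dr." is an abbreviation, so a break after it is typical:
// ['Dr. ', 'Smith visited Washington D.C. last Tuesday. ', 'It was hot.']
// No break occurs after "D.C." because the next word starts with a lowercase letter.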
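The difficulty with abbreviations can be illustrated with two regex-based splitters in Python. Both are rough sketches written for this comparison, not implementations of UAX #29: the first splits after any sentence-ending punctuation, and the second adds a "next character is uppercase" check, which is loosely similar to the rule that prevents a break before a lowercase word.

```python
import re

text = "Dr. Smith visited Washington D.C. last Tuesday. It was hot."

# Naive rule: split at whitespace that follows '.', '!' or '?'
naive = re.split(r"(?<=[.!?])\s+", text)

# Refinement: only split when the next character is an uppercase ASCII letter
refined = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive)
# ['Dr.', 'Smith visited Washington D.C.', 'last Tuesday.', 'It was hot.']
print(refined)
# ['Dr.', 'Smith visited Washington D.C. last Tuesday.', 'It was hot.']
```

The refinement fixes the split after "D.C." but still breaks after "Dr.", because telling a title from a sentence end needs abbreviation data that neither pattern has.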
Line Breaking
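The Unicode Line Breaking Algorithm (UAX #14) determines where text may be broken across lines. This differs from word segmentation: break opportunities also occur at positions other than spaces, for example after a hyphen or between CJK ideographs, and some spaces (such as U+00A0 NO-BREAK SPACE) prohibit a break: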
# The 'uniseg' package is one Python implementation of UAX #14
from uniseg.linebreak import line_break_units
units = list(line_break_units('hello world'))
# ['hello ', 'world']: a line break is allowed at each boundary between units
Browsers implement the Unicode line breaking algorithm natively for text rendering.
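The idea of a "break opportunity" can be sketched with a toy function. The rules below (break after a space or a hyphen, never after a no-break space) are a deliberately tiny subset chosen for illustration, and `toy_break_opportunities` is a hypothetical helper, not part of any library and not an implementation of UAX #14.

```python
def toy_break_opportunities(text):
    """Return indices i where a line break is allowed before text[i].

    Toy rules only: a break is allowed after a space or a hyphen-minus.
    U+00A0 NO-BREAK SPACE is not in the allowed set, so it never
    produces an opportunity.
    """
    positions = []
    for i in range(1, len(text)):
        prev = text[i - 1]
        if prev in (" ", "-") and text[i] != " ":
            positions.append(i)
    return positions


print(toy_break_opportunities("hello world"))      # [6]
print(toy_break_opportunities("well-known fact"))  # [5, 11]
print(toy_break_opportunities("10\u00a0km away"))  # [6]
```

The real algorithm assigns every code point a line breaking class and applies an ordered list of pair rules, which is why a library or the browser's own layout engine is the better choice in practice.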
Why Custom Segmentation Fails
Naive approaches to text segmentation break for common scripts:
- Thai, Khmer, Burmese: No spaces between words; word detection requires dictionary lookup
- Chinese, Japanese: No spaces; requires language model or dictionary
- Arabic: Letters join cursively and can carry combining vowel marks (harakat), so splitting between code points can separate a mark from its base letter
- Indic scripts: Consonant clusters (akshara) span multiple code points
Prefer Intl.Segmenter (JavaScript), ICU4C/ICU4J (C/C++ and Java), or, for grapheme clusters in Python, the grapheme package over custom splitting logic.
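A few of the failure modes listed above can be reproduced with the Python standard library alone. This is a small sketch with illustrative variable names. It shows whitespace splitting on Japanese, code-point iteration over a Devanagari akshara, and an Arabic letter followed by a combining vowel mark.

```python
import unicodedata

# Japanese has no spaces, so whitespace splitting returns the whole phrase
japanese = "日本語のテキスト"
tokens = japanese.split()
print(len(tokens))  # 1

# Devanagari 'kshi': consonant + virama + consonant + vowel sign, one akshara
akshara = "\u0915\u094d\u0937\u093f"
akshara_categories = [unicodedata.category(ch) for ch in akshara]
print(len(akshara), akshara_categories)  # 4 ['Lo', 'Mn', 'Lo', 'Mc']

# Arabic kaf followed by fatha, a combining vowel mark
kaf_fatha = "\u0643\u064e"
has_combining_mark = unicodedata.combining(kaf_fatha[1]) != 0
print(len(kaf_fatha), has_combining_mark)  # 2 True
```

In each case the code-point view disagrees with what a reader perceives as one word or one character.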