SymbolFYI

Grapheme Segmentation (UAX #29)

Programming & Dev

Definisi

The Unicode algorithm for splitting text into user-perceived characters, handling emoji sequences, combining marks, etc.

Unicode Text Segmentation

Unicode text segmentation defines algorithms for breaking text into meaningful units: grapheme clusters (user-perceived characters), words, sentences, and lines. These boundaries are not simply space or punctuation characters—they depend on Unicode character properties and script-specific rules defined in Unicode Standard Annex #29.

Grapheme Clusters

A grapheme cluster is the smallest unit of text as perceived by the user. It may consist of multiple Unicode code points:

Simple character:  'A' = [U+0041]
Accented:          'é' = [U+00E9] or [U+0065, U+0301]
Emoji:             '😀' = [U+1F600]
Keycap:            '1️⃣' = [U+0031, U+FE0F, U+20E3]
Family emoji:      '👨‍👩‍👧‍👦' = [U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466]
Flagged:           '🇺🇸' = [U+1F1FA, U+1F1F8]

JavaScript: Intl.Segmenter

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = '👨‍👩‍👧‍👦café';

const graphemes = [...segmenter.segment(text)].map(s => s.segment);
// ['👨‍👩‍👧‍👦', 'c', 'a', 'f', 'é']
console.log(graphemes.length);  // 5

// Grapheme-aware string reversal
function reverseString(str) {
  const seg = new Intl.Segmenter();
  return [...seg.segment(str)].map(s => s.segment).reverse().join('');
}
reverseString('café')  // 'éfac'

Python: grapheme package

import grapheme

text = '👨‍👩‍👧‍👦café'
grapheme.length(text)          # 5
list(grapheme.graphemes(text)) # ['👨‍👩‍👧‍👦', 'c', 'a', 'f', 'é']

# Grapheme-safe slicing
grapheme.slice(text, 0, 2)     # '👨‍👩‍👧‍👦c'

Word Segmentation

Word boundaries are defined by Unicode Word Boundary Rules (UAX #29). They account for apostrophes in contractions, hyphens, and script-specific rules:

const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });
const words = [...wordSegmenter.segment("Don't panic!")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ["Don't", 'panic']

// Japanese word segmentation (no spaces)
const jaSegmenter = new Intl.Segmenter('ja', { granularity: 'word' });
[...jaSegmenter.segment('日本語のテキスト')]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ['日本語', 'の', 'テキスト'] (approximately)

Sentence Segmentation

const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const text = 'Dr. Smith visited Washington D.C. last Tuesday. It was hot.';

[...sentenceSegmenter.segment(text)].map(s => s.segment);
// ['Dr. Smith visited Washington D.C. last Tuesday. ', 'It was hot.']

Line Breaking

Unicode Line Breaking Algorithm (UAX #14) determines where text may be broken across lines. This differs from word segmentation—line breaks can occur at positions other than spaces:

# The 'uam' package implements UAX #14 in Python
from uam import linebreak

opportunities = linebreak.find_opportunities('hello world')
# Returns positions where line breaks are allowed

Browsers implement the Unicode line breaking algorithm natively for text rendering.

Why Custom Segmentation Fails

Naive approaches to text segmentation break for common scripts:

Thai, Khmer, Burmese: No spaces between words; word detection requires dictionary lookup
Chinese, Japanese: No spaces; requires language model or dictionary
Arabic: Connecting letters make character-level iteration non-trivial
Indic scripts: Consonant clusters (akshara) span multiple code points

Always prefer Intl.Segmenter (JavaScript), icu4c/icu4j (C++/Java), or the grapheme package (Python) over custom splitting logic.

Istilah Terkait