SymbolFYI

Unicode Collation

Definition

Sorting text according to language-specific rules using the Unicode Collation Algorithm (UCA, UTS #10).

Unicode Collation Algorithm

The Unicode Collation Algorithm (UCA), defined in Unicode Technical Standard #10, provides a standardized method for comparing and sorting Unicode strings in a language-sensitive manner. Correct Unicode sorting requires more than byte-order or code-point-order comparison: it must handle accented characters, case differences, punctuation, numbers, and script-specific rules.

Why Simple String Comparison Fails

# Code-point order is only accidentally right for natural language
words = ['cote', 'côte', 'coté', 'côté']
sorted(words)  # ['cote', 'coté', 'côte', 'côté'] (acceptable for English)

# But German ß (U+00DF) has a higher code point than every ASCII letter:
words = ['Straße', 'Strasse', 'Strauch']
sorted(words)  # ['Strasse', 'Strauch', 'Straße'] (wrong in German)

# German: ß should sort as 'ss', placing 'Straße' with 'Strasse':
# ['Strasse', 'Straße', 'Strauch']

The Four-Level Comparison

UCA compares strings at up to four levels:

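Primary weight: Base character identity (a ≠ b; a and á are equal at this level)
Secondary weight: Accents and diacritics (a ≠ á), consulted only when primary weights tie
Tertiary weight: Case and variant forms (a ≠ A), consulted only when the first two levels tie
Quaternary weight: A tie-breaker for punctuation and other variable characters when they are ignored at the first three levels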

This allows sorting that groups accented variants near their base characters, with case differences being a minor distinction.
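
To make the level ordering concrete, here is a minimal sketch of a three-level sort key built with the standard library only. It is a toy model, not the UCA: it has no weight table, no contractions, and no quaternary level, and the name `toy_sort_key` is hypothetical. It only illustrates how accents and case act as successive tie-breakers.

```python
import unicodedata

def toy_sort_key(text):
    """Toy three-level key: base letters, then accents, then case.

    Illustration only. Real UCA keys come from a weight table (DUCET/CLDR).
    """
    nfd = unicodedata.normalize("NFD", text)
    primary, secondary, tertiary = [], [], []
    for ch in nfd:
        if unicodedata.category(ch) == "Mn":
            # Combining mark: record it against the preceding base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())      # level 1: base letter, case ignored
        secondary.append(())            # level 2: accents on this letter
        tertiary.append(ch.isupper())   # level 3: lowercase before uppercase
    return (primary, secondary, tertiary)

words = ["côté", "Cote", "cote", "coté", "côte"]
print(sorted(words, key=toy_sort_key))
# ['cote', 'Cote', 'coté', 'côte', 'côté']
```

Because the primary list is compared first, 'côté' sorts before 'cz' with this key, whereas plain code-point order puts it after.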

Language-Specific Tailoring

UCA can be tailored for specific languages. Common examples:

Swedish: ä sorts after z, not near a
Spanish: traditionally treated ch and ll as single letters
German phonebook: ä sorts as ae
Danish/Norwegian: å sorts after z
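
The effect of tailoring can be sketched with two hand-written key functions. These are simplified stand-ins for real CLDR tailorings, and the names `swedish_key` and `german_phonebook_key` are hypothetical. They handle only lowercase comparison of the listed letters.

```python
import unicodedata

# Swedish alphabet order: å, ä, ö come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    nfc = unicodedata.normalize("NFC", word).lower()
    # Unknown characters sort after all known letters in this sketch.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in nfc]

# German phonebook order: umlauts expand to vowel + e, ß to ss.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def german_phonebook_key(word):
    """Expand umlauts so that 'Mütze' compares as 'muetze'."""
    return unicodedata.normalize("NFC", word).lower().translate(PHONEBOOK)

swedish = ["zebra", "äpple", "apa", "öl", "åka"]
print(sorted(swedish))                   # ['apa', 'zebra', 'äpple', 'åka', 'öl']
print(sorted(swedish, key=swedish_key))  # ['apa', 'zebra', 'åka', 'äpple', 'öl']

german = ["Mütze", "Muffe", "Mueller"]
print(sorted(german, key=german_phonebook_key))  # ['Mueller', 'Mütze', 'Muffe']
```

In production code these rules should come from a collation library rather than hand-built tables, since real tailorings also cover case, accents, and contractions.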

Using Collation in Practice

Python

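import locale

names = ['Müller', 'Mueller', 'Mütze']

# System locale collation (limited: depends on the C library and installed locales)
# setlocale raises locale.Error if 'de_DE.UTF-8' is not installed
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
sorted(names, key=locale.strxfrm)

# Better: PyICU for full UCA support with CLDR tailorings
import icu
collator = icu.Collator.createInstance(icu.Locale('de'))
sorted(names, key=collator.getSortKey)

# Pure-Python alternative: pyuca implements the default UCA table only,
# with no language tailoring (Babel exposes locale data but has no collation API)
from pyuca import Collator
sorted(names, key=Collator().sort_key)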

JavaScript

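// Intl.Collator provides locale-aware, UCA-based sorting
const words = ['côte', 'coté', 'cote', 'côté'];

// Canadian French sorting: accents are compared from the end of the word
// (in current CLDR data this rule is tailored for fr-CA, not plain fr)
words.sort(new Intl.Collator('fr-CA').compare);
// ['cote', 'côte', 'coté', 'côté']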

// German phonebook sort (ä as ae)
const de = new Intl.Collator('de-DE-u-co-phonebk');
['Müller', 'Mueller', 'Mütze'].sort(de.compare);

// Case-insensitive, accent-insensitive
const loose = new Intl.Collator('en', {
  sensitivity: 'base'  // Only primary differences matter
});
loose.compare('café', 'cafe')  // 0 (equal at base sensitivity)

SQL / Databases

-- MySQL: use utf8mb4 with a UCA-based collation
-- (utf8mb4_unicode_ci follows UCA 4.0.0; MySQL 8.0 adds utf8mb4_0900_ai_ci, based on UCA 9.0.0)
CREATE TABLE words (
  word VARCHAR(100) COLLATE utf8mb4_unicode_ci
);

-- PostgreSQL: specify collation per query
SELECT word FROM words ORDER BY word COLLATE "de-DE-x-icu";

-- Create a column with German collation
CREATE TABLE words (
  word TEXT COLLATE "de-DE-x-icu"
);

Collation in Search

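Search indexes that need to find "café" when the user types "cafe" require accent-insensitive matching. A common approximation is to strip combining marks from both the indexed text and the query:

# Remove combining marks for accent-insensitive comparison
# (not a full ASCII conversion: letters without a decomposition, such as ß or ø, are unchanged)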
import unicodedata

def strip_accents(text):
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')

strip_accents('café')   # 'cafe'
strip_accents('Ångström')  # 'Angstrom'
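
As a rough sketch of how such folding might back a lookup, the example below stores each term under a folded key and folds the query the same way. The names `fold` and `build_index` are hypothetical, and the folding (mark stripping plus `str.casefold`) only approximates primary-strength collation. A collation library's sort keys would be the more faithful choice.

```python
import unicodedata
from collections import defaultdict

def fold(text):
    """Strip combining marks and case so that 'CAFÉ' and 'cafe' collide."""
    nfd = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in nfd if unicodedata.category(c) != "Mn")
    return stripped.casefold()

def build_index(terms):
    """Map each folded key to the original terms that share it."""
    index = defaultdict(list)
    for term in terms:
        index[fold(term)].append(term)
    return index

index = build_index(["café", "Cafe", "CAFÉ", "cafeteria"])
print(index.get(fold("cafe"), []))  # ['café', 'Cafe', 'CAFÉ']
```

Folding at both index time and query time keeps the two sides consistent. The original spellings are kept as values so results can be displayed unchanged.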