Unicode Collation Algorithm
The Unicode Collation Algorithm (UCA), defined in Unicode Technical Standard #10, provides a standardized method for comparing and sorting Unicode strings in a language-sensitive manner. Correct Unicode sorting requires more than byte-order or code-point-order comparison: it must handle accented characters, case differences, punctuation, numbers, and script-specific rules.
Why Simple String Comparison Fails
# Code-point order gives wrong results for natural language
words = ['cote', 'côte', 'coté', 'côté']
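sorted(words)  # ['cote', 'coté', 'côte', 'côté'], acceptable for English,
# but Canadian French compares accents from the end of the word and expects
# ['cote', 'côte', 'coté', 'côté']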
# German ß is a clearer failure:
words = ['Straße', 'Strasse', 'Strand', 'Strauch']
sorted(words)  # ['Strand', 'Strasse', 'Strauch', 'Straße'], wrong in German
# German: ß sorts like 'ss', so 'Straße' belongs next to 'Strasse', before 'Strauch'
The Four-Level Comparison
UCA compares strings at up to four levels:
- Primary weight: Base character identity (a ≠ b, a = á ignoring accent)
- Secondary weight: Accents and diacritics (a = a, á ≠ a)
- Tertiary weight: Case and variant forms (A ≠ a)
- Quaternary weight: Punctuation and other variable characters, used to break ties when they are ignored at the first three levels (the "shifted" option)
Each level is consulted only when all earlier levels compare equal. This groups accented variants next to their base characters and makes case a minor, tie-breaking distinction.
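The level-by-level idea can be sketched with a toy sort key. This is only an approximation under stated assumptions: the real algorithm looks up weights in the Default Unicode Collation Element Table, whereas the hypothetical `toy_sort_key` below derives its three levels from NFD decomposition and code points, and it omits the quaternary level.

```python
import unicodedata

def toy_sort_key(text):
    """Toy three-level key: base letters, then accents, then case."""
    nfd = unicodedata.normalize('NFD', text)
    base, marks = [], []
    for ch in nfd:
        if unicodedata.combining(ch) and base:
            marks[-1] += ch        # attach the accent to the preceding base letter
        else:
            base.append(ch)
            marks.append('')
    primary = tuple(ch.casefold() for ch in base)   # level 1: letter identity
    secondary = tuple(marks)                        # level 2: accents
    tertiary = tuple(ch.isupper() for ch in base)   # level 3: lowercase first
    return (primary, secondary, tertiary)

words = ['côté', 'Cote', 'cote', 'coté', 'côte']
ordered = sorted(words, key=toy_sort_key)
print(ordered)
```

Because Python compares tuples element by element, the secondary tuple matters only when the primary tuples are equal, and the tertiary tuple only when both earlier levels tie, which mirrors how the levels are applied.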
Language-Specific Tailoring
UCA can be tailored for specific languages. Common examples:
- Swedish: ä sorts after z, not near a
- Spanish: traditionally treated ch and ll as single letters
- German phonebook: ä sorts as ae
- Danish/Norwegian: å sorts after z
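To make two of these tailorings concrete, here is a minimal pure-Python sketch. The names `swedish_key` and `german_phonebook_key` are hypothetical helpers written for illustration, not part of any library, and they cover only the rules listed above; a real tailoring (for example through ICU) handles far more cases.

```python
import unicodedata

# Swedish alphabet order: å, ä, ö are separate letters placed after z.
SWEDISH_ALPHABET = 'abcdefghijklmnopqrstuvwxyzåäö'
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    nfc = unicodedata.normalize('NFC', word).casefold()
    # Characters outside the alphabet sort after it, in code-point order.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK) + ord(ch)) for ch in nfc]

# German phonebook order: umlauts expand to vowel + e.
PHONEBOOK_EXPANSIONS = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue'})

def german_phonebook_key(word):
    """Expand umlauts before comparing; casefold() already turns ß into ss."""
    nfc = unicodedata.normalize('NFC', word).casefold()
    return nfc.translate(PHONEBOOK_EXPANSIONS)

swedish = sorted(['zebra', 'ärlig', 'apa', 'öl', 'åka'], key=swedish_key)
german = sorted(['Mütze', 'Muster', 'Müller'], key=german_phonebook_key)
print(swedish)
print(german)
```

Plain `sorted()` would place ärlig before åka and Muster before Müller, because it compares code points rather than language rules.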
Using Collation in Practice
Python
import locale
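# System locale collation (limited: results depend on the C library, and
# setlocale raises locale.Error if the locale is not installed)
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
sorted(['Müller', 'Mueller', 'Mütze'], key=locale.strxfrm)
# Better: PyICU for full UCA support, including locale tailorings
import icu
collator = icu.Collator.createInstance(icu.Locale('de'))
sorted(['Straße', 'Strasse', 'Strand'], key=collator.getSortKey)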
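# Note: Babel exposes CLDR locale data but does not provide a collator.
# A pure-Python alternative is pyuca, which implements the default UCA
# ordering without language tailoring:
import pyuca
sorted(['Straße', 'Strasse', 'Strand'], key=pyuca.Collator().sort_key)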
JavaScript
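// Intl.Collator provides locale-aware sorting (backed by CLDR/UCA data in major engines)
const words = ['côte', 'coté', 'cote', 'côté'];
// Canadian French sorting: accent differences are compared from the end of the word
words.sort(new Intl.Collator('fr-CA').compare);
// ['cote', 'côte', 'coté', 'côté']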
// German phonebook sort (ä as ae)
const de = new Intl.Collator('de-DE-u-co-phonebk');
['Müller', 'Mueller', 'Mütze'].sort(de.compare);
// Case-insensitive, accent-insensitive
const loose = new Intl.Collator('en', {
sensitivity: 'base' // Only primary differences matter
});
loose.compare('café', 'cafe') // 0 (equal at base sensitivity)
SQL / Databases
-- MySQL: utf8mb4_unicode_ci is based on UCA 4.0.0; MySQL 8.0+ also offers
-- utf8mb4_0900_ai_ci (UCA 9.0.0). MySQL ships its own collations, not ICU.
CREATE TABLE words (
word VARCHAR(100) COLLATE utf8mb4_unicode_ci
);
-- PostgreSQL: specify collation per query (ICU collations require
-- PostgreSQL 10+ built with ICU support)
SELECT word FROM words ORDER BY word COLLATE "de-DE-x-icu";
-- Create a column with German collation
CREATE TABLE words (
word TEXT COLLATE "de-DE-x-icu"
);
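The per-query COLLATE clause can also be tried without a database server. The sketch below uses SQLite only because it ships with Python's standard library; it is not one of the databases discussed above, and the collation name `toy_unicode` and its comparison function are assumptions made for illustration rather than a UCA implementation.

```python
import sqlite3
import unicodedata

def fold(text):
    """Strip combining marks and case so that 'Émile' compares like 'emile'."""
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(c for c in nfd if not unicodedata.combining(c)).casefold()

def toy_collation(a, b):
    """Three-way comparison on folded text, as sqlite3 collations require."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(':memory:')
conn.create_collation('toy_unicode', toy_collation)
conn.execute('CREATE TABLE words (word TEXT)')
conn.executemany('INSERT INTO words VALUES (?)',
                 [('zebra',), ('Émile',), ('eagle',), ('apple',)])

# Default BINARY collation compares bytes, so 'Émile' lands after 'zebra'.
binary_order = [row[0] for row in conn.execute(
    'SELECT word FROM words ORDER BY word')]
# The custom collation is selected per query, just like COLLATE above.
collated = [row[0] for row in conn.execute(
    'SELECT word FROM words ORDER BY word COLLATE toy_unicode')]
print(binary_order)
print(collated)
```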
Collation in Search
Search indexes that need to find "café" when the user types "cafe" require accent-insensitive collation:
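# Strip combining marks for accent-insensitive comparison. Letters with no
# canonical decomposition (ß, ø, đ) pass through unchanged.
import unicodedata
def strip_accents(text):
    # NFD splits 'é' into 'e' + U+0301; category 'Mn' marks the combining accent
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')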
strip_accents('café') # 'cafe'
strip_accents('Ångström') # 'Angstrom'
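One way such a function might be used in a search index is to store each term under its folded form and fold the query the same way. This is a minimal sketch: `search_key` and `build_index` are hypothetical names, and a production index would more likely rely on a primary-strength collator than on this stripping approach.

```python
import unicodedata
from collections import defaultdict

def search_key(text):
    """Fold accents and case so 'café', 'Cafe' and 'CAFE' share one key."""
    nfd = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    return stripped.casefold()

def build_index(terms):
    """Map each folded key to the original terms that produce it."""
    index = defaultdict(list)
    for term in terms:
        index[search_key(term)].append(term)
    return index

index = build_index(['café', 'Cafe', 'cafeteria', 'Ångström'])
matches = index.get(search_key('cafe'), [])
print(matches)
```

The original spellings are kept as values, so results can still be displayed with their accents.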