Unicode Collation: How to Sort Text Correctly Across Languages
Sorting text sounds trivial — until it isn't. The fundamental problem is that there is no universal ordering for characters. Swedish puts Å after Z. Spanish traditionally treats ch and ll as single letters. German sorts ö as if it were oe in some contexts and as o in others. Japanese has three scripts to interleave. What appears to be a simple comparison involves language rules, diacritic handling, case folding, and multi-character collation elements.
Sorting by raw Unicode code points produces results that are wrong for nearly every language, and even for English as soon as mixed case or accented letters appear.
Why Code Point Order Fails
// Sorting by code point value (JavaScript default; strictly, UTF-16 code unit order):
['Ångström', 'apple', 'Avocado', 'äpple', 'banana'].sort()
// ['Avocado', 'apple', 'banana', 'Ångström', 'äpple']
// Wrong for English: uppercase before lowercase, Å and ä after every ASCII letter
['résumé', 'resume', 'Résumé'].sort()
// ['Résumé', 'resume', 'résumé'] (the capitalized form is split off: R is 82, r is 114)
['cafe', 'café', 'cafeteria'].sort()
// ['cafe', 'cafeteria', 'café'] (wrong: café should sort next to cafe)
The root cause: code point order is an accident of history. A is 65, Z is 90, a is 97, so capital letters come before lowercase. Å (U+00C5, 197) comes after both Z (U+005A) and z (U+007A), which puts it behind the entire ASCII alphabet. Accented letters are scattered through the Latin character blocks based on when they were added to the standard.
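The numbers are easy to check. The following is a minimal Python sketch; Python compares strings by code point, which gives the same order as JavaScript's default sort for these particular strings.

```python
words = ["Ångström", "apple", "Avocado", "äpple", "banana"]

# Python compares str values by code point, so sorted() reproduces the
# ordering problem without any locale involved.
by_code_point = sorted(words)
print(by_code_point)
# ['Avocado', 'apple', 'banana', 'Ångström', 'äpple']

# The code points of the first letters, which decide the order here:
first_letters = {w[0]: ord(w[0]) for w in by_code_point}
print(first_letters)
# {'A': 65, 'a': 97, 'b': 98, 'Å': 197, 'ä': 228}
```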
The Unicode Collation Algorithm (UCA)
The UCA (Unicode Technical Standard #10) defines a standard algorithm for comparing strings in a culturally appropriate way. Its key concepts:
Collation elements: Each character (or sequence of characters) maps to a list of numeric collation elements, each with three levels:
- Primary weight: alphabetic ordering (a = á = â at this level)
- Secondary weight: diacritic differences (a ≠ á)
- Tertiary weight: case differences (a ≠ A)
Multi-level comparison: Two strings are first compared by primary weights only; if equal, by secondary weights; if still equal, by tertiary weights. This means resume sorts before résumé (same primary, different secondary), and cafe sorts before Cafe (same primary, same secondary, different tertiary).
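The multi-level idea can be imitated with a tuple key. The sketch below is a toy approximation for Latin text, not the real UCA: it uses no collation element table, and the function name `toy_sort_key` is made up for illustration. It derives a primary level from base letters, a secondary level from combining marks, and a tertiary level from case.

```python
import unicodedata

def toy_sort_key(text: str):
    """Three-level key loosely modeled on UCA levels (toy, Latin text only)."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [c for c in decomposed if not unicodedata.combining(c)]

    # Primary: base letters with accents and case removed.
    primary = [c.lower() for c in base]

    # Secondary: the combining marks attached to each base letter.
    secondary = []
    for c in decomposed:
        if unicodedata.combining(c) and secondary:
            secondary[-1] += c
        else:
            secondary.append("")

    # Tertiary: lowercase (0) sorts before uppercase (1).
    tertiary = [1 if c.isupper() else 0 for c in base]

    # Tuples compare element by element, so a later level only matters
    # when every earlier level is equal.
    return (primary, secondary, tertiary)

words = ["résumé", "Resume", "resume", "café", "cafe"]
print(sorted(words, key=toy_sort_key))
# ['cafe', 'café', 'resume', 'Resume', 'résumé']
```

Because the levels are compared in order, an accent difference can never outrank a difference in base letters, which is why café stays ahead of cafeteria under this key.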
Locale tailoring: The UCA base order can be tailored for specific languages. Swedish moves Å, Ä, Ö to after Z. Traditional Spanish treated ch as a single unit between c and d.
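Tailoring can be pictured as swapping the table that supplies primary weights. The following sketch hard-codes a Swedish alphabet and compares it with an accent-stripping "root-like" key. Both keys and their names are hypothetical illustrations; real tailorings come from CLDR data, not from a string constant.

```python
import unicodedata

# Swedish alphabet order: å, ä, ö are separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {letter: position for position, letter in enumerate(SWEDISH_ALPHABET)}

def root_like_key(text: str):
    """Untailored toy order: accents are ignored at the primary level."""
    decomposed = unicodedata.normalize("NFD", text)
    return [c.lower() for c in decomposed if not unicodedata.combining(c)]

def swedish_toy_key(text: str):
    """Tailored toy order: å, ä, ö get their own primary positions."""
    nfc = unicodedata.normalize("NFC", text)
    # Characters outside the alphabet sort after it, by code point.
    primary = [RANK.get(ch, len(RANK) + ord(ch)) for ch in nfc.lower()]
    tertiary = [ch.isupper() for ch in nfc]
    return (primary, tertiary)

names = ["Öl", "Zebra", "Åsa", "apple"]
print(sorted(names, key=root_like_key))    # ['apple', 'Åsa', 'Öl', 'Zebra']
print(sorted(names, key=swedish_toy_key))  # ['apple', 'Zebra', 'Åsa', 'Öl']
```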
JavaScript: Intl.Collator
const words = ['résumé', 'Resume', 'resume', 'café', 'cafe'];
// Wrong: code point order (note that sort() reorders the array in place)
words.sort();
// ['Resume', 'cafe', 'café', 'resume', 'résumé']
// Correct: locale-aware
words.sort(new Intl.Collator('en').compare);
// ['cafe', 'café', 'resume', 'Resume', 'résumé']
// Note: accented variants group near base letters; lowercase precedes uppercase by default
// Case-insensitive with diacritics collapsed
words.sort(new Intl.Collator('en', {
  sensitivity: 'base' // only primary differences matter
}).compare);
// ['cafe', 'café', 'resume', 'Resume', 'résumé']
// Members of each group now compare as equal, so they keep their previous relative order (the sort is stable)
Intl.Collator options
// sensitivity: controls which differences are considered
// 'base' — only primary (letters: a ≠ b, but a = á = A)
// 'accent' — primary + secondary (a ≠ á, but a = A)
// 'case' — primary + tertiary (a ≠ A, but a = á)
// 'variant' — all differences (default)
// caseFirst: where uppercase appears relative to lowercase
// 'upper', 'lower', or 'false' (locale default)
// numeric: sort "10" > "9" instead of "10" < "9"
// ignorePunctuation: skip punctuation in comparison
const collator = new Intl.Collator('en', {
  sensitivity: 'accent',
  caseFirst: 'upper',
  numeric: true,
  ignorePunctuation: false,
});
['file10', 'file2', 'file1'].sort(collator.compare);
// ['file1', 'file2', 'file10'] — numeric: true handles this correctly
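To see what numeric ordering has to do, here is a rough Python sketch of a natural-sort key. It is an illustration of the idea only and not the algorithm ICU uses; `natural_key` is a hypothetical helper that handles ASCII digits and nothing else.

```python
import re

def natural_key(text: str):
    """Split text into digit and non-digit runs; digit runs compare as integers."""
    key = []
    # With a capturing group, re.split() alternates: text, digits, text, ...
    for index, part in enumerate(re.split(r"([0-9]+)", text)):
        if index % 2:  # odd slots hold the captured digit runs
            key.append((1, int(part), ""))
        else:
            key.append((0, 0, part.casefold()))
    return key

files = ["file10", "file2", "file1"]
print(sorted(files))                   # ['file1', 'file10', 'file2']
print(sorted(files, key=natural_key))  # ['file1', 'file2', 'file10']
```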
Locale-specific sorting
// Swedish: Å Ä Ö come after Z
const swedish = ['Öl', 'Zebra', 'Åsa', 'apple'];
swedish.sort(new Intl.Collator('sv').compare);
// ['apple', 'Zebra', 'Åsa', 'Öl']
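// German: standard order treats ü like u (with a secondary difference)
const german = ['Müller', 'Mueller', 'Moser'];
german.sort(new Intl.Collator('de').compare);
// ['Moser', 'Mueller', 'Müller']
// Phonebook order treats ü like ue at the primary level
german.sort(new Intl.Collator('de', { collation: 'phonebk' }).compare);
// ['Moser', 'Mueller', 'Müller'] (same result here: Müller now ties with Mueller on primary weight)
// The two orders diverge when the letter after u decides:
const names = ['Müller', 'Muffler'];
names.sort(new Intl.Collator('de').compare);              // ['Muffler', 'Müller'] (u-l after u-f)
names.sort(new Intl.Collator('de-u-co-phonebk').compare); // ['Müller', 'Muffler'] (u-e before u-f)
// Japanese: kana, kanji, and Latin text are ordered by the 'ja' tailoring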
const japanese = ['東京', 'あいうえお', 'アイウエオ', 'ABC'];
japanese.sort(new Intl.Collator('ja').compare);
// Locale-aware Japanese ordering
Performance optimization with localeCompare
For large datasets, create a Collator once and reuse it:
// Slower: localeCompare may have to resolve the locale and set up a collator on every call
array.sort((a, b) => a.localeCompare(b, 'en'));
// Faster: collator created once
const collator = new Intl.Collator('en', { sensitivity: 'accent' });
array.sort(collator.compare);
// For large arrays, do any per-item preprocessing once (decorate, sort, undecorate)
// instead of inside every comparison. JavaScript exposes no collation sort keys,
// so the collator itself still runs once per comparison.
const sorted = array
  .map(item => ({ item, key: item.normalize('NFC').toLowerCase() }))
  .sort((a, b) => collator.compare(a.key, b.key))
  .map(({ item }) => item);
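The gain from computing a key once per item rather than once per comparison can be measured by counting calls. The Python sketch below uses `fold` as a stand-in for an expensive per-string step; both the helper and the counter are hypothetical, and the folding it performs is not real collation.

```python
import unicodedata
from functools import cmp_to_key

calls = {"fold": 0}

def fold(text: str) -> str:
    """Stand-in for an expensive per-string step (hypothetical helper)."""
    calls["fold"] += 1
    return unicodedata.normalize("NFC", text).casefold()

def compare(a: str, b: str) -> int:
    # Comparator style: both operands are processed on every comparison.
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)

words = ["résumé", "Resume", "cafe", "café", "resume", "Cafe", "banana", "Apple"]

calls["fold"] = 0
by_comparator = sorted(words, key=cmp_to_key(compare))
comparator_calls = calls["fold"]

calls["fold"] = 0
by_key = sorted(words, key=fold)  # key computed exactly once per item
key_calls = calls["fold"]

print(key_calls, comparator_calls)
```

Both approaches return the same order; only the amount of repeated work differs, and the gap widens as the array grows.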
Python: locale Module and PyICU
Python's built-in locale module provides locale-aware string comparison, but it requires a system locale to be installed:
import locale
# Set locale for sorting
locale.setlocale(locale.LC_COLLATE, 'en_US.UTF-8')
words = ['résumé', 'Resume', 'resume', 'café', 'cafe']
# locale-aware sort key
sorted_words = sorted(words, key=locale.strxfrm)
# e.g. ['cafe', 'café', 'resume', 'Resume', 'résumé'] with glibc; the exact order depends on the platform's locale data
# The setlocale approach has issues:
# - Not thread-safe (global state)
# - Requires system locale to be installed
# - Limited to locales available on the host OS
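Some of these issues can be contained, though not removed. The sketch below is one possible approach under stated assumptions: the helper names `collation_locale` and `sort_with_locale` are made up, the fallback to code point order is a design choice, and the code is still not thread-safe because `setlocale` changes process-wide state.

```python
import locale
from contextlib import contextmanager

@contextmanager
def collation_locale(name: str):
    """Temporarily switch LC_COLLATE; yields False if the locale is missing."""
    saved = locale.setlocale(locale.LC_COLLATE)  # no second argument: query only
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        available = True
    except locale.Error:
        available = False
    try:
        yield available
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore

def sort_with_locale(words, name="en_US.UTF-8"):
    with collation_locale(name) as available:
        if available:
            return sorted(words, key=locale.strxfrm)
    # Fallback when the host lacks the locale: plain code point order.
    return sorted(words)

print(sort_with_locale(["résumé", "Resume", "resume", "café", "cafe"]))
```

The output is deliberately not shown, since it depends on which locales the host has installed.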
For production use, PyICU (Python bindings for ICU — International Components for Unicode) is the right tool:
pip install pyicu
import icu # PyICU
# Basic locale collation
collator_en = icu.Collator.createInstance(icu.Locale('en_US'))
words = ['résumé', 'Resume', 'resume', 'café', 'cafe']
sorted_words = sorted(words, key=collator_en.getSortKey)
# ['cafe', 'café', 'resume', 'Resume', 'résumé']
# Swedish
collator_sv = icu.Collator.createInstance(icu.Locale('sv_SE'))
swedish = ['Öl', 'Zebra', 'Åsa', 'apple']
sorted(swedish, key=collator_sv.getSortKey)
# ['apple', 'Zebra', 'Åsa', 'Öl']
# German phonebook
collator_de_phonebk = icu.Collator.createInstance(
    icu.Locale('de@collation=phonebook')
)
# Collator attributes
collator_en.setAttribute(
    icu.UCollAttribute.CASE_FIRST,
    icu.UCollAttributeValue.UPPER_FIRST
)
collator_en.setStrength(icu.Collator.SECONDARY)  # compare primary + secondary weights only, so case is ignored
Without PyICU: babel for basic needs
from babel import Locale
# babel provides CLDR-based locale data but limited collation
# Useful for locale metadata, not full UCA collation
# (named de_locale so it does not shadow the stdlib locale module)
de_locale = Locale('de', 'DE')
print(de_locale.english_name) # 'German (Germany)'
PostgreSQL: ICU Collations
PostgreSQL 10+ supports ICU collations, which implement the Unicode Collation Algorithm with CLDR tailorings. Using ICU for a database's default collation requires PostgreSQL 15 or later:
-- Create a database whose default collation comes from ICU (PostgreSQL 15+)
CREATE DATABASE myapp
  WITH ENCODING = 'UTF8'
  LOCALE_PROVIDER = icu
  ICU_LOCALE = 'en-US'
  TEMPLATE = template0;
-- Create a column with specific ICU collation
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT COLLATE "en-US-x-icu",
  name_de TEXT COLLATE "de-DE-x-icu",
  name_sv TEXT COLLATE "sv-SE-x-icu"
);
-- Sort with specific collation inline
SELECT name FROM products
ORDER BY name COLLATE "sv-SE-x-icu";
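-- Equality under a predefined ICU collation
SELECT name FROM products
WHERE name = 'München' COLLATE "de-DE-x-icu";
-- Predefined ICU collations are deterministic, so this matches only 'München' itself.
-- Matching 'münchen' or 'MÜNCHEN' requires a nondeterministic collation (next section).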
ICU collation determinism in PostgreSQL
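ICU collations are deterministic by default: two strings are equal only if they are byte-for-byte identical, and the collation affects ordering alone. PostgreSQL 12+ lets you declare a collation with deterministic = false, which makes case-insensitive (or accent-insensitive) equality and unique constraints possible, at the cost of slower comparisons and, in many PostgreSQL versions, no support for pattern matching such as LIKE: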
-- Non-deterministic ICU collation (supports case-insensitive uniqueness)
CREATE COLLATION case_insensitive (
  provider = icu,
  locale = 'und-u-ks-level2', -- Unicode locale extension: strength=level2
  deterministic = false
);
CREATE TABLE users (
  email TEXT COLLATE case_insensitive UNIQUE -- case-insensitive unique constraint
);
-- Deterministic collation: ordering follows ICU rules, but equality stays byte-exact (no case folding)
CREATE COLLATION en_natural (
  provider = icu,
  locale = 'en-US-u-kn-true', -- kn=true: numeric ordering
  deterministic = true
);
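Before adding a case-insensitive unique constraint to existing data, it helps to know which rows would collide. The Python sketch below approximates level-2 equality with NFC normalization plus case folding. That is an assumption made for illustration, since real ICU comparison uses collation weights, and the helper names are hypothetical.

```python
import unicodedata
from collections import defaultdict

def case_insensitive_key(text: str) -> str:
    """Rough stand-in for level-2 equality: case ignored, accents kept."""
    return unicodedata.normalize("NFC", text).casefold()

def find_collisions(values):
    """Group values that a case-insensitive UNIQUE constraint would reject."""
    groups = defaultdict(list)
    for value in values:
        groups[case_insensitive_key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

emails = [
    "Alice@example.com",
    "alice@example.com",
    "bob@example.com",
    "ALICE@EXAMPLE.COM",
    "álice@example.com",  # differs by accent, so it does not collide
]
print(find_collisions(emails))
# [['Alice@example.com', 'alice@example.com', 'ALICE@EXAMPLE.COM']]
```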
Natural sort in PostgreSQL
-- Sort file names with numeric parts naturally:
-- file1, file2, file10 instead of file1, file10, file2
CREATE COLLATION numeric_sort (
  provider = icu,
  locale = 'en-US-u-kn-true',
  deterministic = true
);
SELECT filename FROM files
ORDER BY filename COLLATE numeric_sort;
-- file1, file2, file10, file20, file100
CLDR: The Data Behind Collation
The Unicode Common Locale Data Repository (CLDR) is the reference dataset for locale-aware operations. It defines:
- Sort orders for each locale
- Locale-specific rules (German phonebook vs. standard)
- Character class memberships
- Script ordering for mixed-script text
CLDR is what powers Intl.Collator, ICU, and most other internationalization libraries. When you pass 'sv' to Intl.Collator, the JavaScript engine looks up the Swedish sorting rules in the CLDR-derived data it ships with.
Common Pitfalls
Sorting objects by a text property
// Wrong
const products = [{name: 'Café'}, {name: 'cafe'}, {name: 'apple'}];
products.sort((a, b) => a.name > b.name ? 1 : -1); // code unit comparison; also never returns 0 for equal names
// Correct
const collator = new Intl.Collator('en');
products.sort((a, b) => collator.compare(a.name, b.name));
Normalization before collation
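// Canonically equivalent strings (NFC 'é' vs. NFD 'e' + U+0301) should compare
// as equal under Intl.Collator, but ===, Map keys, and code point sorts treat
// them as different. Normalizing keeps every layer consistent:
const normalize = str => str.normalize('NFC');
const collator = new Intl.Collator('en');
array.sort((a, b) => collator.compare(normalize(a), normalize(b)));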
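The effect of mixed normalization forms is easiest to see with a plain code point sort. A small Python sketch, using escape sequences so that the two spellings of café are unambiguous:

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point (NFC)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (NFD)

# They render identically but are different strings.
print(composed == decomposed)  # False

# In a raw sort, 'cafz' lands between the two spellings of café.
raw = sorted([composed, "cafz", decomposed])
print(raw == [decomposed, "cafz", composed])  # True

# After normalizing, both spellings are identical and sort together.
normalized = sorted(
    unicodedata.normalize("NFC", word) for word in [composed, "cafz", decomposed]
)
print(normalized == ["cafz", composed, composed])  # True
```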
Database collation vs. application collation
Sorting in the database and sorting in application code may produce different results if they use different collation implementations. For consistency, pick one layer for sorting and stick to it. For paginated sorted results, always sort in the database.
The "C" collation in PostgreSQL
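The C collation in PostgreSQL sorts by byte value, which for UTF-8 is the same as code point order. It is fast but linguistically wrong for most use cases. It becomes a database's default when the cluster or database is created with the C (or POSIX) locale; otherwise the default follows the locale chosen at creation time. If you need locale-aware sorting on a C-collated database, you must explicitly use an ICU or libc collation.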
-- Fast but linguistically wrong for non-ASCII:
SELECT * FROM users ORDER BY name COLLATE "C";
-- Correct for English:
SELECT * FROM users ORDER BY name COLLATE "en-US-x-icu";
-- Check current collation:
SELECT datcollate FROM pg_database WHERE datname = current_database();
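The claim that byte order and code point order coincide for UTF-8 can be checked directly. A short Python sketch that sorts the same words both ways:

```python
words = ["éclair", "Zebra", "😀", "Öl", "apple", "東京", "Åsa"]

# Python's default str comparison is by code point.
by_code_point = sorted(words)

# Sorting by encoded bytes mimics what a byte-wise "C" collation does.
by_utf8_bytes = sorted(words, key=lambda word: word.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # True
print(by_code_point)
# ['Zebra', 'apple', 'Åsa', 'Öl', 'éclair', '東京', '😀']
```

UTF-8 was designed so that this holds for all valid text, which is why the C collation is cheap: it needs no lookup tables, only a byte comparison.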
Use the SymbolFYI Character Counter to inspect the exact code points in strings that are sorting unexpectedly — invisible characters, combining marks, and normalization issues are common culprits.
This concludes the Symbols for Developers series. From HTML entities through regex, JavaScript string internals, Python's Unicode model, URL encoding, security threats, font optimization, encoding detection, and collation — you now have a complete toolkit for working with Unicode across the web stack.