Unicode Collation: How to Sort Text Correctly Across Languages

Web Development Symbols for Developers Jun 11, 2024



Sorting text sounds trivial — until it isn't. The fundamental problem is that there is no universal ordering for characters. Swedish puts Å after Z. Spanish traditionally treats ch and ll as single letters. German sorts ö as if it were oe in some contexts and as o in others. Japanese has three scripts to interleave. What appears to be a simple comparison involves language rules, diacritic handling, case folding, and multi-character collation elements.

Sorting by raw Unicode code points produces results that are wrong for nearly every language, and even for English as soon as mixed case or accented letters are involved.

Why Code Point Order Fails

// Sorting by code unit value (JavaScript default):
['Ångström', 'apple', 'Avocado', 'äpple', 'banana'].sort()
// ['Avocado', 'apple', 'banana', 'Ångström', 'äpple']
// Wrong for English: uppercase sorts before lowercase, and Å/ä land after every ASCII letter

['résumé', 'resume', 'Résumé'].sort()
// ['Résumé', 'resume', 'résumé']  — arbitrary order

['cafe', 'café', 'cafeteria'].sort()
// ['cafe', 'cafeteria', 'café']  — wrong: café should sort near cafe

The root cause: code point order is an accident of history. A is 65, Z is 90, a is 97, so every capital letter comes before every lowercase one. Å (U+00C5) comes after all ASCII letters, including z (U+007A). Accented letters are scattered through the Latin character blocks based on when they were added to the standard.
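The numeric values behind these results can be inspected directly. The following is a small sketch; the `codePoint` helper name is only for illustration and is not part of any library.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
const codePoint = ch =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');

const letters = ['A', 'Z', 'a', 'z', 'Å', 'ä'];
const codePointTable = letters.map(ch => `${ch} ${codePoint(ch)}`);
// ['A U+0041', 'Z U+005A', 'a U+0061', 'z U+007A', 'Å U+00C5', 'ä U+00E4']

// The default sort compares UTF-16 code units, so it follows these numbers exactly.
const byCodeUnit = ['äpple', 'Zebra', 'apple', 'Ångström'].sort();
// ['Zebra', 'apple', 'Ångström', 'äpple']
```

Printing the code points of a string that sorts unexpectedly is often the quickest way to see why it landed where it did.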

The Unicode Collation Algorithm (UCA)

The UCA (Unicode Technical Standard #10) defines a standard algorithm for comparing strings in a culturally appropriate way. Its key concepts:

Collation elements: Each character (or sequence of characters) maps to a list of numeric collation elements, each with three levels:
- Primary weight: alphabetic ordering (a = á = â at this level)
- Secondary weight: diacritic differences (a ≠ á)
- Tertiary weight: case differences (a ≠ A)

Multi-level comparison: Two strings are first compared by primary weights only; if equal, by secondary weights; if still equal, by tertiary weights. This means resume sorts before résumé (same primary, different secondary), and cafe sorts before Cafe (same primary, same secondary, different tertiary).

Locale tailoring: The UCA base order can be tailored for specific languages. Swedish moves Å, Ä, Ö to after Z. Traditional Spanish treated ch as a single unit between c and d.
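The multi-level comparison can be sketched with a toy comparator. This is not the real UCA: it assumes Latin text, derives its "weights" from NFD decomposition instead of the standard's weight tables, and ignores contractions, expansions, and tailoring. The names `toyLevels` and `toyCompare` are made up for this illustration.

```javascript
// Toy three-level comparison. NOT the real UCA: no standard weight table, no tailoring.
function toyLevels(str) {
  const primary = [], secondary = [], tertiary = [];
  for (const ch of str.normalize('NFD')) {
    if (/\p{M}/u.test(ch)) {
      // Combining mark: record it as a secondary difference on the previous letter.
      if (secondary.length > 0) secondary[secondary.length - 1] += ch;
      continue;
    }
    const lower = ch.toLowerCase();
    primary.push(lower);                  // base letter, case removed
    secondary.push('');                   // no diacritic seen yet
    tertiary.push(ch === lower ? 0 : 1);  // lowercase before uppercase
  }
  return [primary, secondary, tertiary];
}

// Lexicographic comparison of two arrays of comparable values.
function compareArrays(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

// Compare all primary weights first, then all secondary, then all tertiary.
function toyCompare(a, b) {
  const la = toyLevels(a), lb = toyLevels(b);
  for (let level = 0; level < 3; level++) {
    const c = compareArrays(la[level], lb[level]);
    if (c !== 0) return Math.sign(c);
  }
  return 0;
}

const toySorted = ['résumé', 'Resume', 'resume', 'café', 'cafe', 'Cafe'].sort(toyCompare);
// ['cafe', 'Cafe', 'café', 'resume', 'Resume', 'résumé']
```

The key design point is that a lower level is consulted only when the whole string ties at the level above it, which is why an accent near the end of a word outranks a case difference at the start.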

JavaScript: `Intl.Collator`

// Default locale sort
const fruits = ['résumé', 'Resume', 'resume', 'café', 'cafe'];

// Wrong — code point order
fruits.sort();
// ['Resume', 'cafe', 'café', 'resume', 'résumé']

// Correct: locale-aware
fruits.sort(new Intl.Collator('en').compare);
// ['cafe', 'café', 'resume', 'Resume', 'résumé']
// Note: accented variants group with their base letters; lowercase sorts before uppercase by default

// Case-insensitive with diacritics collapsed
fruits.sort(new Intl.Collator('en', {
  sensitivity: 'base'   // only primary differences matter
}).compare);
// ['cafe', 'café', 'resume', 'Resume', 'résumé']
// The collator now sees only two distinct values; ties keep their previous order (sort is stable)

`Intl.Collator` options

// sensitivity: controls which differences are considered
// 'base'     — only primary (letters: a ≠ b, but a = á = A)
// 'accent'   — primary + secondary (a ≠ á, but a = A)
// 'case'     — primary + tertiary (a ≠ A, but a = á)
// 'variant'  — all differences (default)

// caseFirst: where uppercase appears relative to lowercase
// 'upper', 'lower', or 'false' (locale default)

// numeric: sort "10" > "9" instead of "10" < "9"
// ignorePunctuation: skip punctuation in comparison

const collator = new Intl.Collator('en', {
  sensitivity: 'accent',
  caseFirst: 'upper',
  numeric: true,
  ignorePunctuation: false,
});

['file10', 'file2', 'file1'].sort(collator.compare);
// ['file1', 'file2', 'file10']  — numeric: true handles this correctly
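The effect of each `sensitivity` value can be checked by asking which pairs a collator treats as equal. A minimal sketch, assuming the `en` locale; `equalityTable` is a hypothetical helper name.

```javascript
// Pairs that differ by base letter, by accent, and by case, in that order.
const pairs = [['a', 'b'], ['a', 'á'], ['a', 'A']];

// Hypothetical helper: for each sensitivity, report which pairs compare as equal.
function equalityTable(locale = 'en') {
  const result = {};
  for (const sensitivity of ['base', 'accent', 'case', 'variant']) {
    const collator = new Intl.Collator(locale, { sensitivity });
    result[sensitivity] = pairs.map(([x, y]) => collator.compare(x, y) === 0);
  }
  return result;
}

const sensitivityTable = equalityTable();
//            a=b    a=á    a=A
// base:    [false, true,  true ]
// accent:  [false, false, true ]
// case:    [false, true,  false]
// variant: [false, false, false]
```

Note that `sensitivity` decides which strings count as equal; it does not change the relative order of strings that remain different.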

Locale-specific sorting

// Swedish: Å Ä Ö come after Z
const swedish = ['Öl', 'Zebra', 'Åsa', 'apple'];
swedish.sort(new Intl.Collator('sv').compare);
// ['apple', 'Zebra', 'Åsa', 'Öl']

// German: standard order treats ü as u (with an accent difference)
const german = ['Müller', 'Mueller', 'Muff', 'Moser'];
german.sort(new Intl.Collator('de').compare);
// ['Moser', 'Mueller', 'Muff', 'Müller']: standard order

// Phonebook order treats ü as ue (needs engine support for the collation option)
german.sort(new Intl.Collator('de', { collation: 'phonebk' }).compare);
// ['Moser', 'Mueller', 'Müller', 'Muff']: Müller now sorts with Mueller, ahead of Muff

// Japanese: complex — see below
const japanese = ['東京', 'あいうえお', 'アイウエオ', 'ABC'];
japanese.sort(new Intl.Collator('ja').compare);
// Locale-aware Japanese ordering
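Whether a given locale or collation type is actually available depends on the locale data bundled with the runtime. The sketch below shows one way to check what was resolved before relying on it; `describeCollator` is a hypothetical helper, and the values it reports will differ between environments.

```javascript
// Hypothetical helper: build a collator and report what the runtime actually resolved.
function describeCollator(locale, collation) {
  const supported = Intl.Collator.supportedLocalesOf([locale]).length > 0;
  const collator = new Intl.Collator(locale, collation ? { collation } : {});
  const resolved = collator.resolvedOptions();
  return {
    supported,                      // false if the runtime has no data for the locale
    locale: resolved.locale,        // the locale that was actually selected
    collation: resolved.collation,  // 'default' if the requested type was not honored
    collationHonored: !collation || resolved.collation === collation,
    compare: collator.compare,
  };
}

const svInfo = describeCollator('sv');
const phonebookInfo = describeCollator('de', 'phonebk');

console.log(svInfo.supported, svInfo.locale);
console.log(phonebookInfo.collation, phonebookInfo.collationHonored);
```

If `supported` or `collationHonored` comes back false, the collator silently falls back to another ordering, so the expected outputs shown above would not apply.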

Performance: prefer a reusable collator over `localeCompare`

For large datasets, create a Collator once and reuse it:

// Slower: localeCompare resolves the locale and options on each call
// (some engines cache this, but reuse is not guaranteed)
array.sort((a, b) => a.localeCompare(b, 'en'));

// Faster: collator created once
const collator = new Intl.Collator('en', { sensitivity: 'accent' });
array.sort(collator.compare);

// For large arrays with per-item preprocessing: decorate, sort, undecorate
// (Schwartzian transform). The key is computed once per item instead of once
// per comparison. Intl exposes no sort-key API, so the collator still runs on
// every comparison; only the preprocessing (here, normalization) is hoisted.
const sorted = array
  .map(item => ({ item, key: item.normalize('NFC') }))
  .sort((a, b) => collator.compare(a.key, b.key))
  .map(({ item }) => item);
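The same decorate, sort, undecorate pattern can be wrapped in a small reusable function for arrays of objects. This is a sketch under the assumption that each item yields a string; `sortByText` and its parameters are hypothetical names.

```javascript
// Hypothetical helper: sort `items` by the text returned from `getText`.
// The input array is left untouched; a new sorted array is returned.
function sortByText(items, getText, collator = new Intl.Collator('en')) {
  return items
    .map(item => ({ item, key: getText(item).normalize('NFC') }))  // decorate once per item
    .sort((a, b) => collator.compare(a.key, b.key))                // compare cached keys
    .map(({ item }) => item);                                      // undecorate
}

const catalog = [{ name: 'Café' }, { name: 'cafe' }, { name: 'apple' }];
const sortedNames = sortByText(catalog, p => p.name).map(p => p.name);
// ['apple', 'cafe', 'Café']
```

The benefit grows with the cost of `getText`: a cheap property read gains little, while an expensive lookup or transformation is no longer repeated on every comparison.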

Python: `locale` Module and `PyICU`

Python's built-in locale module provides locale-aware string comparison, but it requires a system locale to be installed:

import locale

# Set locale for sorting
locale.setlocale(locale.LC_COLLATE, 'en_US.UTF-8')

words = ['résumé', 'Resume', 'resume', 'café', 'cafe']

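# locale-aware sort key
sorted_words = sorted(words, key=locale.strxfrm)
# Typical result on Linux (glibc): ['cafe', 'café', 'resume', 'Resume', 'résumé']

# The setlocale approach has issues:
# - Not thread-safe (global state)
# - Requires system locale to be installed (setlocale raises locale.Error otherwise)
# - Limited to locales available on the host OS
# - Results vary by platform, because each OS ships its own collation data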

For production use, PyICU (Python bindings for ICU — International Components for Unicode) is the right tool:

pip install pyicu

import icu  # PyICU

# Basic locale collation
collator_en = icu.Collator.createInstance(icu.Locale('en_US'))
words = ['résumé', 'Resume', 'resume', 'café', 'cafe']
sorted_words = sorted(words, key=collator_en.getSortKey)
# ['cafe', 'café', 'resume', 'Resume', 'résumé']

# Swedish
collator_sv = icu.Collator.createInstance(icu.Locale('sv_SE'))
swedish = ['Öl', 'Zebra', 'Åsa', 'apple']
sorted(swedish, key=collator_sv.getSortKey)
# ['apple', 'Zebra', 'Åsa', 'Öl']

# German phonebook
collator_de_phonebk = icu.Collator.createInstance(
    icu.Locale('de@collation=phonebook')
)

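# Collator attributes
collator_en.setAttribute(
    icu.UCollAttribute.CASE_FIRST,
    icu.UCollAttributeValue.UPPER_FIRST
)  # uppercase before lowercase among otherwise-equal strings
# Alternatively, ignore case entirely (CASE_FIRST then has no effect):
collator_en.setStrength(icu.Collator.SECONDARY)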

Without PyICU: `babel` for basic needs

from babel import Locale

# babel provides CLDR-based locale data but limited collation
# Useful for locale metadata, not full UCA collation
locale = Locale('de', 'DE')
print(locale.english_name)  # 'German (Germany)'

PostgreSQL: ICU Collations

PostgreSQL 10+ supports ICU collations, which implement the full Unicode Collation Algorithm:

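-- Create a database whose default collation is ICU (PostgreSQL 15+).
-- LC_COLLATE and LC_CTYPE take libc locale names, not ICU collation names.
CREATE DATABASE myapp
  WITH ENCODING = 'UTF8'
       LOCALE_PROVIDER = icu
       ICU_LOCALE = 'en-US'
       LC_COLLATE = 'en_US.UTF-8'
       LC_CTYPE = 'en_US.UTF-8'
       TEMPLATE = template0;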

-- Create a column with specific ICU collation
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT COLLATE "en-US-x-icu",
  name_de TEXT COLLATE "de-DE-x-icu",
  name_sv TEXT COLLATE "sv-SE-x-icu"
);

-- Sort with specific collation inline
SELECT name FROM products
ORDER BY name COLLATE "sv-SE-x-icu";

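-- Equality with a predefined ICU collation
SELECT name FROM products
WHERE name = 'München' COLLATE "de-DE-x-icu";
-- Matches only 'München': predefined ICU collations are deterministic, so
-- equality is exact. Matching 'münchen' or 'MÜNCHEN' requires a
-- non-deterministic collation (see the next section).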

ICU collation determinism in PostgreSQL

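ICU collations are deterministic by default: two strings compare equal only if they are byte-for-byte identical, and the collation affects ordering alone. PostgreSQL 12+ also supports non-deterministic collations, under which strings that differ only in case or accents can compare equal. That is what makes a case-insensitive unique constraint possible: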

-- Non-deterministic ICU collation (supports case-insensitive uniqueness)
CREATE COLLATION case_insensitive (
  provider = icu,
  locale = 'und-u-ks-level2',  -- Unicode locale extension: strength=level2
  deterministic = false
);

CREATE TABLE users (
  email TEXT COLLATE case_insensitive UNIQUE  -- case-insensitive unique constraint
);

-- Deterministic collation (the default): affects ordering only, equality stays exact
CREATE COLLATION en_natural (
  provider = icu,
  locale = 'en-US-u-kn-true',  -- kn=true: numeric ordering
  deterministic = true
);

Natural sort in PostgreSQL

-- Sort file names with numeric parts naturally:
-- file1, file2, file10 instead of file1, file10, file2
CREATE COLLATION numeric_sort (
  provider = icu,
  locale = 'en-US-u-kn-true',
  deterministic = true
);

SELECT filename FROM files
ORDER BY filename COLLATE numeric_sort;
-- file1, file2, file10, file20, file100

CLDR: The Data Behind Collation

The Unicode Common Locale Data Repository (CLDR) is the reference dataset for locale-aware operations. It defines:

- Sort orders for each locale
- Locale-specific rules (German phonebook vs. standard)
- Character class memberships
- Script ordering for mixed-script text

CLDR is what powers Intl.Collator, ICU, and most other internationalization libraries. When you specify 'sv' to Intl.Collator, the browser consults CLDR to learn Swedish sorting rules.

Common Pitfalls

Sorting objects by a text property

// Wrong
const products = [{name: 'Café'}, {name: 'cafe'}, {name: 'apple'}];
products.sort((a, b) => a.name > b.name ? 1 : -1);  // code point comparison

// Correct
const collator = new Intl.Collator('en');
products.sort((a, b) => collator.compare(a.name, b.name));

Normalization before collation

// Intl.Collator treats canonically equivalent strings (NFC vs. NFD) as equal,
// but ===, code unit sorts, Map/Set keys, and some non-ICU collators do not.
// Normalizing first keeps every layer consistent:
const normalize = str => str.normalize('NFC');
const collator = new Intl.Collator('en');

array.sort((a, b) => collator.compare(normalize(a), normalize(b)));
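To see why normalization matters, compare the two encodings of the same visible word under different kinds of comparison. A small sketch; the collator result assumes an engine that follows ECMA-402, which requires canonically equivalent strings to compare as equal.

```javascript
const composed = 'caf\u00E9';     // é as one code point (NFC)
const decomposed = 'cafe\u0301';  // e followed by a combining acute accent (NFD)

// Strict equality compares code units, so the two forms differ.
const strictlyEqual = composed === decomposed;  // false

// A conforming Intl.Collator treats canonically equivalent strings as equal.
const collatorEqual = new Intl.Collator('en').compare(composed, decomposed) === 0;

// After normalizing both sides to the same form, even === agrees.
const normalizedEqual = composed.normalize('NFC') === decomposed.normalize('NFC');  // true

// Default sort splits the two forms apart: NFD café, then 'caffeine', then NFC café.
const mixedForms = [composed, 'caffeine', decomposed].sort();
```

Normalizing once at the boundary, for example when data is stored or received, avoids repeating the work inside every comparison.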

Database collation vs. application collation

Sorting in the database and sorting in application code may produce different results if they use different collation implementations. For consistency, pick one layer for sorting and stick to it. For paginated sorted results, always sort in the database.

The "C" collation in PostgreSQL

The C collation in PostgreSQL sorts by byte value, which for UTF-8 is the same as code point order. It is the database default only when the cluster was initialized with the C locale. It is fast but linguistically wrong for most use cases. If your database uses it and you need locale-aware sorting, you must explicitly use an ICU or libc collation.

-- Fast but linguistically wrong for non-ASCII:
SELECT * FROM users ORDER BY name COLLATE "C";

-- Correct for English:
SELECT * FROM users ORDER BY name COLLATE "en-US-x-icu";

-- Check current collation:
SELECT datcollate FROM pg_database WHERE datname = current_database();

Use the SymbolFYI Character Counter to inspect the exact code points in strings that are sorting unexpectedly — invisible characters, combining marks, and normalization issues are common culprits.

This concludes the Symbols for Developers series. From HTML entities through regex, JavaScript string internals, Python's Unicode model, URL encoding, security threats, font optimization, encoding detection, and collation — you now have a complete toolkit for working with Unicode across the web stack.