Unicode CLDR: The Database Behind Every Localized App

Unicode Deep Dive · Jul 4, 2023



Every time your app shows "3,14" to a German user instead of "3.14", or displays "25 de março de 2024" to a Brazilian Portuguese speaker, or correctly uses "1 file" vs "2 files" in English but "1 Datei" vs "2 Dateien" in German, it is doing locale-aware formatting. The data that makes this possible almost certainly comes — directly or indirectly — from the Unicode Common Locale Data Repository (CLDR).

CLDR is the world's largest repository of locale data. It underpins iOS, Android, macOS, Windows, Java, JavaScript's Intl API, Python's Babel library, Go's golang.org/x/text, and virtually every serious i18n framework. If you have written any internationalized software in the past decade, you have almost certainly used CLDR data.

What Is CLDR?

CLDR is a project of the Unicode Consortium, started in 2003, that collects and standardizes locale-specific data: the facts about how numbers, dates, times, currencies, and text behave differently in different human languages and regions.

The CLDR repository contains data for over 900 locales in XML format (LDML, the Locale Data Markup Language). It carries its own version numbers, and each release targets a specific Unicode version (CLDR 46 supports Unicode 16.0). Releases ship twice yearly.

CLDR data answers questions like:

- How do you format the number 1234567.89 in French? (1 234 567,89)
- What is the order of day, month, and year in a Japanese date? (year, month, day: 2024年3月25日)
- What does "Monday" translate to in Swahili? (Jumatatu)
- In Russian, how many plural categories does "file" need? (four: one, few, many, and other, giving файл, файла, файлов, and файла again for fractions)
- What is the decimal separator in India? (. for most locales, including hi-IN and en-IN)
- How do you write the currency symbol for the South Korean won? (₩ or KRW, depending on context)

The Locale Identifier: BCP 47

CLDR uses BCP 47 (Best Current Practice 47) language tags to identify locales. BCP 47 tags are composed of subtags:

language[-script][-region][-variant][-extension]

Examples:
en           English (no region specified)
en-US        English, United States
en-GB        English, United Kingdom
zh-Hans      Chinese, Simplified script
zh-Hant-TW   Chinese, Traditional script, Taiwan
fr-CH        French, Switzerland
pt-BR        Portuguese, Brazil
sr-Cyrl      Serbian, Cyrillic script
sr-Latn      Serbian, Latin script

The -u- extension allows specifying Unicode-specific preferences:

en-US-u-nu-latn    English, US, using Latin numerals (explicit)
ar-u-nu-arab       Arabic, using Arabic-Indic numerals
zh-u-co-pinyin     Chinese, using Pinyin collation order
en-u-ca-buddhist   English, using Buddhist calendar

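# Babel accepts BCP 47 tags but writes its own identifiers with underscores
from babel import Locale

# Swiss German shown here; fr_CH keeps the French ',' for plain numbers
loc = Locale.parse('de_CH', sep='_')  # Babel uses underscore
print(loc.get_display_name('en'))             # 'German (Switzerland)'
# Babel 2.14+ keys number_symbols by numbering system;
# older releases use loc.number_symbols['decimal'] directly
print(loc.number_symbols['latn']['decimal'])  # '.'
print(loc.number_symbols['latn']['group'])    # '\u2019' (apostrophe as thousands sep)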
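
To make the subtag structure concrete, here is a minimal sketch that splits a tag into its parts using only the Python standard library. The function name parse_tag and the layout of the returned dictionary are illustrative assumptions, not part of Babel or CLDR. The sketch ignores variants, private-use subtags, and grandfathered tags, all of which a full BCP 47 parser has to handle.

```python
def parse_tag(tag):
    """Split a simple BCP 47 tag into language, script, region, and -u- keywords."""
    subtags = tag.split('-')
    parsed = {'language': subtags[0].lower(), 'script': None,
              'region': None, 'unicode_ext': {}}
    i = 1
    # A four-letter subtag right after the language is a script (e.g. Hant)
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        parsed['script'] = subtags[i].title()
        i += 1
    # Two letters or three digits form a region (e.g. TW or 419)
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        parsed['region'] = subtags[i].upper()
        i += 1
    # After the "u" singleton, two-character keys are followed by their values
    rest = [s.lower() for s in subtags[i:]]
    if 'u' in rest:
        key = None
        for s in rest[rest.index('u') + 1:]:
            if len(s) == 1:  # another singleton starts a different extension
                break
            if len(s) == 2:
                key = s
                parsed['unicode_ext'][key] = ''
            elif key is not None:
                current = parsed['unicode_ext'][key]
                parsed['unicode_ext'][key] = f'{current}-{s}' if current else s
    return parsed


print(parse_tag('zh-Hant-TW'))
# {'language': 'zh', 'script': 'Hant', 'region': 'TW', 'unicode_ext': {}}
print(parse_tag('en-US-u-nu-latn')['unicode_ext'])   # {'nu': 'latn'}
print(parse_tag('en-u-ca-buddhist')['unicode_ext'])  # {'ca': 'buddhist'}
```

In production code, a library parser such as Babel's Locale.parse or JavaScript's Intl.Locale is the safer choice, since it also validates and canonicalizes the tag.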

Number Formatting

CLDR specifies number formatting patterns using a domain-specific notation. The key symbols:

Symbol	Meaning
0	Required digit
#	Optional digit
.	Decimal separator
,	Grouping separator (thousands)
%	Percent
¤	Currency sign
E	Scientific notation
+	Explicit plus sign

The actual characters used for decimal and grouping separators vary by locale:

Locale	Decimal	Thousands	Example
en-US	.	,	1,234,567.89
de-DE	,	.	1.234.567,89
fr-FR	,	narrow no-break space (U+202F)	1 234 567,89
de-CH	.	apostrophe (’)	1’234’567.89
hi-IN	.	, (Indian)	12,34,567.89
ar-SA	٫	٬	١٬٢٣٤٬٥٦٧٫٨٩
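
The interaction between grouping sizes and separator characters can be sketched in a few lines of plain Python. The helper names group_digits and format_decimal_sketch are hypothetical, and the sketch handles only non-negative numbers with a fixed number of fraction digits. The secondary grouping size of 2 models the Indian pattern (#,##,##0.###), while the default of 3 models the common #,##0.### pattern.

```python
def group_digits(digits, primary=3, secondary=None, sep=','):
    """Insert separators: one primary group on the right, secondary groups to its left."""
    secondary = secondary or primary
    if len(digits) <= primary:
        return digits
    head, tail = digits[:-primary], digits[-primary:]
    groups = []
    while head:
        groups.insert(0, head[-secondary:])
        head = head[:-secondary]
    return sep.join(groups + [tail])


def format_decimal_sketch(value, decimal='.', group=',', primary=3,
                          secondary=None, frac_digits=2):
    """Format a non-negative number using locale-style separators (illustrative only)."""
    integer, fraction = f'{value:.{frac_digits}f}'.split('.')
    return group_digits(integer, primary, secondary, group) + decimal + fraction


n = 1234567.89
print(format_decimal_sketch(n))                            # 1,234,567.89 (en-US style)
print(format_decimal_sketch(n, decimal=',', group='.'))    # 1.234.567,89 (de-DE style)
print(format_decimal_sketch(n, secondary=2))               # 12,34,567.89 (hi-IN style)
```

Real formatters read the grouping sizes and separators from the locale's CLDR pattern and symbol tables instead of taking them as arguments.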

Python: Babel

from babel.numbers import format_decimal, format_currency, format_percent, parse_decimal

number = 1234567.89

# Basic number formatting (format_number is deprecated in favor of format_decimal)
print(format_decimal(number, locale='en_US'))  # 1,234,567.89
print(format_decimal(number, locale='de_DE'))  # 1.234.567,89
print(format_decimal(number, locale='fr_FR'))  # 1 234 567,89 (grouped with U+202F)
print(format_decimal(number, locale='hi_IN'))  # 12,34,567.89

# Currency formatting
print(format_currency(1234.56, 'USD', locale='en_US'))   # $1,234.56
print(format_currency(1234.56, 'USD', locale='de_DE'))   # 1.234,56 $
print(format_currency(1234.56, 'EUR', locale='fr_FR'))   # 1 234,56 €
print(format_currency(1234.56, 'JPY', locale='ja_JP'))   # ￥1,235 (no decimals, fullwidth yen sign)

# Percent (the default CLDR pattern has no fraction digits, so 0.756 is rounded)
print(format_percent(0.756, locale='en_US'))   # 76%
print(format_percent(0.756, locale='tr_TR'))   # %76 (% before number in Turkish)

# Parsing numbers from locale-formatted strings
# parse_decimal returns a Decimal; parse_number only accepts integers
print(parse_decimal('1.234.567,89', locale='de_DE'))  # 1234567.89
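
Parsing is the reverse of formatting: strip the locale's grouping separator, then map its decimal separator back to a period. The following is a minimal standard-library sketch of that idea. The function name parse_localized is hypothetical, and the separators are passed in by hand here, whereas a real parser would look them up in CLDR data and also cope with the different space characters locales use for grouping.

```python
from decimal import Decimal


def parse_localized(text, decimal_sep, group_sep):
    """Convert a locale-formatted number string to a Decimal (illustrative only)."""
    # Remove grouping first so that a '.' group separator is not mistaken for a decimal point
    normalized = text.replace(group_sep, '').replace(decimal_sep, '.')
    return Decimal(normalized)


print(parse_localized('1.234.567,89', decimal_sep=',', group_sep='.'))  # 1234567.89
print(parse_localized('12,34,567.89', decimal_sep='.', group_sep=','))  # 1234567.89
```

Returning a Decimal rather than a float avoids introducing binary rounding error while parsing, which matters for currency amounts.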

JavaScript: Intl.NumberFormat

const number = 1234567.89;

// Basic formatting
new Intl.NumberFormat('en-US').format(number);   // '1,234,567.89'
new Intl.NumberFormat('de-DE').format(number);   // '1.234.567,89'
new Intl.NumberFormat('fr-FR').format(number);   // '1 234 567,89'
new Intl.NumberFormat('hi-IN').format(number);   // '12,34,567.89'
new Intl.NumberFormat('ar-SA').format(number);   // '١٬٢٣٤٬٥٦٧٫٨٩'

// Currency
new Intl.NumberFormat('en-US', {
    style: 'currency',
    currency: 'USD'
}).format(1234.56);  // '$1,234.56'

new Intl.NumberFormat('ja-JP', {
    style: 'currency',
    currency: 'JPY'
}).format(1234.56);  // '￥1,235' (fullwidth yen sign in the ja locale)

// Compact notation
new Intl.NumberFormat('en-US', { notation: 'compact' }).format(1234567);  // '1.2M'
new Intl.NumberFormat('ko-KR', { notation: 'compact' }).format(1234567);  // '123만' (10,000 units)

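// Percent (the default shows no fraction digits, so allow one explicitly)
new Intl.NumberFormat('en-US', { style: 'percent', maximumFractionDigits: 1 }).format(0.756);  // '75.6%'
new Intl.NumberFormat('tr-TR', { style: 'percent', maximumFractionDigits: 1 }).format(0.756);  // '%75,6'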

Date and Time Formatting

CLDR contains patterns for date and time formatting in every locale. Patterns use the field symbols defined in LDML (Unicode Technical Standard #35), which resemble the pattern letters of Java's SimpleDateFormat:

Pattern	Meaning	Example (en-US)
y	Year	2024
M/L	Month (numeric/standalone)	3
MMM	Month abbreviated	Mar
MMMM	Month full	March
d	Day	25
E	Day of week abbreviated	Mon
EEEE	Day of week full	Monday
h/H	Hour (12/24)	3 / 15
m	Minute	45
s	Second	30
a	AM/PM	PM
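
To show how a formatter reads these patterns, here is a minimal sketch of an interpreter for a handful of the fields in the table. It is an assumption-laden illustration: the function name format_with_pattern is hypothetical, the month and weekday names are hard-coded in English, and only y, M, d, and E are supported. Quoted text is copied through literally, which is how locale patterns such as d 'de' MMMM 'de' y embed words.

```python
import re
from datetime import date

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Matches a quoted literal, a run of one repeated letter, or other literal text
TOKEN = re.compile(r"'([^']*)'|([A-Za-z])\2*|[^A-Za-z']+")


def format_with_pattern(d, pattern):
    """Render a date with a small subset of CLDR pattern fields (English names only)."""
    out = []
    for m in TOKEN.finditer(pattern):
        token = m.group(0)
        if m.group(1) is not None:          # quoted literal; '' means an apostrophe
            out.append(m.group(1) or "'")
        elif m.group(2):                    # pattern field such as MMMM
            letter, width = m.group(2), len(token)
            if letter == 'y':
                out.append(f'{d.year % 100:02d}' if width == 2 else str(d.year))
            elif letter == 'M':
                name = MONTHS[d.month - 1]
                out.append(name if width >= 4 else name[:3] if width == 3
                           else f'{d.month:0{width}d}')
            elif letter == 'd':
                out.append(f'{d.day:0{width}d}')
            elif letter == 'E':
                name = DAYS[d.weekday()]
                out.append(name if width >= 4 else name[:3])
            else:
                raise ValueError(f'unsupported field: {token}')
        else:                               # punctuation and spaces pass through
            out.append(token)
    return ''.join(out)


d = date(2024, 3, 25)
print(format_with_pattern(d, 'MMMM d, y'))          # March 25, 2024
print(format_with_pattern(d, 'EEEE, d MMM y'))      # Monday, 25 Mar 2024
print(format_with_pattern(d, "d 'de' MMMM 'de' y")) # 25 de March de 2024
```

A CLDR-backed library does the same walk over the pattern but pulls the names, the pattern itself, and the digits from the locale's data.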

from babel.dates import format_date, format_datetime, format_time
from datetime import date, datetime

d = date(2024, 3, 25)
dt = datetime(2024, 3, 25, 15, 30, 0)

# Date formats
print(format_date(d, format='long', locale='en_US'))   # March 25, 2024
print(format_date(d, format='long', locale='de_DE'))   # 25. März 2024
print(format_date(d, format='long', locale='ja_JP'))   # 2024年3月25日
print(format_date(d, format='long', locale='ar_SA'))   # 25 مارس 2024 (Babel emits Latin digits)

# Time formats (note: 12h vs 24h varies by locale)
print(format_time(dt, format='short', locale='en_US'))  # 3:30 PM
print(format_time(dt, format='short', locale='de_DE'))  # 15:30
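print(format_time(dt, format='short', locale='zh_CN'))  # 15:30 (24-hour in mainland China)
print(format_time(dt, format='short', locale='ko_KR'))  # 오후 3:30 (day period before the time)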

# Full datetime
print(format_datetime(dt, format='full', locale='en_US'))
# Monday, March 25, 2024 at 3:30:00 PM Coordinated Universal Time

print(format_datetime(dt, format='full', locale='fr_FR'))
# lundi 25 mars 2024 à 15:30:00 Temps universel coordonné

const date = new Date(2024, 2, 25, 15, 30, 0);  // March 25, 2024

// Date
new Intl.DateTimeFormat('en-US', { dateStyle: 'long' }).format(date);
// 'March 25, 2024'

new Intl.DateTimeFormat('de-DE', { dateStyle: 'long' }).format(date);
// '25. März 2024'

new Intl.DateTimeFormat('ja-JP', { dateStyle: 'long' }).format(date);
// '2024年3月25日'

// Time
new Intl.DateTimeFormat('en-US', { timeStyle: 'short' }).format(date);
// '3:30 PM'

new Intl.DateTimeFormat('de-DE', { timeStyle: 'short' }).format(date);
// '15:30'

// Relative time
const rtf = new Intl.RelativeTimeFormat('en', { numeric: 'auto' });
rtf.format(-1, 'day');   // 'yesterday'
rtf.format(2, 'week');   // 'in 2 weeks'

const rtfFr = new Intl.RelativeTimeFormat('fr', { numeric: 'auto' });
rtfFr.format(-1, 'day'); // 'hier' (yesterday in French)
rtfFr.format(2, 'week'); // 'dans 2 semaines'

Plural Categories

One of the most underappreciated aspects of localization is plural rules. English has two forms: singular ("1 file") and plural ("2 files"). Many languages have more:

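Language	Categories	Rules (for integers)
English	2	1 = one; everything else = other
French	3	0 and 1 = one; exact multiples of 1,000,000 = many; everything else = other
German	2	1 = one; everything else = other
Russian	4	ends in 1 (not 11) = one; ends in 2-4 (not 12-14) = few; all other integers = many; fractions = other
Arabic	6	0 = zero; 1 = one; 2 = two; last two digits 03-10 = few; last two digits 11-99 = many; everything else = other
Japanese	1	always other (no plural distinction)
Polish	4	1 = one; ends in 2-4 (not 12-14) = few; all other integers = many; fractions = other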

CLDR defines six plural categories: zero, one, two, few, many, other. Not all languages use all six; every language uses "other".
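
To see what such a rule looks like when written out, here is a hand-coded sketch of the Russian rule for non-negative integers, using only the Python standard library. The function name russian_plural_category and the FORMS table are illustrative assumptions. Fractions, which CLDR assigns to the "other" category, are deliberately out of scope.

```python
def russian_plural_category(n):
    """Return the CLDR plural category for a non-negative integer in Russian."""
    last_digit, last_two = n % 10, n % 100
    if last_digit == 1 and last_two != 11:
        return 'one'
    if 2 <= last_digit <= 4 and not 12 <= last_two <= 14:
        return 'few'
    # Endings 0 and 5-9, plus the teens 11-14, all take the 'many' form
    return 'many'


FORMS = {'one': 'файл', 'few': 'файла', 'many': 'файлов'}

for n in (1, 2, 5, 11, 21, 22, 112):
    print(n, FORMS[russian_plural_category(n)])
# 1 файл, 2 файла, 5 файлов, 11 файлов, 21 файл, 22 файла, 112 файлов
```

Hard-coding rules like this does not scale past a few languages, which is why applications normally rely on the CLDR rules shipped with Babel or Intl.PluralRules.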

from babel.core import Locale

# Get the CLDR plural category for a number
def get_plural_form(n, locale_str):
    loc = Locale.parse(locale_str)
    # Locale.plural_form is a callable PluralRule that returns the category name
    return loc.plural_form(n)

print(get_plural_form(1, 'en_US'))   # one
print(get_plural_form(2, 'en_US'))   # other
print(get_plural_form(0, 'fr_FR'))   # one (French treats 0 as singular)
print(get_plural_form(1, 'ru_RU'))   # one
print(get_plural_form(2, 'ru_RU'))   # few
print(get_plural_form(5, 'ru_RU'))   # many
print(get_plural_form(1, 'ar'))      # one
print(get_plural_form(2, 'ar'))      # two
print(get_plural_form(5, 'ar'))      # few
print(get_plural_form(11, 'ar'))     # many

In JavaScript, Intl.PluralRules exposes CLDR plural rules:

// English: one vs other
const en = new Intl.PluralRules('en');
[0, 1, 2, 5, 10].map(n => `${n}: ${en.select(n)}`);
// ['0: other', '1: one', '2: other', '5: other', '10: other']

// Russian: one, few, many, other
const ru = new Intl.PluralRules('ru');
[1, 2, 5, 11, 21].map(n => `${n}: ${ru.select(n)}`);
// ['1: one', '2: few', '5: many', '11: many', '21: one']

// Practical use: select the correct string
function pluralize(n, forms, locale) {
    const plural = new Intl.PluralRules(locale);
    const form = plural.select(n);
    return `${n} ${forms[form] || forms.other}`;
}

// English
pluralize(1, { one: 'file', other: 'files' }, 'en');   // '1 file'
pluralize(5, { one: 'file', other: 'files' }, 'en');   // '5 files'

// Russian (conceptual — real app would use i18n library)
pluralize(1, { one: 'файл', few: 'файла', many: 'файлов', other: 'файлов' }, 'ru');
// '1 файл'

Collation: Sorting Text

CLDR provides collation rules for sorting text in a linguistically correct way. Byte-order sorting fails spectacularly for international text:

# Wrong: code point order (Python's default string comparison)
names = ['Ångström', 'Åse', 'Bitte', 'ändern', 'Über']
print(sorted(names))
# ['Bitte', 'Ångström', 'Åse', 'Über', 'ändern']  (every accented initial lands after 'B': wrong for German and Swedish)

# Correct: ICU collation via PyICU
# pip install PyICU
import icu

# German collation
german_coll = icu.Collator.createInstance(icu.Locale('de_DE'))
print(sorted(names, key=german_coll.getSortKey))
# ['ändern', 'Ångström', 'Åse', 'Bitte', 'Über']  — ä/Å near a, Ü near u

# Swedish collation (å, ä, and ö sort after z, in that order)
swedish_coll = icu.Collator.createInstance(icu.Locale('sv_SE'))
print(sorted(names, key=swedish_coll.getSortKey))
# ['Bitte', 'Über', 'Ångström', 'Åse', 'ändern']  (å before ä, both after z in Swedish)

// JavaScript: Intl.Collator
const names = ['Ångström', 'Åse', 'Bitte', 'ändern', 'Über'];

// German collation
const german = new Intl.Collator('de-DE');
console.log(names.sort((a, b) => german.compare(a, b)));
// ['ändern', 'Ångström', 'Åse', 'Bitte', 'Über']

// Swedish collation
const swedish = new Intl.Collator('sv-SE');
console.log(names.sort((a, b) => swedish.compare(a, b)));
// ['Bitte', 'Über', 'Ångström', 'Åse', 'ändern']

// Chinese by Pinyin
const chineseWords = ['中文', '北京', '汉字', '语言'];
const pinyin = new Intl.Collator('zh-Hans-u-co-pinyin');
console.log(chineseWords.sort((a, b) => pinyin.compare(a, b)));
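
For intuition about what a collator does at its first comparison level, the sketch below builds a crude sort key with only the Python standard library: it decomposes each string, drops combining accents, and ignores case. This is an approximation under stated assumptions, not the Unicode Collation Algorithm, and the function name crude_sort_key is hypothetical. It happens to reproduce the German order for this sample, but it cannot express language-specific tailorings such as the Swedish placement of å, ä, and ö after z.

```python
import unicodedata


def crude_sort_key(text):
    """Approximate a primary-strength key: drop accents and ignore case."""
    decomposed = unicodedata.normalize('NFD', text)
    base = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string breaks ties so the result is deterministic
    return (base.casefold(), text)


names = ['Ångström', 'Åse', 'Bitte', 'ändern', 'Über']
print(sorted(names, key=crude_sort_key))
# ['ändern', 'Ångström', 'Åse', 'Bitte', 'Über']
```

The gap between this shortcut and correct Swedish ordering is exactly what CLDR's per-locale collation tailorings fill.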

List Formatting

How you join list items varies by language:

const items = ['apples', 'oranges', 'bananas'];

new Intl.ListFormat('en', { type: 'conjunction' }).format(items);
// 'apples, oranges, and bananas'

new Intl.ListFormat('fr', { type: 'conjunction' }).format(items);
// 'apples, oranges et bananas' (no Oxford comma)

new Intl.ListFormat('de', { type: 'conjunction' }).format(items);
// 'apples, oranges und bananas'

new Intl.ListFormat('zh', { type: 'conjunction' }).format(items);
// 'apples、oranges和bananas' (uses Chinese enumeration comma 、)

from babel.lists import format_list

print(format_list(['apples', 'oranges', 'bananas'], locale='en_US'))
# 'apples, oranges, and bananas'

print(format_list(['apples', 'oranges', 'bananas'], locale='fr_FR'))
# 'apples, oranges et bananas'
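
Behind these calls, CLDR stores four small patterns per list type: one for exactly two items, plus start, middle, and end patterns for longer lists. The sketch below shows how they combine, using only the Python standard library. The LIST_PATTERNS table is a hand-copied assumption mirroring the standard English and French conjunction patterns, and join_list is a hypothetical helper, not a Babel function.

```python
# Hand-copied patterns, assumed to mirror CLDR's standard conjunction lists
LIST_PATTERNS = {
    'en': {'2': '{0} and {1}', 'start': '{0}, {1}',
           'middle': '{0}, {1}', 'end': '{0}, and {1}'},
    'fr': {'2': '{0} et {1}', 'start': '{0}, {1}',
           'middle': '{0}, {1}', 'end': '{0} et {1}'},
}


def join_list(items, lang):
    """Join items by folding the end, middle, and start patterns from right to left."""
    patterns = LIST_PATTERNS[lang]
    if not items:
        return ''
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return patterns['2'].format(*items)
    # Combine the last two items, then prepend the remaining ones
    result = patterns['end'].format(items[-2], items[-1])
    for item in reversed(items[1:-2]):
        result = patterns['middle'].format(item, result)
    return patterns['start'].format(items[0], result)


items = ['apples', 'oranges', 'bananas']
print(join_list(items, 'en'))                 # apples, oranges, and bananas
print(join_list(items, 'fr'))                 # apples, oranges et bananas
print(join_list(['apples', 'oranges'], 'en')) # apples and oranges
```

Keeping the two-item case separate matters because many languages, English included, punctuate pairs differently from longer lists.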

Display Names and Locale Names

CLDR includes localized names for languages, scripts, regions, and currencies, translated into each locale it covers:

// Names of languages in different display languages
new Intl.DisplayNames(['en'], { type: 'language' }).of('de');  // 'German'
new Intl.DisplayNames(['de'], { type: 'language' }).of('de');  // 'Deutsch'
new Intl.DisplayNames(['ja'], { type: 'language' }).of('de');  // 'ドイツ語'
new Intl.DisplayNames(['ko'], { type: 'language' }).of('de');  // '독일어'

// Region names
new Intl.DisplayNames(['en'], { type: 'region' }).of('JP');  // 'Japan'
new Intl.DisplayNames(['ja'], { type: 'region' }).of('JP');  // '日本'
new Intl.DisplayNames(['ar'], { type: 'region' }).of('JP');  // 'اليابان'

// Currency names
new Intl.DisplayNames(['en'], { type: 'currency' }).of('JPY');  // 'Japanese Yen'
new Intl.DisplayNames(['ja'], { type: 'currency' }).of('JPY');  // '日本円'

// Script names
new Intl.DisplayNames(['en'], { type: 'script' }).of('Hang');  // 'Hangul'
new Intl.DisplayNames(['ko'], { type: 'script' }).of('Hang');  // '한글'

CLDR in Backend Systems

Django and Babel

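Django's built-in django.utils.formats and django.utils.translation do not read CLDR directly. They rely on per-locale format definitions that ship with Django (the formats.py modules under django/conf/locale) and on gettext catalogs. Projects that want CLDR-backed formatting typically add Babel alongside them: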

# Django template usage
# {% load humanize %}
# {{ value|intcomma }}     → locale-aware thousands separator
# {{ value|floatformat:2}} → locale-aware decimal formatting

# Python code (run inside a configured Django project)
from datetime import date
from django.utils.formats import number_format, date_format
from django.utils import translation

with translation.override('de'):
    # force_grouping applies the separator even when USE_THOUSAND_SEPARATOR is off
    print(number_format(1234567.89, decimal_pos=2, force_grouping=True))  # '1.234.567,89'
    # 'DATE_FORMAT' picks the active locale's own date format
    print(date_format(date(2024, 3, 25), 'DATE_FORMAT'))
    # '25. März 2024'

with translation.override('ja'):
    print(date_format(date(2024, 3, 25), 'Y年n月j日'))
    # '2024年3月25日'

Database and CLDR Data Files

CLDR releases its data as both XML and JSON. For custom applications or high-performance needs:

# Access CLDR JSON data directly
# npm install cldr-numbers-modern (Node.js)
# or clone https://github.com/unicode-org/cldr-json

import json

# Load CLDR number data for German
# (assuming the cldr-json data directory is available)
with open('cldr-numbers-modern/main/de/numbers.json', encoding='utf-8') as f:
    de_numbers = json.load(f)

symbols = de_numbers['main']['de']['numbers']['symbols-numberSystem-latn']
print(symbols['decimal'])   # ','
print(symbols['group'])     # '.'

patterns = de_numbers['main']['de']['numbers']['decimalFormats-numberSystem-latn']
print(patterns['standard'])  # '#,##0.###'

The CLDR Survey Tool

CLDR data is contributed and verified through the Survey Tool, a web interface where vetted contributors from each language community review and approve locale data. This crowd-sourced verification process is how CLDR maintains data accuracy across hundreds of locales — native speakers can flag incorrect translations of month names, wrong decimal separators, or culturally inappropriate date formats.

If you find incorrect locale data in CLDR — and native speakers sometimes do — the process for submitting corrections is described at cldr.unicode.org.

Putting It All Together

A fully internationalized web application typically uses:

// Comprehensive i18n setup (React/Node.js example pattern)
function formatForLocale(locale) {
    const number = 1234567.89;
    const date = new Date(2024, 2, 25);
    const amount = 1234.56;
    const items = ['apple', 'banana', 'cherry'];

    return {
        number: new Intl.NumberFormat(locale).format(number),
        currency: new Intl.NumberFormat(locale, {
            style: 'currency',
            currency: 'USD'
        }).format(amount),
        date: new Intl.DateTimeFormat(locale, {
            dateStyle: 'long'
        }).format(date),
        list: new Intl.ListFormat(locale, {
            type: 'conjunction'
        }).format(items),
        plural: (() => {
            const pr = new Intl.PluralRules(locale);
            const n = 5;
            const forms = {
                en: { one: 'file', other: 'files' },
                de: { one: 'Datei', other: 'Dateien' },
                ru: { one: 'файл', few: 'файла', many: 'файлов', other: 'файлов' }
            };
            const form = pr.select(n);
            const lang = new Intl.Locale(locale).language;
            return `${n} ${(forms[lang] || forms.en)[form]}`;
        })()
    };
}

console.log(formatForLocale('en-US'));
// { number: '1,234,567.89', currency: '$1,234.56',
//   date: 'March 25, 2024', list: 'apple, banana, and cherry',
//   plural: '5 files' }

console.log(formatForLocale('de-DE'));
// { number: '1.234.567,89', currency: '1.234,56 $',
//   date: '25. März 2024', list: 'apple, banana und cherry',
//   plural: '5 Dateien' }


Summary

CLDR is the invisible infrastructure behind every localized application:

- Number formatting: Decimal separators, grouping separators, and notation vary by locale
- Date and time patterns: The order, names, and formats of date components differ widely
- Plural rules: English has two forms; Arabic has six; Japanese has one
- Collation: Alphabetical order is language-dependent and often counterintuitive
- List formatting: Conjunctions and separators vary by language
- Display names: The names of languages, regions, and currencies in each language

In JavaScript, the Intl API (available in all modern browsers and Node.js) provides direct access to CLDR-backed formatting. In Python, Babel is the primary CLDR-based library. In Java, ICU4J (the Java edition of ICU) provides the most complete implementation. All of them ultimately derive their locale data from the same CLDR source.
