
Unicode Normalization

Encoding
Definition

The process of converting Unicode text to a standard form (NFC, NFD, NFKC, NFKD) to ensure consistent comparison and storage.

Unicode normalization is the process of converting text to a standard canonical form so that equivalent character sequences compare as equal. Because Unicode allows multiple ways to represent the same visible character, normalization is essential for correct string comparison, searching, sorting, and storage.

The Problem: Multiple Representations

Consider the character é (e with an acute accent). Unicode can encode it in two ways:

  • Precomposed: U+00E9 é -- a single code point representing the combined character
  • Decomposed: U+0065 e + U+0301 (combining acute accent) -- the base letter followed by a combining mark

Both render identically, but as byte sequences they are different. Without normalization, a search might miss results stored in the other form.

The Four Normalization Forms

Unicode defines four normalization forms:

  • NFC (Canonical Decomposition, followed by Canonical Composition): decompose, then recompose into precomposed forms. The shortest form for most text.
  • NFD (Canonical Decomposition): fully decompose all characters into base characters plus combining marks.
  • NFKC (Compatibility Decomposition, followed by Canonical Composition): additionally fold compatibility variants (e.g., the fi ligature to "fi", circled ① to "1").
  • NFKD (Compatibility Decomposition): fully decompose, including compatibility mappings.

NFC is the recommended form for most applications. It is the form recommended for web content and the one used by most databases and modern file systems.

NFD is used by some macOS APIs internally (the HFS+ file system historically stored filenames in NFD), which can cause subtle bugs when filenames are compared across systems.

NFKC/NFKD additionally normalize compatibility equivalents: superscripts, fractions, Roman numeral characters, and ligatures are mapped to their plain equivalents. Useful for search indexing and text analysis, but lossy -- it discards formatting distinctions.
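The compatibility folding can be observed directly with Python's standard unicodedata module; the sample characters below illustrate the mappings mentioned above:

```python
import unicodedata

# Compatibility characters and what NFKC folds them to.
samples = {
    '\ufb01': 'fi ligature',
    '\u2460': 'circled digit one',
    '\u2168': 'Roman numeral nine',
    '\u00b2': 'superscript two',
}
for ch, name in samples.items():
    print(f'{name}: {ch!r} -> {unicodedata.normalize("NFKC", ch)!r}')

# NFC leaves these characters untouched; only the K forms fold them.
print(unicodedata.normalize('NFC', '\ufb01') == '\ufb01')  # True
```

Note the loss of information: once the fi ligature is folded to "fi", there is no way to recover the original character, which is why NFKC is best kept to search keys rather than stored text.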

Normalization in Code

# Python: unicodedata.normalize()
import unicodedata

e_precomposed = '\u00e9'         # e-acute as one code point
e_decomposed  = 'e\u0301'       # e + combining acute

print(e_precomposed == e_decomposed)  # False!

nfc_pre  = unicodedata.normalize('NFC', e_precomposed)
nfc_dec  = unicodedata.normalize('NFC', e_decomposed)
print(nfc_pre == nfc_dec)  # True
print(len(nfc_pre))        # 1
print(len(unicodedata.normalize('NFD', e_precomposed)))  # 2

// JavaScript: String.prototype.normalize()
const precomposed = '\u00e9';
const decomposed  = 'e\u0301';

console.log(precomposed === decomposed);  // false
console.log(
  precomposed.normalize('NFC') === decomposed.normalize('NFC')
);  // true

// Check form
console.log(decomposed.normalize('NFC').length);  // 1
console.log(decomposed.normalize('NFD').length);  // 2

Canonical Ordering

Normalization also enforces a canonical ordering of combining marks. When multiple combining characters follow a base character, normalization sorts marks with different Canonical Combining Class (ccc) values into ascending ccc order (marks that share a class keep their relative order, since their order can be visually significant). For example, a + combining dot below (ccc 220) + combining circumflex (ccc 230) and a + combining circumflex + combining dot below both normalize to the same sequence: the marks attach to different sides of the base letter, so either input order renders identically.
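The reordering can be checked in Python; combining dot below and combining circumflex are chosen here because their combining classes differ:

```python
import unicodedata

DOT_BELOW  = '\u0323'  # combining dot below, ccc 220
CIRCUMFLEX = '\u0302'  # combining circumflex, ccc 230

a = 'a' + CIRCUMFLEX + DOT_BELOW
b = 'a' + DOT_BELOW + CIRCUMFLEX

print(a == b)  # False: the marks appear in different orders
print(unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b))  # True
print(unicodedata.combining(DOT_BELOW), unicodedata.combining(CIRCUMFLEX))  # 220 230
```

After NFD, both strings carry the dot below first because its class (220) is lower, so byte-level comparison succeeds.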

When to Normalize

  • At input boundaries: normalize user input to NFC before storing in a database
  • Before string comparison: normalize both sides to the same form
  • In search: consider NFKC to match ligatures with their component letters and typographic numbers with plain digits
  • File system operations: be aware that macOS may return NFD filenames; normalize before comparing with user-supplied strings
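The file-system caveat can be sketched with a small comparison helper (the name same_filename is invented for illustration; the general pattern is to bring both sides to one form before comparing):

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare two filenames while ignoring normalization differences.

    Hypothetical helper: macOS APIs may hand back NFD filenames while
    user input is typically NFC, so compare both in a single form.
    """
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# A name as listed by the OS (NFD) vs. the same name typed by a user (NFC):
from_fs   = 'cafe\u0301.txt'  # 'café.txt', decomposed
from_user = 'caf\u00e9.txt'   # 'café.txt', precomposed

print(from_fs == from_user)               # False
print(same_filename(from_fs, from_user))  # True
```

The same normalize-then-compare pattern applies at every boundary in the list above; the only choice is which form to standardize on.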
