
Unicode Normalization

Encoding
Definition

The process of converting Unicode text to a standard form (NFC, NFD, NFKC, NFKD) to ensure consistent comparison and storage.

Unicode normalization is the process of converting text to a standard canonical form so that equivalent character sequences compare as equal. Because Unicode allows multiple ways to represent the same visible character, normalization is essential for correct string comparison, searching, sorting, and storage.

The Problem: Multiple Representations

Consider the character é (e with an acute accent). Unicode can encode it in two ways:

  • Precomposed: U+00E9 é -- a single code point representing the combined character
  • Decomposed: U+0065 e + U+0301 (combining acute accent) -- the base letter followed by a combining mark

Both render identically, but as byte sequences they are different. Without normalization, a search might miss results stored in the other form.

The Four Normalization Forms

Unicode defines four normalization forms:

  • NFC (Canonical Decomposition, followed by Canonical Composition): decompose, then recompose into precomposed forms. The shortest form for most text.
  • NFD (Canonical Decomposition): fully decompose all characters into base characters plus combining marks.
  • NFKC (Compatibility Decomposition, followed by Canonical Composition): additionally fold compatibility variants (e.g., the fi ligature to "fi", circled ① to "1").
  • NFKD (Compatibility Decomposition): fully decompose, including compatibility mappings.

NFC is the recommended form for most applications. It is the form recommended for web content and the one used by most databases and modern file systems.

NFD is used by some macOS APIs internally (the HFS+ file system historically stored filenames in NFD), which can cause subtle bugs when filenames are compared across systems.

NFKC/NFKD additionally normalize compatibility equivalents: superscripts, fractions, Roman numeral characters, and ligatures are mapped to their plain equivalents. Useful for search indexing and text analysis, but lossy -- it discards formatting distinctions.
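The compatibility folding can be observed directly with Python's standard unicodedata module; the sample characters below illustrate the mappings mentioned above:

```python
import unicodedata

# Compatibility characters and what NFKC folds them to.
samples = {
    '\ufb01': 'fi ligature',
    '\u2460': 'circled digit one',
    '\u2168': 'Roman numeral nine',
    '\u00b2': 'superscript two',
}
for ch, name in samples.items():
    print(f'{name}: {ch!r} -> {unicodedata.normalize("NFKC", ch)!r}')

# NFC leaves these characters untouched; only the K forms fold them.
print(unicodedata.normalize('NFC', '\ufb01') == '\ufb01')  # True
```

Note the loss of information: once the fi ligature is folded to "fi", there is no way to recover the original character, which is why NFKC is best kept to search keys rather than stored text.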

Normalization in Code

# Python: unicodedata.normalize()
import unicodedata

e_precomposed = '\u00e9'         # e-acute as one code point
e_decomposed  = 'e\u0301'       # e + combining acute

print(e_precomposed == e_decomposed)  # False!

nfc_pre  = unicodedata.normalize('NFC', e_precomposed)
nfc_dec  = unicodedata.normalize('NFC', e_decomposed)
print(nfc_pre == nfc_dec)  # True
print(len(nfc_pre))        # 1
print(len(unicodedata.normalize('NFD', e_precomposed)))  # 2

// JavaScript: String.prototype.normalize()
const precomposed = '\u00e9';
const decomposed  = 'e\u0301';

console.log(precomposed === decomposed);  // false
console.log(
  precomposed.normalize('NFC') === decomposed.normalize('NFC')
);  // true

// Check form
console.log(decomposed.normalize('NFC').length);  // 1
console.log(decomposed.normalize('NFD').length);  // 2

Canonical Ordering

Normalization also enforces a canonical ordering of combining marks. When multiple combining characters follow a base character, normalization sorts marks with different Canonical Combining Class (ccc) values into ascending ccc order (marks that share a class keep their relative order, since their order can be visually significant). For example, a + combining dot below (ccc 220) + combining circumflex (ccc 230) and a + combining circumflex + combining dot below both normalize to the same sequence: the marks attach to different sides of the base letter, so either input order renders identically.
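The reordering can be checked in Python; combining dot below and combining circumflex are chosen here because their combining classes differ:

```python
import unicodedata

DOT_BELOW  = '\u0323'  # combining dot below, ccc 220
CIRCUMFLEX = '\u0302'  # combining circumflex, ccc 230

a = 'a' + CIRCUMFLEX + DOT_BELOW
b = 'a' + DOT_BELOW + CIRCUMFLEX

print(a == b)  # False: the marks appear in different orders
print(unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b))  # True
print(unicodedata.combining(DOT_BELOW), unicodedata.combining(CIRCUMFLEX))  # 220 230
```

After NFD, both strings carry the dot below first because its class (220) is lower, so byte-level comparison succeeds.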

When to Normalize

  • At input boundaries: normalize user input to NFC before storing in a database
  • Before string comparison: normalize both sides to the same form
  • In search: consider NFKC to match ligatures with their component letters and typographic numbers with plain digits
  • File system operations: be aware that macOS may return NFD filenames; normalize before comparing with user-supplied strings
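The file-system caveat can be sketched with a small comparison helper (the name same_filename is invented for illustration; the general pattern is to bring both sides to one form before comparing):

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare two filenames while ignoring normalization differences.

    Hypothetical helper: macOS APIs may hand back NFD filenames while
    user input is typically NFC, so compare both in a single form.
    """
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# A name as listed by the OS (NFD) vs. the same name typed by a user (NFC):
from_fs   = 'cafe\u0301.txt'  # 'café.txt', decomposed
from_user = 'caf\u00e9.txt'   # 'café.txt', precomposed

print(from_fs == from_user)               # False
print(same_filename(from_fs, from_user))  # True
```

The same normalize-then-compare pattern applies at every boundary in the list above; the only choice is which form to standardize on.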
