Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained
Here is a deceptively simple puzzle. Consider these two strings:
String A: "café"
String B: "café"
They look identical. If you copy both into a text editor, you cannot distinguish them. But they are not equal as byte sequences — and in most programming languages, a naive string comparison will report them as different.
This is the problem that Unicode normalization solves. Understanding normalization is essential for any developer working with user input, internationalized databases, or string comparison.
Why the Same Character Has Two Representations
The character é (e with acute accent) can be encoded in two valid ways in Unicode:
- Precomposed: A single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
- Decomposed: Two code points — U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT)
Both representations are valid Unicode. Both render identically. But they are byte-different:
import unicodedata
# Precomposed form
s1 = '\u00e9' # Single code point: é
# Decomposed form
s2 = '\u0065\u0301' # Two code points: e + combining accent
print(len(s1)) # 1
print(len(s2)) # 2
print(s1 == s2) # False — different bytes!
print(unicodedata.name(s1)) # LATIN SMALL LETTER E WITH ACUTE
# s2 is two characters:
for c in s2:
    print(unicodedata.name(c))
# LATIN SMALL LETTER E
# COMBINING ACUTE ACCENT
This duality is not an accident or an oversight. Unicode deliberately preserves both forms for compatibility with legacy encodings that only had one form or the other. But it means that character comparison requires a normalization step.
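The normalization step can be wrapped in a small comparison helper (the function name `canonically_equal` is illustrative, not a standard API):

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """True when a and b are the same text under canonical equivalence."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# Precomposed é vs decomposed e + combining acute accent
print('\u00e9' == '\u0065\u0301')                   # False — raw code points differ
print(canonically_equal('\u00e9', '\u0065\u0301'))  # True
```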
The Four Normalization Forms
Unicode defines four normalization forms, organized along two dimensions:
Decomposition type:
- Canonical: Preserves meaning exactly; decomposes only when representations are canonically equivalent
- Compatibility: Also decomposes characters that are "compatible but not identical" in meaning (ligatures, width variants, etc.)

Composition step:
- D (Decomposed): Apply decomposition; do not re-compose
- C (Composed): Apply decomposition, then re-compose where possible
This gives four forms:
| Form | Decomposition | Composition | Full Name |
|---|---|---|---|
| NFC | Canonical | Yes | Normalization Form Canonical Composition |
| NFD | Canonical | No | Normalization Form Canonical Decomposition |
| NFKC | Compatibility | Yes | Normalization Form Compatibility Composition |
| NFKD | Compatibility | No | Normalization Form Compatibility Decomposition |
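To see both dimensions at work, here is one string run through all four forms. The sample `ﬁancé` combines a ligature (compatibility decomposition only) with a precomposed accent (canonical decomposition):

```python
import unicodedata

s = '\ufb01anc\u00e9'  # 'ﬁancé': fi ligature (U+FB01) + precomposed é (U+00E9)
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    out = unicodedata.normalize(form, s)
    print(form, [f'U+{ord(c):04X}' for c in out])
# NFC  ['U+FB01', 'U+0061', 'U+006E', 'U+0063', 'U+00E9']
# NFD  ['U+FB01', 'U+0061', 'U+006E', 'U+0063', 'U+0065', 'U+0301']
# NFKC ['U+0066', 'U+0069', 'U+0061', 'U+006E', 'U+0063', 'U+00E9']
# NFKD ['U+0066', 'U+0069', 'U+0061', 'U+006E', 'U+0063', 'U+0065', 'U+0301']
```

Note that the ligature survives NFC and NFD untouched: canonical normalization never applies compatibility mappings.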
Canonical Equivalence: NFC and NFD
Canonical equivalence means two representations are identical in meaning and should always be treated as the same character. The precomposed é (U+00E9) and the decomposed e+combining-accent sequence are canonically equivalent.
NFD — Canonical Decomposition
NFD decomposes every character into its base letter plus combining characters, in a canonical order. Precomposed characters are broken apart; base + combining sequences are ordered consistently.
import unicodedata
text = "café"
nfd = unicodedata.normalize('NFD', text)
print(len(text)) # 4 (c-a-f-é as precomposed)
print(len(nfd)) # 5 (c-a-f-e-combining_accent)
# Show code points
for c in nfd:
    print(f'U+{ord(c):04X} {unicodedata.name(c, "?")}')
# U+0063 LATIN SMALL LETTER C
# U+0061 LATIN SMALL LETTER A
# U+0066 LATIN SMALL LETTER F
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT
NFC — Canonical Decomposition, then Recomposition
NFC first decomposes into NFD, then re-composes combining sequences into precomposed characters where a precomposed form exists.
nfc = unicodedata.normalize('NFC', text)
print(len(nfc)) # 4 — é is recomposed to single U+00E9
# Normalize decomposed text back to precomposed
decomposed = '\u0065\u0301' # e + combining accent
composed = unicodedata.normalize('NFC', decomposed)
print(composed == 'é') # True
print(ord(composed)) # 233 (U+00E9)
NFC is the preferred form for most purposes. It is what the W3C recommends for web content and what most keyboard input methods and browsers produce when you type accented characters. It is compact (fewer code points than NFD) and widely expected. (One notable exception: macOS's legacy HFS+ file system stored file names in a close variant of NFD.)
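One property worth relying on here: normalization is idempotent — normalizing already-normalized text is a no-op — so it is safe to normalize defensively at every boundary. A quick sketch:

```python
import unicodedata

s = 'cafe\u0301'                           # decomposed input
once = unicodedata.normalize('NFC', s)     # composed: 'café'
twice = unicodedata.normalize('NFC', once)
print(once == twice)  # True — NFC of NFC text is unchanged
```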
NFD is useful for string processing tasks where you want to operate on base letters separately from their accents — for example, stripping all diacritics:
def remove_diacritics(text):
    """Remove all diacritical marks from text."""
    nfd = unicodedata.normalize('NFD', text)
    # Mn = Mark, Nonspacing (combining marks)
    return ''.join(c for c in nfd
                   if unicodedata.category(c) != 'Mn')
print(remove_diacritics("café résumé naïve"))
# 'cafe resume naive'
Compatibility Equivalence: NFKC and NFKD
Compatibility equivalence is a weaker relationship. Compatible characters look similar and can be substituted in some contexts, but they carry distinct semantic information. For example:
- Ligature fi (ﬁ, U+FB01) is a typographic variant of the two-character sequence "fi" — same meaning, different glyph
- Fullwidth A (Ａ, U+FF21) is the same letter as A but displayed in a wide CJK-style box
- Superscript 2 (², U+00B2) is a compatibility form of the digit 2
- Roman numeral IV (Ⅳ, U+2163) is a compatibility form of the letter sequence "IV"
NFKC and NFKD apply compatibility decomposition, which maps these variants to their simpler equivalents.
text_with_compat = "ﬁnd the ² factorial of Ⅳ in Ａ"
nfkc = unicodedata.normalize('NFKC', text_with_compat)
print(nfkc)
# 'find the 2 factorial of IV in A'
# ﬁ ligature → f + i
# ² → 2
# Ⅳ → IV
# Ａ → A
NFKD and NFKC Compared
NFKD applies compatibility decomposition without re-composing. NFKC applies compatibility decomposition and then re-composes (same as NFC but with the broader compatibility decomposition step first).
# Combining both: NFKD then strip combining marks = aggressive ASCII folding
def ascii_fold(text):
    """Map accented/variant chars toward ASCII equivalents."""
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd
                   if not unicodedata.combining(c))
print(ascii_fold("ﬁnd ² × Ａ in café"))
# 'find 2 × A in cafe'
# Note: × (U+00D7) has no compatibility mapping, so it survives folding
Warning: NFKC/NFKD are lossy. The transformation from ² to 2, or from ﬁ to fi, destroys information. Use them only in contexts where you genuinely do not care about these distinctions — search indexing, fuzzy matching, slug generation — not for storing or displaying text faithfully.
Canonical Ordering of Combining Marks
There is a subtlety in NFD beyond simple decomposition: combining characters must appear in canonical order. The canonical order is determined by the Canonical Combining Class (CCC) property, a number assigned to each combining character.
A sequence of combining characters attached to a base is sorted by CCC value in ascending order. This ensures that different orderings of the same combining marks produce the same NFD result.
# Two orderings of the same combining marks on 'a'
# CCC 230 = above, CCC 220 = below
s1 = 'a\u0308\u0325' # a + diaeresis (above, CCC 230) + ring below (CCC 220)
s2 = 'a\u0325\u0308' # a + ring below (CCC 220) + diaeresis (above, CCC 230)
print(s1 == s2) # False — different byte order
# After NFD normalization, canonical order is applied
nfd1 = unicodedata.normalize('NFD', s1)
nfd2 = unicodedata.normalize('NFD', s2)
print(nfd1 == nfd2) # True — same canonical form
# Get CCC values
print(unicodedata.combining('\u0308')) # 230 (diaeresis, above)
print(unicodedata.combining('\u0325')) # 220 (ring, below)
Practical Recommendations
Use NFC for Storage
When storing user-supplied text — in a database, a file, an API response — normalize to NFC first. This ensures:
- Consistent comparison: name == stored_name will work correctly
- Consistent indexing: Database indexes work correctly on NFC text
- Predictable length: Fewer surprises with len() or column width constraints
import unicodedata
def normalize_input(text: str) -> str:
    """Normalize user input to NFC before storage."""
    return unicodedata.normalize('NFC', text.strip())
# In a web form handler:
username = normalize_input(request.POST['username'])
Use NFKC for Search and Comparison
For search, slug generation, or fuzzy matching where you want ② and 2 to match, or fi and fi to match:
def search_normalize(text: str) -> str:
    """Normalize text for search index."""
    return unicodedata.normalize('NFKC', text).casefold()
# 'café', 'CAFÉ', 'café' (decomposed) all normalize to 'café'
# (casefold handles the case folding, NFKC handles composition variants)
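A quick self-contained check that the variants collapse to one search key (redefining the helper so the snippet runs on its own):

```python
import unicodedata

def search_normalize(text: str) -> str:
    return unicodedata.normalize('NFKC', text).casefold()

# NFC, uppercase, and decomposed spellings of the same word
variants = ['caf\u00e9', 'CAF\u00c9', 'cafe\u0301']
keys = {search_normalize(v) for v in variants}
print(len(keys))  # 1 — all three map to the same key
```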
JavaScript Normalization
JavaScript strings expose normalization via String.prototype.normalize():
// NFC normalization
const s1 = '\u00e9'; // precomposed é
const s2 = '\u0065\u0301'; // decomposed e + combining accent
console.log(s1 === s2); // false
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true
// Common pattern: normalize before comparison
function normalizeForStorage(str) {
  return str.normalize('NFC').trim();
}
function normalizeForSearch(str) {
  return str.normalize('NFKC').toLowerCase();
}
// Strip diacritics (NFD, then remove combining marks)
function stripDiacritics(str) {
  return str.normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}
console.log(stripDiacritics('café résumé')); // 'cafe resume'
Database Considerations
Most modern databases store text in Unicode, but their collation rules affect comparison. In PostgreSQL:
-- Check current database encoding
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database();
-- Unicode-aware case-insensitive search with ILIKE
SELECT * FROM users WHERE username ILIKE 'cafe%';
-- For proper normalization, normalize before insertion
-- (PostgreSQL does not auto-normalize; do it in application code)
In Python with Django, normalize before saving:
from django.db import models
import unicodedata
class UserProfile(models.Model):
    name = models.CharField(max_length=200)
    def save(self, *args, **kwargs):
        self.name = unicodedata.normalize('NFC', self.name)
        super().save(*args, **kwargs)
Confusables and Security
Normalization intersects with security through Unicode confusables — characters that look visually similar but are distinct code points. For example:
- Latin a (U+0061) vs Cyrillic а (U+0430) — identical appearance in most fonts
- Latin o (U+006F) vs Greek ο (U+03BF)
- Latin c vs Cyrillic с, Latin e vs Cyrillic е
NFKC normalization does NOT resolve confusables — they are distinct code points with no compatibility relationship. Confusable detection requires additional tools like the Unicode Security Mechanisms (UTS #39).
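A sketch verifying this with the first pair listed above:

```python
import unicodedata

latin_a, cyrillic_a = '\u0061', '\u0430'  # Latin a vs Cyrillic а
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    same = unicodedata.normalize(form, latin_a) == unicodedata.normalize(form, cyrillic_a)
    print(form, same)  # False in every form — no equivalence relation exists
```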
However, normalization does prevent one class of attack: using multiple equivalent representations of the same text to bypass filters. Pure ASCII strings like "<script>" have no alternative canonical form, but accented text does — an exact-match blocklist that stores "josé" in NFC will wave through the NFD-decomposed spelling — and compatibility variants such as fullwidth "＜script＞" fold to "<script>" only under NFKC. Always normalize before processing or comparing security-sensitive strings.
# Safe pattern: normalize before filtering
def safe_check(user_input: str) -> bool:
    normalized = unicodedata.normalize('NFC', user_input)
    # Now compare, filter, or validate against the normalized form
    return '<script>' not in normalized.lower()
The Stability Guarantee
One final reassurance: Unicode normalization forms are stable. A string normalized to NFC in Unicode 6.0 is still valid NFC in Unicode 16.0. The Unicode Consortium guarantees that new character additions will not invalidate existing normalized strings.
This means you can normalize text once and store the normalized form, confident that it will remain normalized across Unicode version upgrades. The Unicode Version History article covers how Unicode manages these stability guarantees.
Summary
| Form | Use When |
|---|---|
| NFC | Storage, display, file names (default choice for most uses) |
| NFD | Stripping diacritics, linguistic analysis of combining marks |
| NFKC | Search normalization, slug generation, fuzzy comparison |
| NFKD | Aggressive folding to base characters (lossy) |
The key takeaway: always normalize user input to NFC before storing, indexing, or comparing. For search and fuzzy matching, consider NFKC. Never rely on raw byte comparison for Unicode strings without normalizing first.
Use our Character Counter to inspect the Unicode properties and normalization form of any text.
Next in Series: Unicode Properties and Categories: Classifying Every Character — Explore the metadata attached to every Unicode character, from General Category to Script and Bidi class.