
Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained

Here is a deceptively simple puzzle. Consider these two strings:

String A: "café"
String B: "café"

They look identical. If you copy both into a text editor, you cannot distinguish them. But they are not equal as byte sequences — and in most programming languages, a naive string comparison will report them as different.

This is the problem that Unicode normalization solves. Understanding normalization is essential for any developer working with user input, internationalized databases, or string comparison.

Why the Same Character Has Two Representations

The character é (e with acute accent) can be encoded in two valid ways in Unicode:

  1. Precomposed: A single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
  2. Decomposed: Two code points — U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT)

Both representations are valid Unicode. Both render identically. But they are byte-different:

import unicodedata

# Precomposed form
s1 = '\u00e9'            # Single code point: é
# Decomposed form
s2 = '\u0065\u0301'      # Two code points: e + combining accent

print(len(s1))   # 1
print(len(s2))   # 2
print(s1 == s2)  # False — different bytes!

print(unicodedata.name(s1))   # LATIN SMALL LETTER E WITH ACUTE
# s2 is two characters:
for c in s2:
    print(unicodedata.name(c))
# LATIN SMALL LETTER E
# COMBINING ACUTE ACCENT

This duality is not an accident or an oversight. Unicode deliberately preserves both forms for compatibility with legacy encodings that only had one form or the other. But it means that character comparison requires a normalization step.

The Four Normalization Forms

Unicode defines four normalization forms, organized along two dimensions:

Decomposition type:

  • Canonical: Preserves meaning exactly; decomposes only when representations are canonically equivalent
  • Compatibility: Also decomposes characters that are "compatible but not identical" in meaning (ligatures, width variants, etc.)

Composition step:

  • D (Decomposed): Apply decomposition, do not re-compose
  • C (Composed): Apply decomposition, then re-compose where possible

This gives four forms:

Form   Decomposition   Composition   Full Name
NFC    Canonical       Yes           Normalization Form Canonical Composition
NFD    Canonical       No            Normalization Form Canonical Decomposition
NFKC   Compatibility   Yes           Normalization Form Compatibility Composition
NFKD   Compatibility   No            Normalization Form Compatibility Decomposition
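A quick way to see how the four forms differ is to run one string through all of them. The sample below contains a compatibility variant (the fi ligature) and a canonical variant (precomposed é):

```python
import unicodedata

s = '\ufb01nd caf\u00e9'   # fi ligature (U+FB01) + precomposed é (U+00E9)

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    out = unicodedata.normalize(form, s)
    points = ' '.join(f'U+{ord(c):04X}' for c in out)
    print(form, len(out), points)

# NFC leaves both intact (the ligature has no canonical decomposition);
# NFD splits only the é; NFKC splits the ligature but recomposes é;
# NFKD splits both.
```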

Canonical Equivalence: NFC and NFD

Canonical equivalence means two representations are identical in meaning and should always be treated as the same character. The precomposed é (U+00E9) and the decomposed e+combining-accent sequence are canonically equivalent.

NFD — Canonical Decomposition

NFD decomposes every character into its base letter plus combining characters, in a canonical order. Precomposed characters are broken apart; base + combining sequences are ordered consistently.

import unicodedata

text = "café"

nfd = unicodedata.normalize('NFD', text)
print(len(text))   # 4 (c-a-f-é as precomposed)
print(len(nfd))    # 5 (c-a-f-e-combining_accent)

# Show code points
for c in nfd:
    print(f'U+{ord(c):04X} {unicodedata.name(c, "?")}')
# U+0063 LATIN SMALL LETTER C
# U+0061 LATIN SMALL LETTER A
# U+0066 LATIN SMALL LETTER F
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT

NFC — Canonical Decomposition, then Recomposition

NFC first decomposes into NFD, then re-composes combining sequences into precomposed characters where a precomposed form exists.

nfc = unicodedata.normalize('NFC', text)
print(len(nfc))    # 4 — é is recomposed to single U+00E9

# Normalize decomposed text back to precomposed
decomposed = '\u0065\u0301'  # e + combining accent
composed = unicodedata.normalize('NFC', decomposed)
print(composed == 'é')  # True
print(ord(composed))    # 233 (U+00E9)

NFC is the preferred form for most purposes. It is what most keyboards, operating systems, and web browsers produce when you type accented characters, it is compact (fewer code points than NFD), and it is what other software generally expects. One caveat on Apple platforms: the legacy HFS+ filesystem stored filenames in a decomposed form close to NFD, so filenames read back from disk may not be NFC.

NFD is useful for string processing tasks where you want to operate on base letters separately from their accents — for example, stripping all diacritics:

def remove_diacritics(text):
    """Remove all diacritical marks from text."""
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(c for c in nfd
                   if unicodedata.category(c) != 'Mn')
    # Mn = Mark, Nonspacing (combining marks)

print(remove_diacritics("café résumé naïve"))
# 'cafe resume naive'

Compatibility Equivalence: NFKC and NFKD

Compatibility equivalence is a weaker relationship. Compatible characters look similar and can be substituted in some contexts, but they carry distinct semantic information. For example:

  • The ligature fi (U+FB01) is a typographic variant of the two-character sequence "fi" — same meaning, different glyph
  • Fullwidth Ａ (U+FF21) is the same letter as A but displayed in a wide CJK-style cell
  • Superscript ² (U+00B2) is a compatibility form of the digit 2
  • Roman numeral Ⅳ (U+2163) is a compatibility form of the letter sequence "IV"

NFKC and NFKD apply compatibility decomposition, which maps these variants to their simpler equivalents.

# Escapes used so the compatibility characters are unambiguous:
# fi ligature, superscript 2, Roman numeral IV, fullwidth A
text_with_compat = '\ufb01nd the \u00b2 factorial of \u2163 in \uff21'

nfkc = unicodedata.normalize('NFKC', text_with_compat)
print(nfkc)
# 'find the 2 factorial of IV in A'

# ﬁ (U+FB01) → f + i
# ² (U+00B2) → 2
# Ⅳ (U+2163) → IV
# Ａ (U+FF21) → A

NFKD and NFKC Compared

NFKD applies compatibility decomposition without re-composing. NFKC applies compatibility decomposition and then re-composes (same as NFC but with the broader compatibility decomposition step first).

# Combining both: NFKD then strip non-spacing marks = aggressive ASCII folding
def ascii_fold(text):
    """Map accented/variant chars toward ASCII equivalents."""
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd
                   if not unicodedata.combining(c))

print(ascii_fold('\ufb01nd \u00b2 \u00d7 \uff21 in café'))
# 'find 2 × A in cafe'
# Note: × (U+00D7 MULTIPLICATION SIGN) has no compatibility mapping, so it
# survives the fold — NFKD folding is not a full ASCII transliteration.

Warning: NFKC/NFKD are lossy. The transformation from ² to 2, or from fi to fi, destroys information. Use them only in contexts where you genuinely do not care about these distinctions — search indexing, fuzzy matching, slug generation — not for storing or displaying text faithfully.
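The lossiness is easy to demonstrate: once a compatibility variant has been folded, nothing in the result records what it originally was.

```python
import unicodedata

print(unicodedata.normalize('NFKC', 'x\u00b2'))   # 'x2' — the superscript is gone
print(unicodedata.normalize('NFKC', '\u2163'))    # 'IV' — indistinguishable from typed "IV"

# There is no inverse operation: normalizing the folded text again is a no-op
print(unicodedata.normalize('NFKC', 'x2'))        # 'x2'
```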

Canonical Ordering of Combining Marks

There is a subtlety in NFD beyond simple decomposition: combining characters must appear in canonical order. The canonical order is determined by the Canonical Combining Class (CCC) property, a number assigned to each combining character.

A sequence of combining characters attached to a base is sorted by CCC value in ascending order. This ensures that different orderings of the same combining marks produce the same NFD result.

# Two orderings of the same combining marks on 'a'
# CCC 230 = above, CCC 220 = below
s1 = 'a\u0308\u0325'   # a + diaeresis (above, CCC 230) + ring below (CCC 220)
s2 = 'a\u0325\u0308'   # a + ring below (CCC 220) + diaeresis (above, CCC 230)

print(s1 == s2)  # False — different byte order

# After NFD normalization, canonical order is applied
nfd1 = unicodedata.normalize('NFD', s1)
nfd2 = unicodedata.normalize('NFD', s2)
print(nfd1 == nfd2)  # True — same canonical form

# Get CCC values
print(unicodedata.combining('\u0308'))  # 230 (diaeresis, above)
print(unicodedata.combining('\u0325'))  # 220 (ring, below)

Practical Recommendations

Use NFC for Storage

When storing user-supplied text — in a database, a file, an API response — normalize to NFC first. This ensures: - Consistent comparison: name == stored_name will work correctly - Consistent indexing: Database indexes work correctly on NFC text - Predictable length: Fewer surprises with len() or column width constraints

import unicodedata

def normalize_input(text: str) -> str:
    """Normalize user input to NFC before storage."""
    return unicodedata.normalize('NFC', text.strip())

# In a web form handler:
username = normalize_input(request.POST['username'])

Use NFKC for Search and Comparison

For search, slug generation, or fuzzy matching where you want ② and 2 to match, or fi and fi to match:

def search_normalize(text: str) -> str:
    """Normalize text for search index."""
    return unicodedata.normalize('NFKC', text).casefold()

# 'café', 'CAFÉ', 'café' (decomposed) all normalize to 'café'
# (casefold handles the case folding, NFKC handles composition variants)
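To verify that claim concretely, here is the same function applied to three variants of the word, with the decomposed spelling written using an explicit combining accent (the definition is repeated so the snippet stands alone):

```python
import unicodedata

def search_normalize(text: str) -> str:
    """Normalize text for a search index."""
    return unicodedata.normalize('NFKC', text).casefold()

# precomposed, uppercase, and decomposed spellings of the same word
variants = ['caf\u00e9', 'CAF\u00c9', 'cafe\u0301']
keys = {search_normalize(v) for v in variants}
print(keys)   # a single search key: {'café'}
```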

JavaScript Normalization

JavaScript strings expose normalization via String.prototype.normalize():

// NFC normalization
const s1 = '\u00e9';           // precomposed é
const s2 = '\u0065\u0301';     // decomposed e + combining accent

console.log(s1 === s2);                      // false
console.log(s1.normalize('NFC') === s2.normalize('NFC'));  // true

// Common pattern: normalize before comparison
function normalizeForStorage(str) {
    return str.normalize('NFC').trim();
}

function normalizeForSearch(str) {
    return str.normalize('NFKC').toLowerCase();
}

// Strip diacritics: decompose with NFD, then remove combining marks
// (this regex covers the Combining Diacritical Marks block, U+0300–U+036F)
function stripDiacritics(str) {
    return str.normalize('NFD')
              .replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('café résumé'));  // 'cafe resume'

Database Considerations

Most modern databases store text in Unicode, but their collation rules affect comparison. In PostgreSQL:

-- Check current database encoding
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database();

-- Unicode-aware case-insensitive search with ILIKE
SELECT * FROM users WHERE username ILIKE 'cafe%';

-- PostgreSQL does not normalize on insert; normalize in application code.
-- PostgreSQL 13+ also provides normalize(text, form) and the IS NORMALIZED
-- predicate for normalizing or checking inside SQL:
--   SELECT username IS NFC NORMALIZED FROM users;

In Python with Django, normalize before saving:

from django.db import models
import unicodedata

class UserProfile(models.Model):
    name = models.CharField(max_length=200)

    def save(self, *args, **kwargs):
        self.name = unicodedata.normalize('NFC', self.name)
        super().save(*args, **kwargs)

Confusables and Security

Normalization intersects with security through Unicode confusables — characters that look visually similar but are distinct code points. For example:

  • Latin a (U+0061) vs Cyrillic а (U+0430) — identical appearance in most fonts
  • Latin o (U+006F) vs Greek ο (U+03BF)
  • Latin c vs Cyrillic с, Latin e vs Cyrillic е

NFKC normalization does NOT resolve confusables — they are distinct code points with no compatibility relationship. Confusable detection requires additional tools like the Unicode Security Mechanisms (UTS #39).
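A quick check confirms this: the Latin and Cyrillic letters below render alike in most fonts, yet no normalization form equates them.

```python
import unicodedata

latin_a = '\u0061'      # LATIN SMALL LETTER A
cyrillic_a = '\u0430'   # CYRILLIC SMALL LETTER A

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    same = (unicodedata.normalize(form, latin_a)
            == unicodedata.normalize(form, cyrillic_a))
    print(form, same)   # False for every form
```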

However, normalization does prevent one class of attack: using multiple equivalent representations of the same string to bypass filters. Pure ASCII like "<script>" has no alternate canonical form, but compatibility variants do: if a filter inspects raw input while a later component normalizes with NFKC, an attacker can smuggle fullwidth ＜ｓｃｒｉｐｔ＞ past the filter and have it folded to <script> downstream. Always normalize before processing or comparing security-sensitive strings, and apply the same normalization form at every layer.

# Safe pattern: normalize before filtering
def safe_check(user_input: str) -> bool:
    # NFKC also folds fullwidth and other compatibility variants,
    # so the check sees the same form a downstream consumer would
    normalized = unicodedata.normalize('NFKC', user_input)
    return '<script>' not in normalized.lower()

The Stability Guarantee

One final reassurance: Unicode normalization forms are stable. A string normalized to NFC in Unicode 6.0 is still valid NFC in Unicode 16.0. The Unicode Consortium guarantees that new character additions will not invalidate existing normalized strings.

This means you can normalize text once and store the normalized form, confident that it will remain normalized across Unicode version upgrades. The Unicode Version History article covers how Unicode manages these stability guarantees.
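Since Python 3.8, unicodedata.is_normalized can verify a stored string without re-normalizing it, which pairs well with a normalize-once-at-the-boundary policy:

```python
import unicodedata

stored = unicodedata.normalize('NFC', 'café résumé')

print(unicodedata.is_normalized('NFC', stored))      # True — safe to compare directly
print(unicodedata.is_normalized('NFC', 'e\u0301'))   # False — decomposed input

# Normalization is also idempotent: applying NFC again changes nothing
print(unicodedata.normalize('NFC', stored) == stored)  # True
```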

Summary

Form   Use When
NFC    Storage, display, file names (default choice for most uses)
NFD    Stripping diacritics, linguistic analysis of combining marks
NFKC   Search normalization, slug generation, fuzzy comparison
NFKD   Aggressive folding to base characters (lossy)

The key takeaway: always normalize user input to NFC before storing, indexing, or comparing. For search and fuzzy matching, consider NFKC. Never rely on raw byte comparison for Unicode strings without normalizing first.

Use our Character Counter to inspect the Unicode properties and normalization form of any text.


Next in Series: Unicode Properties and Categories: Classifying Every Character — Explore the metadata attached to every Unicode character, from General Category to Script and Bidi class.
