Diacritical Marks: Understanding Accents, Umlauts, and Combining Characters
Accented letters are everywhere in written language: the é in "café," the ü in German "über," the ñ in Spanish "año," the ç in French "façade." These marks that sit above, below, or through a base letter are called diacritical marks, and they have a surprisingly complex existence in Unicode — sometimes stored as a single code point, sometimes as two, and always with implications for string comparison, search, and internationalization.
What Are Diacritical Marks?
A diacritical mark (also called a diacritic or accent mark) is a glyph added to or combined with a letter to alter its pronunciation, indicate stress, distinguish between words, or provide other linguistic information. The term comes from the Greek diakritikos, meaning "distinguishing."
Diacritics are used across most of the world's alphabetic writing systems:
| Mark | Name | Example | Languages |
|---|---|---|---|
| ´ | Acute accent | é, á, í, ó, ú | French, Spanish, Portuguese, many others |
| ` | Grave accent | è, à, ù | French, Italian |
| ^ | Circumflex | ê, â, î, ô, û | French, Romanian |
| ~ | Tilde | ñ, ã, õ | Spanish, Portuguese |
| ¨ | Umlaut / Diaeresis | ü, ö, ä, ë, ï | German, French, Turkish |
| ¸ | Cedilla | ç, ş | French, Turkish |
| ˇ | Caron (háček) | č, š, ž | Czech, Slovak, Slovenian |
| ˙ | Dot above | ż, ė | Polish, Lithuanian |
| ̣ | Dot below | ọ, ụ | Vietnamese, Igbo |
| ̄ | Macron | ā, ē, ī, ō, ū | Māori, Latin transliteration, Japanese romanization |
| ̊ | Ring above | å, ů | Swedish, Norwegian, Czech |
| ̉ | Hook above | ả, ẻ, ủ | Vietnamese |
| ̛ | Horn | ơ, ư | Vietnamese |
Precomposed vs Combining Characters
Here is the central technical fact about diacritical marks in Unicode: the same accented letter can be represented in two different ways.
Precomposed form (NFC)
A precomposed character is a single Unicode code point that encodes a base letter with its diacritic already combined. Unicode includes precomposed forms for most common accented letters in the Latin-1 Supplement (U+0080–U+00FF), Latin Extended-A (U+0100–U+017F), Latin Extended-B (U+0180–U+024F), and Latin Extended Additional (U+1E00–U+1EFF) blocks.
```
é = U+00E9 LATIN SMALL LETTER E WITH ACUTE      (1 code point)
ü = U+00FC LATIN SMALL LETTER U WITH DIAERESIS  (1 code point)
ñ = U+00F1 LATIN SMALL LETTER N WITH TILDE      (1 code point)
ç = U+00E7 LATIN SMALL LETTER C WITH CEDILLA    (1 code point)
ā = U+0101 LATIN SMALL LETTER A WITH MACRON     (1 code point)
```
Decomposed form (NFD)
A decomposed character is the same letter represented as a base character followed by one or more combining diacritical mark code points:
```
é = U+0065 (e) + U+0301 (combining acute accent)  (2 code points)
ü = U+0075 (u) + U+0308 (combining diaeresis)     (2 code points)
ñ = U+006E (n) + U+0303 (combining tilde)         (2 code points)
ç = U+0063 (c) + U+0327 (combining cedilla)       (2 code points)
```
Both forms are valid Unicode and render identically. However, they are not byte-equal, which causes many bugs.
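You can see the two representations directly by listing each form's code points (shown here with Python's standard `unicodedata` module):

```python
import unicodedata

# Name every code point in the NFC and NFD spellings of 'é'
for label, s in [('NFC', '\u00E9'), ('NFD', 'e\u0301')]:
    print(label, [f'U+{ord(ch):04X} {unicodedata.name(ch)}' for ch in s])
# NFC ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
# NFD ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
```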
The Combining Diacritical Marks Block
The primary block for combining diacritical marks is U+0300–U+036F. These 112 code points are "floating" marks that combine with the preceding base character. They are also called combining characters.
Selected combining diacritical marks:
| Code Point | Name | Example result |
|---|---|---|
| U+0300 | COMBINING GRAVE ACCENT | à |
| U+0301 | COMBINING ACUTE ACCENT | á |
| U+0302 | COMBINING CIRCUMFLEX ACCENT | â |
| U+0303 | COMBINING TILDE | ã |
| U+0304 | COMBINING MACRON | ā |
| U+0306 | COMBINING BREVE | ă |
| U+0307 | COMBINING DOT ABOVE | ȧ |
| U+0308 | COMBINING DIAERESIS | ä |
| U+0309 | COMBINING HOOK ABOVE | ả |
| U+030A | COMBINING RING ABOVE | å |
| U+030B | COMBINING DOUBLE ACUTE ACCENT | ő |
| U+030C | COMBINING CARON | ǎ |
| U+0323 | COMBINING DOT BELOW | ạ |
| U+0324 | COMBINING DIAERESIS BELOW | ṳ |
| U+0325 | COMBINING RING BELOW | ḁ |
| U+0327 | COMBINING CEDILLA | ç |
| U+0328 | COMBINING OGONEK | ą |
| U+0331 | COMBINING MACRON BELOW | ḇ |
Additional combining marks appear in other blocks: Combining Diacritical Marks Supplement (U+1DC0–U+1DFF), Combining Diacritical Marks for Symbols (U+20D0–U+20FF), and Combining Half Marks (U+FE20–U+FE2F).
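Composition also works in the other direction: given a base letter plus a combining mark from this block, NFC folds the pair into the precomposed code point where one exists. A small Python sketch:

```python
import unicodedata

# Compose base + combining mark pairs into precomposed code points via NFC
for base, mark in [('a', '\u0300'), ('c', '\u0327'), ('o', '\u0328')]:
    composed = unicodedata.normalize('NFC', base + mark)
    print(f'{base} + U+{ord(mark):04X} -> {composed} (U+{ord(composed):04X})')
# a + U+0300 -> à (U+00E0)
# c + U+0327 -> ç (U+00E7)
# o + U+0328 -> ǫ (U+01EB)
```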
Multiple combining marks
A single base character can carry multiple combining marks, which is common in Vietnamese and phonetic transcription:
```
ể = e + U+0302 (combining circumflex) + U+0309 (combining hook above)  — 3 code points in NFD
ộ = o + U+0323 (combining dot below) + U+0302 (combining circumflex)   — 3 code points in NFD
```
Note the order in the second example: NFD sorts combining marks by canonical combining class, so the dot below comes before the circumflex. A fully decomposed Vietnamese vowel can therefore take up to 3 code points (base vowel plus two combining marks: a vowel modifier and a tone mark).
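NFD also fixes the relative order of a character's combining marks: they are sorted by canonical combining class, so a mark below (class 220) precedes a mark above (class 230). A quick Python check:

```python
import unicodedata

# NFD of ộ: the dot below (class 220) sorts before the circumflex (class 230)
nfd = unicodedata.normalize('NFD', '\u1ED9')  # ộ
print([f'U+{ord(ch):04X}' for ch in nfd])         # ['U+006F', 'U+0323', 'U+0302']
print([unicodedata.combining(ch) for ch in nfd])  # [0, 220, 230]
```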
Normalization: NFC and NFD
Unicode defines four normalization forms. For diacritical marks, the two most relevant are:
NFC (Canonical Decomposition, followed by Canonical Composition): produces the shortest representation, preferring precomposed forms where they exist. This is the de facto standard on the web and in most databases and programming environments.
NFD (Canonical Decomposition): decomposes every precomposed character into base + combining marks. The older HFS+ file system on macOS stored file names in a variant of NFD, which caused many cross-platform filename bugs.
```python
import unicodedata

s1 = 'café'        # NFC (precomposed é)
s2 = 'cafe\u0301'  # NFD (e + combining acute)

# They look identical but are not equal
print(s1 == s2)  # False
print(len(s1))   # 4
print(len(s2))   # 5

# Normalize to compare
nfc1 = unicodedata.normalize('NFC', s1)
nfc2 = unicodedata.normalize('NFC', s2)
print(nfc1 == nfc2)  # True
```
The same comparison in JavaScript:

```javascript
const s1 = 'café';        // NFC
const s2 = 'cafe\u0301';  // NFD

s1 === s2                                    // false
s1.length                                    // 4
s2.length                                    // 5
s1.normalize('NFC') === s2.normalize('NFC')  // true
```
Best practice: Always normalize to NFC before comparing, storing, or searching text that may contain diacritics. Most databases and web frameworks expect NFC input.
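In practice this usually means normalizing once at the input boundary. A minimal sketch (the `casefold` step is an added assumption, useful when comparisons should also be case-insensitive):

```python
import unicodedata

def nfc_casefold(s: str) -> str:
    # Canonical form first, then case folding for case-insensitive matching
    return unicodedata.normalize('NFC', s).casefold()

print(nfc_casefold('Cafe\u0301') == nfc_casefold('caf\u00e9'))  # True
```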
Typing Diacritical Marks
macOS
macOS offers a press-and-hold method for common diacritics: hold down a vowel or consonant key (a, e, i, o, u, c, n, s, z) to see a popup of available accented variants. Click or press the number shown to insert the variant.
For dead-key input, use the US International or ABC Extended keyboard layout:
- Dead keys: ` ´ ^ ~ ¨
- Press the dead key, then the base letter: ´ + e → é
- Press the dead key twice, or the dead key followed by Space, to type the mark itself
For comprehensive access: System Settings → Keyboard → Input Sources → ABC Extended provides the widest dead-key coverage.
Windows
On Windows, the approach depends on the keyboard layout:
- US International layout: dead keys for common diacritics (AltGr/Right Alt for additional characters)
- Alt codes: hold Alt and type a numeric keypad code (e.g., Alt+0233 for é)
- Character Map (charmap.exe): browse and copy any Unicode character
- Touch keyboard: tap and hold a vowel for accent options in tablet mode
Linux
Most Linux desktop environments support the Compose key method:
```
Compose + ' + e  =  é  (acute accent)
Compose + ` + e  =  è  (grave accent)
Compose + ^ + e  =  ê  (circumflex)
Compose + " + u  =  ü  (umlaut/diaeresis)
Compose + ~ + n  =  ñ  (tilde)
Compose + , + c  =  ç  (cedilla)
```
The Compose key is typically mapped to Right Alt, Menu key, or Caps Lock via keyboard layout settings.
Stripping Diacritics in Code
Removing diacritics from text is a common need for:

- Text search normalization: allow users to search for "cafe" and find "café"
- Slug generation: convert "Ñoño García" to "nono-garcia" for a URL
- ASCII-safe output: generate filenames, identifiers, or database keys without accented characters
- Fuzzy matching: find names regardless of accent mark usage
The standard technique is to normalize to NFD, then remove all combining characters:
Python
```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose to NFD, then remove all combining marks
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in nfd if unicodedata.category(ch) != 'Mn')

strip_diacritics('café')         # 'cafe'
strip_diacritics('naïve')        # 'naive'
strip_diacritics('Ñoño García')  # 'Nono Garcia'
strip_diacritics('über')         # 'uber'
strip_diacritics('Ångström')     # 'Angstrom'
```
`unicodedata.category(ch) != 'Mn'` filters out characters whose Unicode category is "Mn" (Mark, Nonspacing) — which is exactly how combining diacritical marks are classified.
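The category codes are easy to inspect directly:

```python
import unicodedata

# General category of a base letter vs a combining mark
print(unicodedata.category('e'))       # Ll (Letter, lowercase)
print(unicodedata.category('\u0301'))  # Mn (Mark, nonspacing)
print(unicodedata.category('\u00e9'))  # Ll — precomposed é is a plain letter
```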
JavaScript
```javascript
function stripDiacritics(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

stripDiacritics('café')         // 'cafe'
stripDiacritics('naïve')        // 'naive'
stripDiacritics('Ñoño García')  // 'Nono Garcia'
```
Important caveats when stripping
Some characters defy simple stripping:

- ø (U+00F8, LATIN SMALL LETTER O WITH STROKE) — this is not base o + combining stroke; it is a distinct letter in Scandinavian alphabets, and NFD does not decompose it.
- ß (U+00DF, LATIN SMALL LETTER SHARP S) — a German letter, not a diacritic form of s.
- Đ (U+0110, LATIN CAPITAL LETTER D WITH STROKE) — a distinct letter in Vietnamese and South Slavic languages.
- æ, œ — ligatures, not base letters with diacritics.
Simple NFD-based stripping handles most Latin diacritics correctly, but for accurate transliteration (especially for names), a purpose-built transliteration library (like transliterate in Python or latinize in JavaScript) handles these edge cases.
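One pragmatic middle ground is an explicit replacement map applied before NFD stripping. The map below is an illustrative assumption, not a complete transliteration table:

```python
import unicodedata

# Illustrative map for letters NFD cannot decompose (not exhaustive)
SPECIALS = {
    'ø': 'o', 'Ø': 'O', 'ß': 'ss',
    'æ': 'ae', 'Æ': 'AE', 'œ': 'oe', 'Œ': 'OE',
    'đ': 'd', 'Đ': 'D',
}

def strip_diacritics_ext(text: str) -> str:
    # Replace the special letters first, then do the usual NFD strip
    text = ''.join(SPECIALS.get(ch, ch) for ch in text)
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in nfd if unicodedata.category(ch) != 'Mn')

print(strip_diacritics_ext('Søren'))   # Soren
print(strip_diacritics_ext('straße'))  # strasse
```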
Zalgo Text: Diacritics Gone Extreme
Unicode places no limit on how many combining characters can follow a base character. This enables "Zalgo text" — text with dozens or hundreds of combining marks stacked on each letter, creating a dramatic distortion effect:
H̷̡̤̙̖̺̦̟̣̙͚͉̳̐̈́̿̄͂̿̿̌̎͗̅͘͝ͅE̷̢͎̠̻̮̖͎͓̦̪̯̥̲̩͌̓͒̒̅́͝ͅ
Each of those "H" and "E" letters may have 20–60 combining marks applied to it. The result is valid Unicode (all those code points are legitimate) but intentionally abusive.
For applications that accept user-generated text, you may want to limit the number of consecutive combining characters:
```python
import unicodedata

def sanitize_zalgo(text: str, max_combining: int = 2) -> str:
    result = []
    combining_count = 0
    for ch in text:
        if unicodedata.category(ch) == 'Mn':
            combining_count += 1
            # Keep at most max_combining marks per base character
            if combining_count <= max_combining:
                result.append(ch)
        else:
            combining_count = 0
            result.append(ch)
    return ''.join(result)
```
Quick Reference: Common Accented Characters
| Character | Code Point | HTML Entity | Description |
|---|---|---|---|
| à | U+00E0 | `&agrave;` | a with grave |
| á | U+00E1 | `&aacute;` | a with acute |
| â | U+00E2 | `&acirc;` | a with circumflex |
| ã | U+00E3 | `&atilde;` | a with tilde |
| ä | U+00E4 | `&auml;` | a with diaeresis |
| å | U+00E5 | `&aring;` | a with ring above |
| ç | U+00E7 | `&ccedil;` | c with cedilla |
| è | U+00E8 | `&egrave;` | e with grave |
| é | U+00E9 | `&eacute;` | e with acute |
| ê | U+00EA | `&ecirc;` | e with circumflex |
| ë | U+00EB | `&euml;` | e with diaeresis |
| ñ | U+00F1 | `&ntilde;` | n with tilde |
| ö | U+00F6 | `&ouml;` | o with diaeresis |
| ü | U+00FC | `&uuml;` | u with diaeresis |
Use the SymbolFYI Unicode Lookup tool to find any diacritical mark or accented character by name or code point, and the Character Counter tool to inspect whether your string uses precomposed or decomposed forms.