Diacritical Marks: Understanding Accents, Umlauts, and Combining Characters

Reference Mar 12, 2024


Accented letters are everywhere in written language: the é in "café," the ü in German "über," the ñ in Spanish "año," the ç in French "façade." These marks that sit above, below, or through a base letter are called diacritical marks, and they have a surprisingly complex existence in Unicode — sometimes stored as a single code point, sometimes as two, and always with implications for string comparison, search, and internationalization.

What Are Diacritical Marks?

A diacritical mark (also called a diacritic or accent mark) is a glyph added to or combined with a letter to alter its pronunciation, indicate stress, distinguish between words, or provide other linguistic information. The term comes from the Greek diakritikos, meaning "distinguishing."

Diacritics are used across most of the world's alphabetic writing systems:

Mark	Name	Example	Languages
´	Acute accent	é, á, í, ó, ú	French, Spanish, Portuguese, many others
`	Grave accent	è, à, ù	French, Italian
^	Circumflex	ê, â, î, ô, û	French, Romanian
~	Tilde	ñ, ã, õ	Spanish, Portuguese
¨	Umlaut / Diaeresis	ü, ö, ä, ë, ï	German, French, Turkish
¸	Cedilla	ç, ş	French, Turkish
ˇ	Caron (háček)	č, š, ž	Czech, Slovak, Slovenian
˙	Dot above	ż, ė	Polish, Lithuanian
̣	Dot below	ọ, ụ	Vietnamese, Igbo
̄	Macron	ā, ē, ī, ō, ū	Māori, Latin transliteration, Japanese romanization
̊	Ring above	å, ů	Swedish, Norwegian, Czech
̉	Hook above	ả, ủ	Vietnamese
̛	Horn	ơ, ư	Vietnamese

Precomposed vs Combining Characters

Here is the central technical fact about diacritical marks in Unicode: the same accented letter can be represented in two different ways.

Precomposed form (NFC)

A precomposed character is a single Unicode code point that encodes a base letter with its diacritic already combined. Unicode provides precomposed code points for most common accented letters, in the Latin-1 Supplement (U+0080–U+00FF), Latin Extended-A (U+0100–U+017F), Latin Extended-B (U+0180–U+024F), and Latin Extended Additional (U+1E00–U+1EFF) blocks.

é = U+00E9  LATIN SMALL LETTER E WITH ACUTE  (1 code point)
ü = U+00FC  LATIN SMALL LETTER U WITH DIAERESIS  (1 code point)
ñ = U+00F1  LATIN SMALL LETTER N WITH TILDE  (1 code point)
ç = U+00E7  LATIN SMALL LETTER C WITH CEDILLA  (1 code point)
ā = U+0101  LATIN SMALL LETTER A WITH MACRON  (1 code point)

Decomposed form (NFD)

A decomposed character is the same letter represented as a base character followed by one or more combining diacritical mark code points:

é = U+0065 (e) + U+0301 (combining acute accent)  (2 code points)
ü = U+0075 (u) + U+0308 (combining diaeresis)     (2 code points)
ñ = U+006E (n) + U+0303 (combining tilde)          (2 code points)
ç = U+0063 (c) + U+0327 (combining cedilla)        (2 code points)

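Both forms are valid Unicode and are canonically equivalent, so they should render identically. However, they are different code point sequences and therefore different bytes, so a naive equality check treats them as different strings, which causes many bugs.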
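One way to see which form a string uses is to list its code points. The following is a minimal sketch using Python's standard unicodedata module; the helper name code_points is made up for illustration.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

precomposed = "\u00e9"   # é stored as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(code_points(precomposed))
print(code_points(decomposed))

# The UTF-8 bytes differ as well: 2 bytes versus 3 bytes.
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))
```

Escapes are used instead of literal accented letters so the example does not depend on how the source file itself was saved.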

The Combining Diacritical Marks Block

The primary block for combining diacritical marks is U+0300–U+036F. These 112 code points are "floating" marks that combine with the preceding base character. They are also called combining characters.

Selected combining diacritical marks:

Code Point	Name	Example result
U+0300	COMBINING GRAVE ACCENT	à
U+0301	COMBINING ACUTE ACCENT	á
U+0302	COMBINING CIRCUMFLEX ACCENT	â
U+0303	COMBINING TILDE	ã
U+0304	COMBINING MACRON	ā
U+0306	COMBINING BREVE	ă
U+0307	COMBINING DOT ABOVE	ȧ
U+0308	COMBINING DIAERESIS	ä
U+0309	COMBINING HOOK ABOVE	ả
U+030A	COMBINING RING ABOVE	å
U+030B	COMBINING DOUBLE ACUTE ACCENT	ő
U+030C	COMBINING CARON	ǎ
U+0323	COMBINING DOT BELOW	ạ
U+0324	COMBINING DIAERESIS BELOW	ṳ
U+0325	COMBINING RING BELOW	ḁ
U+0327	COMBINING CEDILLA	ç
U+0328	COMBINING OGONEK	ą
U+0331	COMBINING MACRON BELOW	ḇ

Additional combining marks appear in other blocks: Combining Diacritical Marks Supplement (U+1DC0–U+1DFF), Combining Diacritical Marks for Symbols (U+20D0–U+20FF), and Combining Half Marks (U+FE20–U+FE2F).
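As a rough illustration, the block ranges named above can be turned into a small lookup. This sketch covers only the four blocks listed in this section; the names COMBINING_BLOCKS and combining_block are hypothetical, and a production check would more likely rely on the Unicode general category than on hard-coded ranges.

```python
from typing import Optional

# Inclusive ranges, copied from the blocks listed in the text above.
COMBINING_BLOCKS = [
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1DC0, 0x1DFF, "Combining Diacritical Marks Supplement"),
    (0x20D0, 0x20FF, "Combining Diacritical Marks for Symbols"),
    (0xFE20, 0xFE2F, "Combining Half Marks"),
]

def combining_block(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in COMBINING_BLOCKS:
        if start <= cp <= end:
            return name
    return None

print(combining_block("\u0301"))   # combining acute accent
print(combining_block("\u20d7"))   # combining right arrow above
print(combining_block("e"))        # a base letter, not in any listed block
print(0x036F - 0x0300 + 1)         # size of the primary block
```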

Multiple combining marks

A single base character can carry multiple combining marks, which is common in Vietnamese and phonetic transcription:

ể = e + U+0302 (circumflex) + U+0309 (hook above): 3 code points in NFD
ộ = o + U+0323 (dot below) + U+0302 (circumflex): 3 code points in NFD (canonical ordering puts the mark below the letter first)

A Vietnamese vowel letter can take up to 3 code points in decomposed form (base vowel + tone mark + additional diacritic).
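The decompositions can be checked directly. The sketch below (nfd_breakdown is an illustrative name) prints each code point of the NFD form together with its canonical combining class. NFD sorts adjacent marks by that class, so the dot below (class 220) ends up before the circumflex (class 230) regardless of the order in which they were typed.

```python
import unicodedata

def nfd_breakdown(ch: str) -> list[tuple[str, int]]:
    """Return (code point, canonical combining class) pairs for the NFD form."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c))
            for c in unicodedata.normalize("NFD", ch)]

print(nfd_breakdown("\u1ec3"))  # ể, precomposed
print(nfd_breakdown("\u1ed9"))  # ộ, precomposed

# Marks entered in a different order normalize to the same sequence.
print(unicodedata.normalize("NFD", "o\u0302\u0323") == "o\u0323\u0302")
```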

Normalization: NFC and NFD

Unicode defines four normalization forms. For diacritical marks, the two most relevant are:

NFC (Canonical Decomposition, followed by Canonical Composition) Prefers precomposed forms when they exist, which usually gives the most compact representation. This is the form the W3C recommends for web content, the form most keyboard input produces, and the form most databases expect.

NFD (Canonical Decomposition) Decomposes all precomposed characters into base + combining marks. HFS+, the older macOS file system, stores file names in a variant of NFD (which caused many cross-platform filename bugs).

Python

import unicodedata

s1 = 'caf\u00e9'              # NFC (precomposed é)
s2 = 'cafe\u0301'             # NFD (e + combining acute)

# They look identical but are not equal
print(s1 == s2)               # False
print(len(s1))                # 4
print(len(s2))                # 5

# Normalize to compare
nfc1 = unicodedata.normalize('NFC', s1)
nfc2 = unicodedata.normalize('NFC', s2)
print(nfc1 == nfc2)           # True

JavaScript

const s1 = 'caf\u00e9';      // NFC (precomposed é)
const s2 = 'cafe\u0301';     // NFD

s1 === s2                    // false
s1.length                    // 4
s2.length                    // 5

s1.normalize('NFC') === s2.normalize('NFC')  // true

Best practice: Always normalize to NFC before comparing, storing, or searching text that may contain diacritics. Most databases and web frameworks expect NFC input.
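A small helper makes this practice hard to forget. This is a sketch, not a prescribed API: the names nfc and same_text are invented here, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return text in NFC, skipping the rewrite when it is already normalized."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

def same_text(a: str, b: str) -> bool:
    """Compare two strings by canonical equivalence rather than raw code points."""
    return nfc(a) == nfc(b)

print(same_text("caf\u00e9", "cafe\u0301"))   # precomposed vs decomposed
print(same_text("cafe", "caf\u00e9"))         # genuinely different text
print(len(nfc("cafe\u0301")))                 # length after normalization
```

Normalizing once at the boundary (on input, before storage) tends to be simpler than normalizing at every comparison site.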

Typing Diacritical Marks

macOS

macOS offers a press-and-hold method for common diacritics: hold down a vowel or consonant key (a, e, i, o, u, c, n, s, z) to see a popup of available accented variants. Click or press the number shown to insert the variant.

For dead-key input, use the US International or ABC Extended keyboard layout:
- Dead keys: ` ´ ^ ~ ¨
- Press the dead key, then the base letter: ´ + e → é
- Press the dead key twice or followed by Space to get the mark itself

For comprehensive access: System Settings → Keyboard → Input Sources → ABC Extended provides the widest dead-key coverage.

Windows

On Windows, the approach depends on the keyboard layout:

US International layout: Dead keys for common diacritics (Right Alt for additional chars)
Alt codes: Hold Alt + numeric keypad codes (e.g., Alt+0233 for é)
Character Map (charmap.exe): Browse and copy any Unicode character
Touch keyboard: Tap and hold on vowels for accent options in tablet mode

Linux

Most Linux desktop environments support the Compose key method:

Compose + ' + e = é  (acute accent)
Compose + ` + e = è  (grave accent)
Compose + ^ + e = ê  (circumflex)
Compose + " + u = ü  (umlaut/diaeresis)
Compose + ~ + n = ñ  (tilde)
Compose + , + c = ç  (cedilla)

The Compose key is typically mapped to Right Alt, Menu key, or Caps Lock via keyboard layout settings.

Stripping Diacritics in Code

Removing diacritics from text is a common need for:
- Text search normalization: Allow users to search for "cafe" and find "café"
- Slug generation: Convert "Ñoño García" to "nono-garcia" for a URL
- ASCII-safe output: Generate filenames, identifiers, or database keys without accented characters
- Fuzzy matching: Find names regardless of accent mark usage

The standard technique is to normalize to NFD, then remove all combining characters:

Python

import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose to NFD, then remove all combining marks
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in nfd if unicodedata.category(ch) != 'Mn')

strip_diacritics('café')         # 'cafe'
strip_diacritics('naïve')        # 'naive'
strip_diacritics('Ñoño García')  # 'Nono Garcia'
strip_diacritics('über')         # 'uber'
strip_diacritics('Ångström')     # 'Angstrom'

The check unicodedata.category(ch) != 'Mn' filters out characters whose Unicode general category is "Mark, Nonspacing" (Mn), which is how the combining diacritical marks in U+0300–U+036F are classified. Spacing marks (Mc) and enclosing marks (Me) are left in place.
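Two of the use cases listed earlier, accent-insensitive search and slug generation, can be built on the same stripping step. The sketch below is one possible shape. The names strip_marks, search_key, and slugify are illustrative, and the slug rules (lowercase ASCII letters and digits joined by hyphens) are an assumption rather than a standard.

```python
import re
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop nonspacing marks (category Mn)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")

def search_key(text: str) -> str:
    """Accent- and case-insensitive key for matching, not for display."""
    return strip_marks(text).casefold()

def slugify(text: str) -> str:
    """Lowercase ASCII slug; anything that is not a-z or 0-9 becomes a hyphen."""
    ascii_text = strip_marks(text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")

print("cafe" in search_key("Le Caf\u00e9 de Paris"))
print(slugify("\u00d1o\u00f1o Garc\u00eda"))   # Ñoño García
```

Apply search_key to both the query and the indexed text, and keep the original string for display.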

JavaScript

function stripDiacritics(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

stripDiacritics('café')         // 'cafe'
stripDiacritics('naïve')        // 'naive'
stripDiacritics('Ñoño García')  // 'Nono Garcia'

Important caveats when stripping

Some characters defy simple stripping:
- ø (U+00F8, LATIN SMALL LETTER O WITH STROKE): this is not base o + combining stroke; it is a distinct letter in Scandinavian alphabets. NFD does not decompose it.
- ß (U+00DF, LATIN SMALL LETTER SHARP S): a German letter, not a diacritic form of s.
- Đ (U+0110, LATIN CAPITAL LETTER D WITH STROKE): a distinct letter in Vietnamese and South Slavic languages.
- æ, œ: ligatures, not base letters with diacritics.

Simple NFD-based stripping handles most Latin diacritics correctly, but for accurate transliteration (especially for names), a purpose-built transliteration library (like transliterate in Python or latinize in JavaScript) handles these edge cases.
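When pulling in a library is not an option, a small hand-written fallback table applied before the NFD step covers the letters listed above. The table below is hypothetical and deliberately incomplete, and the chosen replacements (for example ß to "ss") are conventions that vary by language, so treat this as a sketch rather than a transliteration scheme.

```python
import unicodedata

# Hypothetical fallback table for letters that NFD leaves untouched.
FALLBACK = str.maketrans({
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
    "œ": "oe", "Œ": "OE",
})

def to_ascii_approx(text: str) -> str:
    """Apply the fallback table, then strip nonspacing marks via NFD."""
    replaced = text.translate(FALLBACK)
    nfd = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")

# NFD alone leaves ø unchanged, which is why the table is needed.
print(unicodedata.normalize("NFD", "\u00f8") == "\u00f8")
print(to_ascii_approx("S\u00f8ren"))
print(to_ascii_approx("Stra\u00dfe"))
```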

Zalgo Text: Diacritics Gone Extreme

Combining characters have no limit on how many can follow a base character. This enables "Zalgo text" — text with dozens or hundreds of combining marks stacked on each letter, creating a dramatic distortion effect:

H̷̡̤̙̖̺̦̟̣̙͚͉̳̐̈́̿̄͂̿̿̌̎͗̅͘͝ͅE̷̢͎̠̻̮̖͎͓̦̪̯̥̲̩͌̓͒̒̅́͝ͅ

Each of those "H" and "E" letters may have 20–60 combining marks applied to it. The result is valid Unicode (all those code points are legitimate) but intentionally abusive.

For applications that accept user-generated text, you may want to limit the number of consecutive combining characters:

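import unicodedata

def sanitize_zalgo(text: str, max_combining: int = 2) -> str:
    """Keep at most max_combining nonspacing marks after each base character."""
    result = []
    combining_count = 0  # marks seen since the last non-mark character
    for ch in text:
        if unicodedata.category(ch) == 'Mn':
            combining_count += 1
            # Drop marks beyond the allowed run length
            if combining_count <= max_combining:
                result.append(ch)
        else:
            # Any other character starts a new run
            combining_count = 0
            result.append(ch)
    return ''.join(result)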
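If rejecting suspicious input is preferable to silently rewriting it, a detection pass may be enough. The sketch below measures the longest run of consecutive marks. The name longest_mark_run and any threshold applied to its result are assumptions to tune per application. Unlike the function above, it counts every mark category (Mn, Mc, Me).

```python
import unicodedata

def longest_mark_run(text: str) -> int:
    """Return the length of the longest run of consecutive combining marks."""
    longest = current = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

print(longest_mark_run("cafe\u0301"))           # ordinary decomposed text
print(longest_mark_run("e" + "\u0301" * 10))    # ten stacked accents
print(longest_mark_run("abc"))                  # no marks at all
```

Normalizing to NFC before measuring keeps legitimate decomposed text, such as Vietnamese in NFD, from being counted against the limit.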

Quick Reference: Common Accented Characters

Character	Code Point	HTML Entity	Description
à	U+00E0	`&agrave;`	a with grave
á	U+00E1	`&aacute;`	a with acute
â	U+00E2	`&acirc;`	a with circumflex
ã	U+00E3	`&atilde;`	a with tilde
ä	U+00E4	`&auml;`	a with diaeresis
å	U+00E5	`&aring;`	a with ring above
ç	U+00E7	`&ccedil;`	c with cedilla
è	U+00E8	`&egrave;`	e with grave
é	U+00E9	`&eacute;`	e with acute
ê	U+00EA	`&ecirc;`	e with circumflex
ë	U+00EB	`&euml;`	e with diaeresis
ñ	U+00F1	`&ntilde;`	n with tilde
ö	U+00F6	`&ouml;`	o with diaeresis
ü	U+00FC	`&uuml;`	u with diaeresis
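Named HTML entities such as &eacute; can also be looked up programmatically. The sketch below uses Python's standard html and html.entities modules; entity_for is an illustrative helper, and it falls back to a numeric character reference when codepoint2name (which covers the HTML 4 entity set) has no name for the character.

```python
import html
from html.entities import codepoint2name

def entity_for(ch: str) -> str:
    """Return a named entity when one is known, else a hex character reference."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        return f"&{name};"
    return f"&#x{ord(ch):X};"

print(entity_for("\u00e9"))        # é has a named entity
print(entity_for("\u0101"))        # ā falls back to a numeric reference
print(html.unescape("&ntilde;"))   # decoding goes the other way
```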

Use the SymbolFYI Unicode Lookup tool to find any diacritical mark or accented character by name or code point, and the Character Counter tool to inspect whether your string uses precomposed or decomposed forms.