SymbolFYI

Diacritical Mark

Typography

Definition

A mark added to a letter to change its pronunciation or meaning (e.g., acute accent é, umlaut ü, tilde ñ).

A diacritical mark (also called a diacritic) is a glyph added to a base letter to modify its phonetic value, distinguish homographs, or indicate grammatical features. The word comes from Greek diakritikos (distinguishing). Diacritics appear in virtually every alphabetic writing system in the world and represent one of the more nuanced areas of Unicode text handling.

Common Diacritical Marks

Acute accent     é  (U+0301 combining, or U+00E9 precomposed)
Grave accent     è  (U+0300 combining, or U+00E8 precomposed)
Circumflex       ê  (U+0302 combining, or U+00EA precomposed)
Umlaut/diaeresis ü  (U+0308 combining, or U+00FC precomposed)
Tilde            ñ  (U+0303 combining, or U+00F1 precomposed)
Cedilla          ç  (U+0327 combining, or U+00E7 precomposed)
Ring above       å  (U+030A combining, or U+00E5 precomposed)
Macron           ā  (U+0304 combining, or U+0101 precomposed)
Háček/caron      č  (U+030C combining, or U+010D precomposed)

Precomposed vs Decomposed Forms

Unicode offers two ways to represent the same accented character, and this is a common source of bugs:

Precomposed (NFC) -- a single code point represents the base letter plus diacritic:

é = U+00E9 (one code point)

Decomposed (NFD) -- a base letter followed by a combining diacritical mark:

é = U+0065 U+0301 (two code points: e + combining acute accent)

These look identical on screen but are different code point sequences, and therefore different bytes in any encoding. This matters for string length, indexing, comparison, sorting, and database storage:

const precomposed = '\u00E9';   // é as a single code point
const decomposed  = 'e\u0301'; // e + combining acute accent

console.log(precomposed === decomposed); // false!
console.log(precomposed.length);         // 1
console.log(decomposed.length);          // 2

// Normalize before comparing
console.log(
  precomposed.normalize('NFC') === decomposed.normalize('NFC')
); // true
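To see why the two forms differ, it can help to print the code points directly. The sketch below is illustrative only: `codePoints` and `graphemeCount` are hypothetical helper names, and `graphemeCount` assumes a runtime that provides `Intl.Segmenter` (recent browsers and Node.js versions do).

```javascript
// Hypothetical helper: list a string's code points as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in the runtime.
function graphemeCount(str) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

const precomposedE = '\u00E9';
const decomposedE = 'e\u0301';

console.log(codePoints(precomposedE)); // [ 'U+00E9' ]
console.log(codePoints(decomposedE));  // [ 'U+0065', 'U+0301' ]

// Both forms are a single character from the reader's point of view.
console.log(graphemeCount(precomposedE)); // 1
console.log(graphemeCount(decomposedE));  // 1
```

Counting grapheme clusters rather than code units is usually closer to what a user means by "length", whichever form the text is stored in.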

Unicode Normalization Forms

// `str` stands for any input string.

// NFC -- Canonical Decomposition, followed by Canonical Composition (precomposed)
str.normalize('NFC');   // é -> U+00E9

// NFD -- Canonical Decomposition only (decomposed)
str.normalize('NFD');   // é -> e + U+0301

// NFKC -- Compatibility Decomposition, then Canonical Composition
str.normalize('NFKC');  // ﬁ ligature (U+FB01) -> "fi"

// NFKD -- Compatibility Decomposition only
str.normalize('NFKD');  // ﬁ ligature -> "fi", é -> e + U+0301
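As a rough illustration of how the four forms differ, the sketch below normalizes one sample string and reports its length in UTF-16 code units under each form. The sample string and the helper name `lengthsByForm` are assumptions chosen for this example.

```javascript
// Sample: "ﬁancé" written with the ﬁ ligature (U+FB01) and a precomposed é (U+00E9).
const sample = '\uFB01anc\u00E9';

// Hypothetical helper: length of the string under each normalization form.
function lengthsByForm(str) {
  const result = {};
  for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
    result[form] = str.normalize(form).length;
  }
  return result;
}

console.log(lengthsByForm(sample));
// { NFC: 5, NFD: 6, NFKC: 6, NFKD: 7 }
```

Only the compatibility forms (NFKC, NFKD) expand the ligature into "fi", and only the decomposed forms (NFD, NFKD) split é into e plus U+0301. That is why NFKC and NFKD are lossy and are usually reserved for search or matching rather than storage.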

For most applications, normalize to NFC on input. For operations like stripping accents (creating slugs or search indexes), NFD is useful because diacritics become separate characters that can be filtered:

function removeAccents(str) {
  return str.normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}

removeAccents('Crème brûlée'); // 'Creme brulee'
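The same idea extends to slug generation. The following is a minimal sketch, not a complete transliteration scheme: `slugify` is a hypothetical name, and it uses the Unicode property escape `\p{M}` (which requires the `u` flag) to match every combining mark instead of a fixed range.

```javascript
// Hypothetical helper: build an ASCII slug by stripping combining marks.
function slugify(text) {
  return text
    .normalize('NFD')             // split letters from their combining marks
    .replace(/\p{M}/gu, '')       // drop all combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')  // collapse everything else into hyphens
    .replace(/^-+|-+$/g, '');     // trim leading and trailing hyphens
}

console.log(slugify('Crème Brûlée à la carte')); // 'creme-brulee-a-la-carte'

// Limitation: letters such as ø have no canonical decomposition,
// so NFD leaves them intact and they are replaced by hyphens here.
console.log(slugify('Smørrebrød')); // 'sm-rrebr-d'
```

Letters like ø, ł, or ß are separate letters rather than base-plus-mark combinations, so a production slugifier would need an explicit mapping table for them.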

CSS and HTML Considerations

<!-- Always declare UTF-8 to ensure diacritics render correctly -->
<meta charset="UTF-8">

<!-- HTML entities as fallback -->
&eacute;  <!-- é -->
&uuml;    <!-- ü -->
&ccedil;  <!-- ç -->
&ntilde;  <!-- ñ -->

/* Hyphenation engines rely on the HTML lang attribute for diacritic-aware
   hyphenation. lang is not a CSS property: set it in the markup, e.g.
   <html lang="fr">, or on specific elements. */
html {
  hyphens: auto;
}

Sorting and Collation

Diacritic-aware sorting varies by locale. In Swedish, ä is a separate letter that sorts after z; in German, ä sorts alongside a. The JavaScript Intl.Collator API handles these locale rules:

const words = ['über', 'apfel', 'äpfel', 'zug'];
words.sort(new Intl.Collator('de').compare);
// German collation order: ['apfel', 'äpfel', 'über', 'zug']
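The locale difference can be checked directly by comparing the same pair of letters under two collators. This sketch assumes the runtime ships full ICU locale data (the default in current browsers and Node.js releases); the variable names are illustrative.

```javascript
// Assumption: locale data for 'de' and 'sv' is available in this runtime.
const german = new Intl.Collator('de');
const swedish = new Intl.Collator('sv');

console.log(german.compare('ä', 'z') < 0);  // true: in German, ä sorts with a
console.log(swedish.compare('ä', 'z') > 0); // true: in Swedish, ä sorts after z

// sensitivity: 'base' treats letters that differ only by accent or case as equal,
// which is useful for accent-insensitive matching.
const accentInsensitive = new Intl.Collator('en', { sensitivity: 'base' });
console.log(accentInsensitive.compare('resume', 'résumé') === 0); // true
```

Using a collator for accent-insensitive comparison avoids modifying the stored text, unlike stripping marks after NFD normalization.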

Related Terms