Glossary
Unicode, encoding, and typography terms explained.
Code Point vs Character vs Glyph
Understanding the three abstraction levels: a code point (number), a character (abstract), and a glyph (visual rendering).
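As a rough illustration of the first two levels, the sketch below lists the code points behind a string using Python's standard `unicodedata` module. The helper name `describe` is hypothetical. The glyph level is not visible in code: it is whatever the font draws for the sequence.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, character name) for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

precomposed = "\u00e9"   # one code point
decomposed = "e\u0301"   # two code points: base letter + combining accent

print(describe(precomposed))
print(describe(decomposed))
# Both strings are normally drawn as the same glyph, and NFC
# normalization maps the decomposed form to the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```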
Programming & Dev
Encoding Detection
Techniques for detecting the character encoding of text files, including BOM sniffing, heuristics, and libraries such as chardet.
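A minimal sketch of BOM sniffing with a single heuristic fallback, using only the standard library. The function name `sniff_encoding` and the `latin-1` fallback are assumptions made for illustration. Statistical detection as done by chardet is not shown.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because
# the UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess a codec name: BOM first, then a strict UTF-8 trial decode."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name  # these codecs consume the BOM when decoding
    try:
        data.decode("utf-8")   # heuristic: valid UTF-8 is rarely accidental
        return "utf-8"
    except UnicodeDecodeError:
        return fallback        # assumption: legacy single-byte text
```

The fallback is a guess, not a detection result; callers that need more confidence would have to inspect the content itself.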
Programming & Dev
Grapheme Segmentation (UAX #29)
The Unicode algorithm for splitting text into user-perceived characters, handling emoji sequences, combining marks, etc.
Programming & Dev
IDN Homograph Attack
A phishing technique using visually similar Unicode characters in domain names to impersonate legitimate sites.
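One common warning sign is a label that mixes scripts. The sketch below approximates each letter's script by the first word of its Unicode character name, which is a simplifying assumption: Python's standard library does not expose the Script property, and production checks typically follow the Unicode security mechanisms in UTS #39. The helper names are hypothetical.

```python
import unicodedata

def scripts_used(label: str) -> set[str]:
    """Approximate the scripts in a label via character-name prefixes."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one script."""
    return len(scripts_used(label)) > 1

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(looks_mixed_script(genuine))
print(looks_mixed_script(spoofed))
```

This heuristic does not catch spoofs written entirely in one non-Latin script.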
Programming & Dev
JavaScript String & Code Points
JS String methods for Unicode: codePointAt(), String.fromCodePoint(), and the spread operator, which iterates by code point rather than by UTF-16 code unit (but not by grapheme).
Programming & Dev
Python unicodedata Module
Python standard library module for looking up Unicode character names, categories, and properties.
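A short sketch of typical lookups. The wrapper `char_info` is a hypothetical helper; the `unicodedata` functions it calls are part of the standard library.

```python
import unicodedata

def char_info(ch: str) -> dict:
    """Collect a few Unicode properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),       # e.g. "Lu", "No", "Mn"
        "numeric": unicodedata.numeric(ch, None),   # None if not numeric
    }

print(char_info("\u00bd"))   # the one-half fraction character
print(char_info("A"))

# Lookup also works in the other direction, from name to character.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(alpha)
```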
Programming & Dev
Regex Unicode Support
Using Unicode-aware regular expressions with the u flag in JavaScript and re.UNICODE in Python (already the default for str patterns in Python 3).
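For the Python side, a small sketch of how the default Unicode matching differs from ASCII-only matching. The sample text is made up for illustration.

```python
import re

text = "na\u00efve caf\u00e9 \u6771\u4eac"  # "naïve café 東京"

# In Python 3, \w on a str pattern matches Unicode word characters.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```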
Programming & Dev
String Length vs Character Count
Why str.length in JavaScript counts UTF-16 code units rather than user-perceived characters, and how to count graphemes correctly.
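The three counts can be compared from Python as a sketch. `utf16_units` mirrors what JavaScript's length reports. `approx_graphemes` is a deliberately rough stand-in that only skips combining marks; it is not UAX #29 segmentation and will miscount emoji ZWJ sequences, flags, and Hangul jamo. All three helper names are hypothetical.

```python
import unicodedata

def utf16_units(s: str) -> int:
    """Number of UTF-16 code units (what JavaScript's length returns)."""
    return len(s.encode("utf-16-le")) // 2

def code_points(s: str) -> int:
    """Number of code points (what Python's len() returns for str)."""
    return len(s)

def approx_graphemes(s: str) -> int:
    """Rough count: code points that are not combining marks (M*)."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

for sample in ["abc", "e\u0301", "\U0001F600"]:
    print(utf16_units(sample), code_points(sample), approx_graphemes(sample))
```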
Programming & Dev
Unicode Collation
Sorting text according to language-specific rules using the Unicode Collation Algorithm (UCA, UTS #10).
Programming & Dev
Unicode Property Escapes (\p{})
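Regex syntax (\p{Script=Greek}, \p{Letter}) that matches characters by Unicode properties. Supported in JavaScript (with the u or v flag) and Java. Python's built-in re module does not support it; the third-party regex package does.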
Programming & Dev
Unicode Sandwich Pattern
A programming best practice: decode bytes → process text as Unicode → encode bytes. Keeps Unicode in the middle.
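A minimal sketch of the pattern. The function `shout` and the choice of Latin-1 input and UTF-8 output are assumptions made for illustration.

```python
def shout(raw: bytes, src: str = "latin-1", dst: str = "utf-8") -> bytes:
    """Upper-case text that arrives and leaves as bytes."""
    text = raw.decode(src)    # input boundary: bytes -> str
    text = text.upper()       # middle: work only with Unicode text
    return text.encode(dst)   # output boundary: str -> bytes

result = shout(b"caf\xe9")    # "café" encoded as Latin-1
print(result)
```

Because the processing step sees only `str`, it behaves the same regardless of which encodings are used at the edges.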
Programming & Dev
Unicode in URLs & IRIs
How Unicode characters in URLs are handled: IRI (RFC 3987), percent-encoding of UTF-8 bytes, and browser display.
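A sketch of the two conversions usually involved when an IRI becomes a URI: percent-encoding the UTF-8 bytes of the path, and converting the host to its ASCII (Punycode) form. The helper `iri_parts_to_uri` is hypothetical, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat the host handling as illustrative only.

```python
from urllib.parse import quote, unquote

def iri_parts_to_uri(host: str, path: str) -> tuple[str, str]:
    """Convert an IRI host and path into ASCII-only URI components."""
    ascii_host = host.encode("idna").decode("ascii")  # Punycode labels
    ascii_path = quote(path)   # percent-encodes UTF-8 bytes, keeps "/"
    return ascii_host, ascii_path

host, path = iri_parts_to_uri("m\u00fcnchen.de", "/wiki/caf\u00e9")
print(host, path)

# Browsers typically show the decoded form again in the address bar.
print(unquote(path))
```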
Programming & Dev