What Is the General Category?
The General Category is one of the most fundamental properties in the Unicode Character Database (UCD). It classifies every Unicode code point into a two-letter category code that broadly describes the character's type and intended use. The General Category is used by text processing algorithms, regular expressions, and rendering engines to determine how a character should behave.
The categories are organized into major classes (indicated by the first, uppercase letter) and subcategories (indicated by the second, lowercase letter). For example, Lu reads as "Letter, uppercase":
Category Hierarchy
Letters (L)
| Code | Name | Example |
|---|---|---|
| Lu | Uppercase Letter | A, Z, Ñ |
| Ll | Lowercase Letter | a, z, ñ |
| Lt | Titlecase Letter | ǅ |
| Lm | Modifier Letter | ʰ (superscript h) |
| Lo | Other Letter | 中, ا (no case) |
Numbers (N)
| Code | Name | Example |
|---|---|---|
| Nd | Decimal Digit Number | 0–9 |
| Nl | Letter Number | Ⅻ (Roman numeral) |
| No | Other Number | ½, ² |
Symbols (S)
| Code | Name | Example |
|---|---|---|
| Sm | Math Symbol | +, =, ∑ |
| Sc | Currency Symbol | $, €, ¥ |
| Sk | Modifier Symbol | ^, ` |
| So | Other Symbol | ©, ™, most emoji |
Punctuation (P), Marks (M), Separators (Z), and Others (C) complete the set.
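Because the first letter of every code names the major class, a coarse classification only needs that one character. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the `MAJOR_CLASSES` table and the `major_class` helper are hypothetical names introduced for illustration.

```python
import unicodedata

# Hypothetical lookup table: first letter of the code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(ch: str) -> str:
    """Return the major class name for a single character."""
    code = unicodedata.category(ch)  # two-letter code such as 'Lu'
    return MAJOR_CLASSES[code[0]]    # the first letter selects the major class


print(major_class("A"))       # Letter
print(major_class("!"))       # Punctuation
print(major_class("\u0301"))  # Mark (combining acute accent)
```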
Using General Category in Code
import unicodedata
chars = ['A', 'a', '5', '$', '中', '©', ' ', '\n']
for c in chars:
    print(f'U+{ord(c):04X} {c!r:4} {unicodedata.category(c)}')
# U+0041 'A' Lu
# U+0061 'a' Ll
# U+0035 '5' Nd
# U+0024 '$' Sc
# U+4E2D '中' Lo
# U+00A9 '©' So
# U+0020 ' ' Zs
# U+000A '\n' Cc
// ES2018+ Unicode property escapes in regex
const isLetter = /^\p{L}$/u; // any letter
const isNumber = /^\p{N}$/u; // any number
const isCurrency = /^\p{Sc}$/u; // currency symbol
const isUppercase = /^\p{Lu}$/u; // uppercase letter
console.log(isLetter.test('A')); // true
console.log(isLetter.test('中')); // true
console.log(isCurrency.test('€')); // true
console.log(isNumber.test('½')); // true
Practical Applications
Input Validation
General categories enable language-agnostic input validation. Instead of hardcoding ASCII ranges such as [A-Za-z], you can match \p{L} to accept any letter in any script, so a form correctly accepts names written in Arabic, Chinese, or Cyrillic. Many scripts also rely on combining marks, so name validation usually needs \p{M} alongside \p{L}.
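Python's built-in `re` module does not support `\p{...}` escapes, so one way to apply the same idea there is to check categories directly. This is a rough sketch, not a complete name validator: the `is_plausible_name` function and the `ALLOWED_EXTRAS` set are assumptions made for illustration, and the set of permitted separators in particular is a policy choice.

```python
import unicodedata

# Assumption: besides letters and marks, a name may contain these characters.
ALLOWED_EXTRAS = {" ", "-", "'", "\u2019"}


def is_plausible_name(text: str) -> bool:
    """Accept letters (L*) and marks (M*) from any script, plus a few separators."""
    text = text.strip()
    has_letter = False
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major == "L":
            has_letter = True
        elif major == "M" or ch in ALLOWED_EXTRAS:
            continue  # combining marks and permitted separators are fine
        else:
            return False  # digits, symbols, control characters, and so on
    return has_letter  # reject empty or separator-only input


print(is_plausible_name("Иван Петров"))  # True
print(is_plausible_name("O'Brien"))      # True
print(is_plausible_name("R2-D2"))        # False (contains digits)
```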
Word Boundary Detection
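Simple text segmentation uses general categories to find word boundaries. Characters with category Lu, Ll, Lt, Lm, Lo, or Nd are typically treated as word characters, while separators (Z) and punctuation (P) mark boundaries. The full Unicode text segmentation algorithm (UAX #29) relies on the separate Word_Break property, which is itself derived in part from the General Category.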
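The category-based rule above can be sketched as a naive tokenizer. This is an illustration only, not the UAX #29 algorithm: `WORD_CATEGORIES` and `split_words` are hypothetical names, and every character outside the listed categories is treated as a boundary.

```python
import unicodedata

# Categories treated as word characters, as listed in the text above.
WORD_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}


def split_words(text: str) -> list[str]:
    """Split text into runs of word characters; anything else is a boundary."""
    words = []
    current = []
    for ch in text:
        if unicodedata.category(ch) in WORD_CATEGORIES:
            current.append(ch)
        elif current:
            words.append("".join(current))  # a boundary closes the current word
            current = []
    if current:
        words.append("".join(current))      # flush the final word
    return words


print(split_words("Hello, world! 42"))  # ['Hello', 'world', '42']
print(split_words("can't"))             # ['can', 't']
```

The second call shows a limit of the approach: the apostrophe has category Po, so contractions are split. Combining marks (M) would also break a word here, and scripts written without spaces, such as Chinese, come back as a single run.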
Regular Expression Character Classes
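In Unicode-aware regex engines, \w is often defined roughly as [\p{L}\p{N}_] (letters, numbers, and underscore), which makes it script-independent; the exact definition varies by engine. Without Unicode awareness, \w matches only [a-zA-Z0-9_]. This is the case in JavaScript, where \w stays ASCII-only even with the u flag.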
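The difference is easy to observe in Python, where `\w` on `str` patterns is Unicode-aware by default and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`. A small sketch follows; the sample string and variable names are arbitrary, and Python's own definition of `\w` is close to, but not literally, `[\p{L}\p{N}_]`.

```python
import re

unicode_word = re.compile(r"\w+")          # Unicode-aware by default for str
ascii_word = re.compile(r"\w+", re.ASCII)  # \w limited to [a-zA-Z0-9_]

text = "na\u00efve \u041c\u043e\u0441\u043a\u0432\u0430 \u6771\u4eac user_1"  # "naïve Москва 東京 user_1"

unicode_matches = unicode_word.findall(text)
ascii_matches = ascii_word.findall(text)

print(unicode_matches)  # ['naïve', 'Москва', '東京', 'user_1']
print(ascii_matches)    # ['na', 've', 'user_1']
```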