SymbolFYI

General Category

Unicode Standard

Definition

A Unicode property that classifies each character (e.g., Lu = Uppercase Letter, Sm = Math Symbol, So = Other Symbol).

What Is the General Category?

The General Category is one of the most fundamental properties in the Unicode Character Database (UCD). It assigns every Unicode code point a two-letter code that broadly describes the character's type and intended use. Text processing algorithms, regular expression engines, and rendering engines consult it to decide how a character should be treated.

The categories are organized into major classes (indicated by the uppercase letter) and subcategories (indicated by the lowercase letter):

Category Hierarchy

Letters (L)

Code	Name	Example
`Lu`	Uppercase Letter	A, Z, Ñ
`Ll`	Lowercase Letter	a, z, ñ
`Lt`	Titlecase Letter	ǅ (U+01C5, a single code point)
`Lm`	Modifier Letter	ʰ (superscript h)
`Lo`	Other Letter	中, ا (no case)

Numbers (N)

Code	Name	Example
`Nd`	Decimal Digit Number	0–9
`Nl`	Letter Number	Ⅻ (Roman numeral)
`No`	Other Number	½, ²

Symbols (S)

Code	Name	Example
`Sm`	Math Symbol	+, =, ∑
`Sc`	Currency Symbol	$, €, ¥
`Sk`	Modifier Symbol	^, `
`So`	Other Symbol	©, ™, most emoji

Punctuation (P), Marks (M), Separators (Z), and Others (C) complete the set.
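Because the first letter of each code names the major class, a character's class can be read directly from its category string. The following is a minimal sketch of that idea; the `MAJOR_CLASSES` table and the helper names `major_class` and `group_by_major_class` are illustrative choices, not part of any library.

```python
import unicodedata
from collections import defaultdict

# Major-class names as listed in the hierarchy above (illustrative table).
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def major_class(ch):
    """Return the major class name for a single character."""
    # The first letter of the two-letter code identifies the major class.
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]

def group_by_major_class(text):
    """Bucket the characters of `text` by major class, in first-seen order."""
    groups = defaultdict(list)
    for ch in text:
        groups[major_class(ch)].append(ch)
    return dict(groups)

print(group_by_major_class("Aa5\u00bd$+, "))
# {'Letter': ['A', 'a'], 'Number': ['5', '½'], 'Symbol': ['$', '+'],
#  'Punctuation': [','], 'Separator': [' ']}
```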

Using General Category in Code

import unicodedata

chars = ['A', 'a', '5', '$', '中', '©', ' ', '\n']
for c in chars:
    print(f'U+{ord(c):04X}  {c!r:4}  {unicodedata.category(c)}')
# U+0041  'A'   Lu
# U+0061  'a'   Ll
# U+0035  '5'   Nd
# U+0024  '$'   Sc
# U+4E2D  '中'  Lo
# U+00A9  '©'   So
# U+0020  ' '   Zs
# U+000A  '\n'  Cc

// ES2018+ Unicode property escapes in regex
const isLetter = /^\p{L}$/u;        // any letter
const isNumber = /^\p{N}$/u;        // any number
const isCurrency = /^\p{Sc}$/u;     // currency symbol
const isUppercase = /^\p{Lu}$/u;    // uppercase letter

console.log(isLetter.test('A'));    // true
console.log(isLetter.test('中'));   // true
console.log(isCurrency.test('€'));  // true
console.log(isNumber.test('½'));    // true

Practical Applications

Input Validation

General categories enable language-agnostic input validation. Instead of hardcoding ASCII ranges, you can check \p{L} to match any letter in any script, so forms correctly accept names written in Arabic, Chinese, or Cyrillic. Names in many scripts also contain combining marks (\p{M}), so a practical pattern usually allows those as well.
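Python's built-in `re` module does not support `\p{...}` escapes, so a category-based check there can go through `unicodedata.category` instead. The sketch below is one possible name validator, not a definitive policy: the function name `is_plausible_name`, the `ALLOWED_EXTRA` set, and the decision to accept combining marks (class M) alongside letters are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical policy: a few separators that commonly appear in names.
ALLOWED_EXTRA = {" ", "-", "'", "\u2019"}

def is_plausible_name(text):
    """Accept letters and combining marks from any script, plus ALLOWED_EXTRA."""
    text = unicodedata.normalize("NFC", text.strip())
    if not text:
        return False
    has_letter = False
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major == "L":
            has_letter = True
        elif major == "M" or ch in ALLOWED_EXTRA:
            continue  # marks and permitted separators are fine
        else:
            return False  # digits, symbols, controls, other punctuation
    return has_letter  # reject input made only of separators or marks

for name in ["Jos\u00e9", "\u041c\u0430\u0440\u0438\u044f", "\u674e\u5c0f\u9f99", "O'Brien", "R2D2"]:
    print(name, is_plausible_name(name))
```

Under these assumptions the first four names pass and `R2D2` is rejected because its digits have category Nd.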

Word Boundary Detection

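Simple tokenizers can use general categories to find word boundaries: characters in categories Ll, Lu, Lt, Lm, Lo, and Nd are typically treated as word characters, while separators (Z) and punctuation (P) mark boundaries. The full word-segmentation rules in Unicode Standard Annex #29 rely on a separate Word_Break property, which is itself derived in part from the General Category.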
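The category rule described above can be sketched as a small tokenizer. This is a deliberately naive illustration, not an implementation of the Unicode word-segmentation algorithm: the names `WORD_CATEGORIES`, `is_word_char`, and `simple_words` are hypothetical, and every character outside the listed categories (including symbols and combining marks) is treated as a boundary.

```python
import unicodedata

# Categories treated as word characters, as listed in the text above.
WORD_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo", "Nd"}

def is_word_char(ch):
    return unicodedata.category(ch) in WORD_CATEGORIES

def simple_words(text):
    """Split text into runs of word characters; everything else is a boundary."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(simple_words("Hello, \u043c\u0438\u0440! 3 caf\u00e9s."))
# ['Hello', 'мир', '3', 'cafés']
```

One visible limitation of this approach: the apostrophe has category Po, so `simple_words("don't")` yields `['don', 't']`.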

Regular Expression Character Classes

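In Unicode-aware regex engines, \w is often defined as roughly [\p{L}\p{N}_] (letters, numbers, and underscore), which makes it script-independent. In engines or modes without that awareness, \w matches only [a-zA-Z0-9_]; JavaScript's \w, for example, is still based on the ASCII set even with the u flag.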
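Python's `re` module shows both behaviors side by side: `\w` is Unicode-aware by default for `str` patterns, and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`. Python's exact definition of `\w` is close to, but not necessarily identical to, `[\p{L}\p{N}_]`, so treat this as an illustration of the difference and not as a statement about every engine.

```python
import re

# "naïve 中文 café_1", written with escapes to keep the characters unambiguous.
text = "na\u00efve \u4e2d\u6587 caf\u00e9_1"

# Default for str patterns: \w matches word characters from any script.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w falls back to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', '中文', 'café_1']
print(ascii_words)    # ['na', 've', 'caf', '_1']
```

In ASCII mode the accented letters and the Chinese characters act as boundaries, which is why the words break apart.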

Related Terms