Unicode Properties and Categories: Classifying Every Character
Every Unicode character carries metadata far beyond its visual appearance. Each of the 154,000+ assigned characters has dozens of properties: its general category, which script it belongs to, whether it is a letter or digit, how it behaves in bidirectional text, whether it is case-foldable, and much more. These properties are the engine behind internationalized text processing — the reason a regex like \p{Letter} matches क and ш and ψ and 가 as naturally as it matches A.
The Unicode Character Database
All character properties are defined in the Unicode Character Database (UCD), a set of machine-readable data files maintained by the Unicode Consortium. The UCD is the authoritative source that programming languages, operating systems, and text processing libraries implement.
The most important data files include:
- UnicodeData.txt — basic properties for every character (name, category, numeric values, decomposition)
- Scripts.txt — which script each character belongs to
- DerivedCoreProperties.txt — derived properties like "Alphabetic" and "ID_Start"
- PropList.txt — miscellaneous binary properties
- CaseFolding.txt — case folding data
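- extracted/DerivedBidiClass.txt — Bidi_Class values for every code point (the same data also appears as a field in UnicodeData.txt)
You can download all of these files from the Unicode website. Most programming languages expose a subset of the properties through their standard library or regex engine.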
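To make the file format concrete, here is a minimal sketch of reading UnicodeData.txt records. Each record is one line of 15 semicolon-separated fields. The three sample lines are embedded so the example runs offline, and the dictionary keys are illustrative names chosen for this sketch, not official identifiers.

```python
import unicodedata

# Three records in UnicodeData.txt format: 15 semicolon-separated fields each.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
"""

def parse_unicode_data(text):
    """Yield one dict per record (key names are illustrative)."""
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(';')
        yield {
            'code_point': int(fields[0], 16),
            'name': fields[1],
            'general_category': fields[2],
            'bidi_class': fields[4],
            'decomposition': fields[5],
            'numeric_value': fields[8],
        }

records = list(parse_unicode_data(SAMPLE))
for rec in records:
    ch = chr(rec['code_point'])
    # Cross-check the parsed category against Python's built-in copy of the UCD.
    assert rec['general_category'] == unicodedata.category(ch)
    print(f"U+{rec['code_point']:04X} {rec['general_category']} {rec['name']}")
```

In practice you rarely need to parse the file yourself, but knowing the layout helps when a library does not expose the field you need.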
General Category
The General Category is the most fundamental character property. Every Unicode character is assigned exactly one General Category, which classifies its basic type.
Category Table
| Code | Name | Examples |
|---|---|---|
| Letter | ||
| Lu | Uppercase Letter | A, B, Ñ, Ü, Ж |
| Ll | Lowercase Letter | a, b, ñ, ü, ж |
| Lt | Titlecase Letter | Dž, Lj (digraphs) |
| Lm | Modifier Letter | ʰ, ʲ (phonetic modifiers) |
| Lo | Other Letter | 中, あ, 가, ع (letters without case) |
| Mark | ||
| Mn | Nonspacing Mark | Combining accent ́, combining diaeresis ̈ |
| Mc | Spacing Combining Mark | Devanagari vowel signs |
| Me | Enclosing Mark | Combining enclosing circle |
| Number | ||
| Nd | Decimal Digit Number | 0–9, ٠–٩ (Arabic-Indic), ०–९ (Devanagari) |
| Nl | Letter Number | Ⅳ, ⅿ (Roman numerals in letter form) |
| No | Other Number | ², ½, ② |
| Punctuation | ||
| Pc | Connector Punctuation | _ (underscore) |
| Pd | Dash Punctuation | -, –, — |
| Ps | Open Punctuation | (, [, { |
| Pe | Close Punctuation | ), ], } |
| Pi | Initial Quote Punctuation | “, ‘, « (opening quotes) |
| Pf | Final Quote Punctuation | ”, ’, » (closing quotes) |
| Po | Other Punctuation | !, ?, ., , |
| Symbol | ||
| Sm | Math Symbol | +, =, ×, ∑, ∫ |
| Sc | Currency Symbol | $, €, ¥, £, ₿ |
| Sk | Modifier Symbol | ^, `, ¨ (spacing accents and diacritics) |
| So | Other Symbol | ©, ®, ♠, ✓, 😀 |
| Separator | ||
| Zs | Space Separator | Space, non-breaking space, em space |
| Zl | Line Separator | U+2028 |
| Zp | Paragraph Separator | U+2029 |
| Other | ||
| Cc | Control | Tab, newline, null, DEL |
| Cf | Format | Zero-width joiner, BOM, soft hyphen |
| Cs | Surrogate | U+D800–U+DFFF |
| Co | Private Use | U+E000–U+F8FF, planes 15–16 |
| Cn | Unassigned | All unassigned code points |
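The first letter of each two-letter code names the major class, so grouping by class is a one-character lookup. The sketch below shows this with the standard unicodedata module. The MAJOR_CLASS mapping and the major_class helper are names invented for this example.

```python
import unicodedata

# First letter of the two-letter General Category code -> major class name.
MAJOR_CLASS = {
    'L': 'Letter', 'M': 'Mark', 'N': 'Number', 'P': 'Punctuation',
    'S': 'Symbol', 'Z': 'Separator', 'C': 'Other',
}

def major_class(ch):
    """Return the major class name for a single character."""
    return MAJOR_CLASS[unicodedata.category(ch)[0]]

# Letter, combining accent, Arabic-Indic digit, em dash, euro, NBSP, ZWJ
for ch in ['A', '\u0301', '\u0663', '\u2014', '\u20AC', '\u00A0', '\u200D']:
    print(f'U+{ord(ch):04X} {unicodedata.category(ch)} {major_class(ch)}')
```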
Accessing General Category in Python
import unicodedata
chars = ['A', 'a', '3', '中', '!', '+', '$', '😀', ' ', '\u0301']
for c in chars:
    cat = unicodedata.category(c)
    name = unicodedata.name(c, '(no name)')
    print(f'U+{ord(c):04X} {cat} {name}')
# U+0041 Lu LATIN CAPITAL LETTER A
# U+0061 Ll LATIN SMALL LETTER A
# U+0033 Nd DIGIT THREE
# U+4E2D Lo CJK UNIFIED IDEOGRAPH-4E2D
# U+0021 Po EXCLAMATION MARK
# U+002B Sm PLUS SIGN
# U+0024 Sc DOLLAR SIGN
# U+1F600 So GRINNING FACE
# U+0020 Zs SPACE
# U+0301 Mn COMBINING ACUTE ACCENT
Using Categories for Text Processing
import unicodedata
def count_by_category(text):
    """Count characters by their Unicode general category."""
    counts = {}
    for c in text:
        cat = unicodedata.category(c)
        counts[cat] = counts.get(cat, 0) + 1
    return counts
sample = "Hello, 世界! 42 + $3.14 = π"
print(count_by_category(sample))
# {'Lu': 1, 'Ll': 5, 'Po': 3, 'Zs': 6, 'Lo': 2, 'Nd': 5,
#  'Sm': 2, 'Sc': 1}
# Extract only letters
letters_only = ''.join(c for c in sample
                       if unicodedata.category(c).startswith('L'))
# 'Hello世界π'
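Another common use of categories is stripping accents: decompose the text, then drop every character whose category is Mn. The following is a minimal sketch of that idea. The strip_marks name is made up for this example, and the approach is lossy by design, so it suits search keys better than display text.

```python
import unicodedata

def strip_marks(text):
    """Remove nonspacing marks (Mn) after canonical decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    kept = (c for c in decomposed if unicodedata.category(c) != 'Mn')
    return unicodedata.normalize('NFC', ''.join(kept))

print(strip_marks('Café Ñandú über'))  # Cafe Nandu uber
```

Be careful with scripts where Mn marks carry essential meaning (many Indic vowel signs, for instance): removing them changes the words rather than just the accents.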
The Script Property
While General Category tells you what a character is, the Script property tells you which writing system it belongs to. The Script property is essential for language detection, font selection, and text segmentation.
Common script values include: Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Georgian, Hangul, Hiragana, Katakana, Han (CJK), and roughly 150 more; Unicode 16.0 covers 168 scripts in total.
Some characters have script value Common (punctuation, digits, symbols used across scripts) or Inherited (combining marks that inherit the script of the base character they attach to).
# Python 3.x doesn't expose Script directly from unicodedata
# Use the 'regex' package for script-aware matching
import regex
# Match any Arabic letter
arabic_pattern = regex.compile(r'\p{Script=Arabic}+')
text = "مرحبا بالعالم Hello مرحبا"
matches = arabic_pattern.findall(text)
print(matches) # ['مرحبا', 'بالعالم', 'مرحبا']
# Match any Han (CJK) character
han_pattern = regex.compile(r'\p{Script=Han}+')
text2 = "Hello 世界 World 日本語"
print(han_pattern.findall(text2)) # ['世界', '日本語']
Script in JavaScript
ECMAScript 2018 introduced Unicode property escapes in regular expressions. The \p{...} syntax gives direct access to Unicode properties:
// Script matching
const arabicRE = /\p{Script=Arabic}+/gu;
const text = "مرحبا بالعالم Hello";
console.log(text.match(arabicRE)); // ['مرحبا', 'بالعالم']
// General category matching
const letterRE = /\p{Letter}+/gu;
const text2 = "Hello 42 世界 3.14";
console.log(text2.match(letterRE)); // ['Hello', '世界']
// Decimal digits in any script
const digitRE = /\p{Decimal_Number}+/gu;
const text3 = "42 ١٢٣ ৪৫৬"; // ASCII, Arabic-Indic, Bengali
console.log(text3.match(digitRE)); // ['42', '١٢٣', '৪৫৬']
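The u flag (or the newer v flag) is required for \p{...} to work in JavaScript regular expressions, and the g flag makes match() return every match instead of only the first. Without u, \p is just an escaped letter p, so /\p{Letter}/ matches the literal text "p{Letter}".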
Derived Properties
Beyond the basic UCD properties, Unicode defines a number of derived properties computed from combinations of basic properties. These are the ones you most commonly use in practice:
Alphabetic
A character is Alphabetic if it is a letter or a combining mark that appears in words. This includes:
- All Ll, Lu, Lt, Lm, Lo characters
- Some Mn/Mc combining marks that function as letters in their scripts
- Letter numbers (Nl, such as Roman numerals)
// \p{Alphabetic} matches letters across all scripts
const alphaRE = /\p{Alphabetic}+/gu;
"café αβγ مرحبا 漢字".match(alphaRE);
// ['café', 'αβγ', 'مرحبا', '漢字']
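Python's str.isalpha() is a useful contrast. It checks only the letter categories (Lu, Ll, Lt, Lm, Lo), not the Alphabetic property, so it rejects some characters that \p{Alphabetic} accepts. A small sketch of the difference, using two characters that are Alphabetic but not letters:

```python
import unicodedata

samples = {
    'a': 'LATIN SMALL LETTER A',
    '\u2163': 'ROMAN NUMERAL FOUR (Nl, Alphabetic)',
    '\u093E': 'DEVANAGARI VOWEL SIGN AA (Mc, Alphabetic)',
}
for ch, label in samples.items():
    print(f'U+{ord(ch):04X} category={unicodedata.category(ch)} '
          f'isalpha={ch.isalpha()}  # {label}')
```

A practical consequence is that a Hindi word containing a vowel sign fails isalpha() as a whole, which is worth knowing before using it for input validation.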
ID_Start and ID_Continue
These properties define which characters can appear at the start or continuation of an identifier — used by programming language lexers to determine valid variable names.
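import unicodedata
# Rough check: can this character start a Python identifier?
# This approximates ID_Start from General Category alone and misses the
# few Other_ID_Start characters; prefer str.isidentifier() in real code.
def can_start_identifier(c):
    cat = unicodedata.category(c)
    return cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl') or c == '_'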
# Python identifiers can use Unicode letters
# This is valid Python 3:
# π = 3.14159
# 変数 = "variable"
# café = "coffee shop"
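For whole names, the built-in str.isidentifier() applies Python's own identifier rules, so there is no need to reimplement them. The sketch below compares a few candidate names. U+2118 (SCRIPT CAPITAL P) is included as an example of a character that is not in a letter category but is still allowed, to the best of my knowledge because it is listed under Other_ID_Start.

```python
import unicodedata

candidates = ['π', '変数', 'café', '_tmp', '2x', 'a-b', '\u2118']
for name in candidates:
    first_cat = unicodedata.category(name[0])
    print(f'{name!r:10} first={first_cat} valid={name.isidentifier()}')
```

Note that isidentifier() does not reject keywords such as "class"; combine it with keyword.iskeyword() if that matters.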
Uppercase, Lowercase, Cased
import unicodedata
# unicodedata exposes the General Category; the str methods below check case
print(unicodedata.category('A')) # Lu (Uppercase Letter)
print(unicodedata.category('a')) # Ll (Lowercase Letter)
# Check case properties
print('A'.isupper()) # True
print('a'.islower()) # True
print('中'.isupper()) # False
print('中'.islower()) # False — CJK has no case
# Case conversion works across scripts
print('ñoño'.upper()) # 'ÑOÑO'
print('ΑΘΗΝΑ'.lower()) # 'αθηνα' (Greek)
print('МОСКВА'.lower()) # 'москва' (Russian)
White_Space
import unicodedata
# Not all whitespace is U+0020!
spaces = [
    '\u0020',  # SPACE
    '\u00A0',  # NO-BREAK SPACE
    '\u2003',  # EM SPACE
    '\u2009',  # THIN SPACE
    '\u200B',  # ZERO WIDTH SPACE (category Cf, NOT White_Space!)
    '\u3000',  # IDEOGRAPHIC SPACE (fullwidth)
]
for s in spaces:
    print(f'U+{ord(s):04X} {unicodedata.name(s)}: '
          f'category={unicodedata.category(s)}')
# Python's str.split() and str.strip() treat every character for which
# str.isspace() is true as whitespace, a set close to Unicode White_Space,
# so they handle all Unicode spaces, not just ASCII space
"hello\u2003world".split()  # ['hello', 'world']
Numeric Value
Characters in categories Nd, Nl, and No carry a numeric value property. This allows cross-script arithmetic:
import unicodedata
digits = ['5', '٥', '৫', '๕', '\uAA55', '⑤']  # '\uAA55' is CHAM DIGIT FIVE
for d in digits:
    dec = unicodedata.decimal(d, None)
    nval = unicodedata.numeric(d, None)
    print(f'{d} decimal={dec} numeric={nval} '
          f'name={unicodedata.name(d)}')
# 5 decimal=5 numeric=5.0 name=DIGIT FIVE
# ٥ decimal=5 numeric=5.0 name=ARABIC-INDIC DIGIT FIVE
# ৫ decimal=5 numeric=5.0 name=BENGALI DIGIT FIVE
# ๕ decimal=5 numeric=5.0 name=THAI DIGIT FIVE
# ꩕ decimal=5 numeric=5.0 name=CHAM DIGIT FIVE
# ⑤ decimal=None numeric=5.0 name=CIRCLED DIGIT FIVE (not a decimal digit)
Note the distinction:
- decimal(): Returns a value only for true decimal digits (Nd). With the None default used above, it returns None for circled numbers, Roman numerals, etc.
- digit(): Also covers digit-like characters such as ⑤ and ², returning 5 and 2, but still has no value for Roman numerals or fractions.
- numeric(): Returns numeric value for all numeric characters including fractions (½ = 0.5), Roman numerals (Ⅳ = 4.0), and other number forms.
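As a sketch of cross-script number handling, the function below turns a run of decimal digits from any script into an int using unicodedata.decimal(). The parse_decimal name is hypothetical. Python's built-in int() already accepts such digits, so the manual loop is here only to show the mechanism.

```python
import unicodedata

def parse_decimal(text):
    """Convert a run of decimal digits (any script) to an int."""
    value = 0
    for ch in text:
        # decimal() raises ValueError for anything outside category Nd.
        value = value * 10 + unicodedata.decimal(ch)
    return value

arabic_indic_42 = '\u0664\u0662'  # ARABIC-INDIC DIGIT FOUR, DIGIT TWO
bengali_42 = '\u09EA\u09E8'       # BENGALI DIGIT FOUR, DIGIT TWO

print(parse_decimal(arabic_indic_42))  # 42
print(parse_decimal(bengali_42))       # 42
print(int(arabic_indic_42))            # 42, since int() also accepts Nd digits
```

This sketch does not reject strings that mix digits from different scripts. A stricter parser would likely want to, since mixed-script numbers are a known spoofing vector.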
Bidi Class
The Bidi Class (Bidirectional Category) determines how a character participates in the Unicode Bidirectional Algorithm. Common values:
| Class | Description | Examples |
|---|---|---|
| L | Left-to-Right | Latin, Greek, and Cyrillic letters; CJK ideographs |
| R | Right-to-Left | Hebrew letters |
| AL | Arabic Letter | Arabic, Thaana, Syriac |
| EN | European Number | ASCII digits 0–9 |
| AN | Arabic Number | Arabic-Indic digits |
| NSM | Nonspacing Mark | Combining marks |
| WS | Whitespace | Space |
| ON | Other Neutral | Most punctuation |
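| PDF | Pop Directional Format | U+202C |
| LRI | Left-to-Right Isolate | U+2066 |
| RLI | Right-to-Left Isolate | U+2067 |
The invisible marks LRM (U+200E) and RLM (U+200F) are not classes of their own: they are ordinary characters with Bidi Class L and R respectively.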
We explore how these properties drive the Bidirectional Algorithm in Bidirectional Text in Unicode.
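In Python, the Bidi Class is available through unicodedata.bidirectional(). The sketch below prints the class of a few characters and uses it for a simple check for strong right-to-left content. The has_rtl helper is a name invented for this example.

```python
import unicodedata

def has_rtl(text):
    """True if any character has a strong right-to-left class (R or AL)."""
    return any(unicodedata.bidirectional(c) in ('R', 'AL') for c in text)

# Latin A, Hebrew alef, Arabic ain, ASCII 5, Arabic-Indic 5, '!', RLM
for ch in ['A', '\u05D0', '\u0639', '5', '\u0665', '!', '\u200F']:
    print(f'U+{ord(ch):04X} {unicodedata.bidirectional(ch)}')
# U+0041 L
# U+05D0 R
# U+0639 AL
# U+0035 EN
# U+0665 AN
# U+0021 ON
# U+200F R

print(has_rtl('Hello'), has_rtl('Hello \u05E9\u05DC\u05D5\u05DD'))  # False True
```

A check like this can decide whether a text field needs dir="auto" or similar handling, though it is no substitute for running the full Bidirectional Algorithm.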
Practical Regex with Unicode Properties
Python with the regex Module
Python's built-in re module does not support \p{...} property escapes; its Unicode awareness is limited to shorthand classes such as \w, \d, and \s. The third-party regex module (install with uv add regex or pip install regex) provides full \p{...} support:
import regex
# Any Unicode letter
regex.findall(r'\p{L}+', "Hello 世界 مرحبا")
# ['Hello', '世界', 'مرحبا']
# Any decimal digit in any script
regex.findall(r'\p{Nd}+', "42 ٤٢ ৪২")
# ['42', '٤٢', '৪২']
# Negated: anything that is NOT a letter or digit
regex.findall(r'\P{Alnum}+', "hello, world! 42")
# [', ', '! ']
# Script-specific
regex.findall(r'\p{Script=Hiragana}+', "Hello はじめまして World")
# ['はじめまして']
# Currency symbols
regex.findall(r'\p{Sc}\d+', "Paid $42 and €15 and ¥1000")
# ['$42', '€15', '¥1000']
JavaScript Unicode Property Escapes
// Available in ES2018+ (Node.js 10+, modern browsers)
// Requires /u flag
// Match any letter
/\p{Letter}+/gu
// Match any number (all numeric categories)
/\p{Number}/gu
// Match emoji (using Emoji property; this also matches ASCII digits, # and *)
/\p{Emoji}/gu
// Negate with \P
/\P{ASCII}/gu // Any non-ASCII character
// Binary properties
/\p{White_Space}/gu
/\p{Uppercase}/gu
/\p{Lowercase}/gu
/\p{Alphabetic}/gu
/\p{Ideographic}/gu // CJK ideographs
// Practical: validate that a string contains only letters and spaces
function isNameValid(name) {
  return /^[\p{Letter}\p{Space_Separator}]+$/u.test(name);
}
isNameValid("María García"); // true
isNameValid("田中太郎"); // true
isNameValid("User123"); // false (contains digits)
Case Folding
Case folding is the process of normalizing case for caseless matching. It is more thorough than simple lowercasing:
# casefold() is Python's case-folding method (more aggressive than lower())
print('ß'.lower())          # 'ß': German sharp s is already lowercase
print('ß'.casefold())       # 'ss': case-folded to two characters!
print('\u01C5'.lower())     # 'dž' (titlecase Dž, U+01C5 → lowercase U+01C6)
print('\u01C5'.casefold())  # 'dž' (case folding gives the same result)
# For caseless comparison, use casefold()
def caseless_equal(s1, s2):
    return s1.casefold() == s2.casefold()
caseless_equal('CAFÉ', 'café')       # True
caseless_equal('straße', 'STRASSE')  # True (ß → ss)
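Case folding alone does not reconcile different normalization forms, so a precomposed É and an E followed by a combining accent still compare unequal. A minimal sketch of the canonical caseless match defined by the Unicode Standard, NFD(casefold(NFD(s))), follows. The canonical_caseless_equal name is chosen for this example.

```python
import unicodedata

def canonical_caseless_equal(s1, s2):
    """Compare two strings using NFD(casefold(NFD(s))) as the key."""
    def key(s):
        nfd = unicodedata.normalize('NFD', s)
        return unicodedata.normalize('NFD', nfd.casefold())
    return key(s1) == key(s2)

precomposed = 'CAF\u00C9'  # É as a single code point
decomposed = 'cafe\u0301'  # e followed by COMBINING ACUTE ACCENT

print(precomposed.casefold() == decomposed.casefold())    # False
print(canonical_caseless_equal(precomposed, decomposed))  # True
```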
Emoji Properties
Unicode 16.0 defines several emoji-specific properties:
| Property | Description |
|---|---|
| Emoji | Is an emoji character |
| Emoji_Presentation | Displays as emoji by default |
| Emoji_Modifier | Can modify other emoji (skin tone modifiers) |
| Emoji_Modifier_Base | Can be modified by Emoji_Modifier |
| Emoji_Component | Used in emoji sequences but not standalone emoji |
| Extended_Pictographic | Any character that could be an emoji |
| RGI_Emoji | Recommended for General Interchange |
// Find emoji in text
function extractEmoji(text) {
  return [...text.match(/\p{RGI_Emoji}/gv) || []];
}
// Note: /v flag (Unicode Sets mode) is required for RGI_Emoji
// Available in Node.js 20+, Chrome 112+, Safari 17+
extractEmoji("Hello 😀 World 🌍 café"); // ['😀', '🌍']
Accessing Properties via SymbolFYI
Our Unicode Lookup tool displays all major properties for any character you enter: its code point, official name, General Category, Script, Bidi Class, and more. The Character Counter provides property breakdowns for entire strings, helping you understand the composition of any text.
Summary
Unicode character properties are the foundation of correct internationalized text processing:
- General Category (Lu, Ll, Nd, Po, etc.) classifies what a character is
- Script (Latin, Arabic, Han, etc.) identifies which writing system it belongs to
- Derived properties (Alphabetic, ID_Start, White_Space) provide practical classifications for common use cases
- Numeric value enables cross-script number handling
- Bidi Class drives right-to-left text rendering
- Emoji properties identify and classify emoji characters
Regex engines that support \p{...} property escapes expose much of this power in pattern matching: JavaScript with the u or v flag, Python with the regex module, and Java's java.util.regex, where Pattern.UNICODE_CHARACTER_CLASS additionally makes \w, \d, and \s Unicode-aware.
Next in Series: Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist — Understand the Unicode Bidirectional Algorithm and how Arabic, Hebrew, and other RTL scripts mix with left-to-right text.