SymbolFYI

Unicode Properties and Categories: Classifying Every Character

Every Unicode character carries metadata far beyond its visual appearance. Each of the 154,000+ assigned characters has dozens of properties: its general category, which script it belongs to, whether it is a letter or digit, how it behaves in bidirectional text, whether it is case-foldable, and much more. These properties are the engine behind internationalized text processing — the reason a regex like \p{Letter} matches क and ш and ψ and 가 as naturally as it matches A.

The Unicode Character Database

All character properties are defined in the Unicode Character Database (UCD), a set of machine-readable data files maintained by the Unicode Consortium. The UCD is the authoritative source that programming languages, operating systems, and text processing libraries implement.

The most important data files include:

  • UnicodeData.txt: basic properties for every character (name, General Category, Bidi Class, numeric values, decomposition)
  • Scripts.txt: which script each character belongs to
  • DerivedCoreProperties.txt: derived properties like "Alphabetic" and "ID_Start"
  • PropList.txt: miscellaneous binary properties
  • CaseFolding.txt: case folding data
  • extracted/DerivedBidiClass.txt: the Bidi Class of every code point, grouped by value

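The raw files can be downloaded from the Unicode website, and most programming languages expose a subset through their standard library. Python's unicodedata module, for example, covers names, General Category, Bidi Class, numeric values, and decomposition, but not Script.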
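To make "machine-readable" concrete: UnicodeData.txt is a plain text file with one semicolon-separated record per line. The sketch below parses three sample records embedded in the code. UcdRecord and parse_unicode_data are hypothetical names chosen for this example, and the sketch ignores the First/Last range records that the real file uses for large blocks such as CJK ideographs.

```python
from typing import NamedTuple


class UcdRecord(NamedTuple):
    """A few of the 15 fields in a UnicodeData.txt record (hypothetical type)."""
    code_point: int
    name: str
    category: str      # field 2: General Category
    bidi_class: str    # field 4: Bidi Class
    numeric: str       # field 8: numeric value, kept as text (may be '1/2')


# Sample records in the UnicodeData.txt format, embedded for illustration.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0663;ARABIC-INDIC DIGIT THREE;Nd;0;AN;;3;3;3;N;;;;;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
"""


def parse_unicode_data(text):
    """Return {code point: UcdRecord} for each non-empty line."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(';')
        cp = int(fields[0], 16)
        records[cp] = UcdRecord(cp, fields[1], fields[2], fields[4], fields[8])
    return records


records = parse_unicode_data(SAMPLE)
for rec in records.values():
    print(f'U+{rec.code_point:04X}  {rec.category}  {rec.bidi_class}  {rec.name}')
```

In practice a library does this parsing for you; the point is only that every property discussed below ultimately comes from files of this shape.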

General Category

The General Category is the most fundamental character property. Every Unicode character is assigned exactly one General Category, which classifies its basic type.

Category Table

Code Name Examples
Letter
Lu Uppercase Letter A, B, Ñ, Ü, Ж
Ll Lowercase Letter a, b, ñ, ü, ж
Lt Titlecase Letter Dž, Lj (digraphs)
Lm Modifier Letter ʰ, ʲ (phonetic modifiers)
Lo Other Letter 中, あ, 가, ع (letters without case)
Mark
Mn Nonspacing Mark Combining accent ́, combining diaeresis ̈
Mc Spacing Combining Mark Devanagari vowel signs
Me Enclosing Mark Combining enclosing circle
Number
Nd Decimal Digit Number 0–9, ٠–٩ (Arabic-Indic), ०–९ (Devanagari)
Nl Letter Number Ⅳ, ⅿ (Roman numerals in letter form)
No Other Number ², ½, ②
Punctuation
Pc Connector Punctuation _ (underscore)
Pd Dash Punctuation -, –, —
Ps Open Punctuation (, [, {
Pe Close Punctuation ), ], }
Pi Initial Quote Punctuation “, ‘, « (opening quotes)
Pf Final Quote Punctuation ”, ’, » (closing quotes)
Po Other Punctuation !, ?, ., ,
Symbol
Sm Math Symbol +, =, ×, ∑, ∫
Sc Currency Symbol $, €, ¥, £, ₿
Sk Modifier Symbol ^, `, ¨ (spacing accents and modifier symbols)
So Other Symbol ©, ®, ♠, ✓, 😀
Separator
Zs Space Separator Space, non-breaking space, em space
Zl Line Separator U+2028
Zp Paragraph Separator U+2029
Other
Cc Control Tab, newline, null, DEL
Cf Format Zero-width joiner, BOM, soft hyphen
Cs Surrogate U+D800–U+DFFF
Co Private Use U+E000–U+F8FF, planes 15–16
Cn Unassigned All unassigned code points
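
The first letter of each two-letter code names the major class, so grouping characters takes a single lookup. A minimal sketch using Python's unicodedata module (MAJOR_CLASSES and major_class are names invented for this example):

```python
import unicodedata

# First letter of the General Category code -> major class from the table
MAJOR_CLASSES = {
    'L': 'Letter',
    'M': 'Mark',
    'N': 'Number',
    'P': 'Punctuation',
    'S': 'Symbol',
    'Z': 'Separator',
    'C': 'Other',
}


def major_class(char):
    """Return the major class name for a single character."""
    return MAJOR_CLASSES[unicodedata.category(char)[0]]


# Straight ASCII quotes and curly quotes land in different subcategories
# of the same major class.
for ch in ['"', '\u201C', '\u201D', '_', '\u20AC', '\u200D']:
    print(f'U+{ord(ch):04X}  {unicodedata.category(ch)}  {major_class(ch)}')
```

Matching on the first letter is also how regex shorthands such as \p{L} or \p{P} relate to the two-letter codes.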

Accessing General Category in Python

import unicodedata

chars = ['A', 'a', '3', '中', '!', '+', '$', '😀', ' ', '\u0301']
for c in chars:
    cat = unicodedata.category(c)
    name = unicodedata.name(c, '(no name)')
    print(f'U+{ord(c):04X}  {cat}  {name}')

# U+0041  Lu  LATIN CAPITAL LETTER A
# U+0061  Ll  LATIN SMALL LETTER A
# U+0033  Nd  DIGIT THREE
# U+4E2D  Lo  CJK UNIFIED IDEOGRAPH-4E2D
# U+0021  Po  EXCLAMATION MARK
# U+002B  Sm  PLUS SIGN
# U+0024  Sc  DOLLAR SIGN
# U+1F600  So  GRINNING FACE
# U+0020  Zs  SPACE
# U+0301  Mn  COMBINING ACUTE ACCENT

Using Categories for Text Processing

import unicodedata

def count_by_category(text):
    """Count characters by their Unicode general category."""
    counts = {}
    for c in text:
        cat = unicodedata.category(c)
        counts[cat] = counts.get(cat, 0) + 1
    return counts

sample = "Hello, 世界! 42 + $3.14 = π"
print(count_by_category(sample))
# {'Lu': 1, 'Ll': 5, 'Po': 3, 'Zs': 6, 'Lo': 2, 'Nd': 5,
#  'Sm': 2, 'Sc': 1}

# Extract only letters
letters_only = ''.join(c for c in sample
                       if unicodedata.category(c).startswith('L'))
print(letters_only)  # Hello世界π

The Script Property

While General Category tells you what a character is, the Script property tells you which writing system it belongs to. The Script property is essential for language detection, font selection, and text segmentation.

Common script values include: Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Georgian, Hangul, Hiragana, Katakana, Han (CJK), and roughly 150 more, for a total of more than 160 scripts.

Some characters have script value Common (punctuation, digits, symbols used across scripts) or Inherited (combining marks that inherit the script of the base character they attach to).

# Python 3.x doesn't expose Script directly from unicodedata
# Use the 'regex' package for script-aware matching
import regex

# Match any Arabic letter
arabic_pattern = regex.compile(r'\p{Script=Arabic}+')
text = "مرحبا بالعالم Hello مرحبا"
matches = arabic_pattern.findall(text)
print(matches)  # ['مرحبا', 'بالعالم', 'مرحبا']

# Match any Han (CJK) character
han_pattern = regex.compile(r'\p{Script=Han}+')
text2 = "Hello 世界 World 日本語"
print(han_pattern.findall(text2))  # ['世界', '日本語']

Script in JavaScript

ECMAScript 2018 introduced Unicode property escapes in regular expressions. The \p{...} syntax gives direct access to Unicode properties:

// Script matching
const arabicRE = /\p{Script=Arabic}+/gu;
const text = "مرحبا بالعالم Hello";
console.log(text.match(arabicRE));  // ['مرحبا', 'بالعالم']

// General category matching
const letterRE = /\p{Letter}+/gu;
const text2 = "Hello 42 世界 3.14";
console.log(text2.match(letterRE));  // ['Hello', '世界']

// Decimal digits in any script
const digitRE = /\p{Decimal_Number}+/gu;
const text3 = "42 ١٢٣ ৪৫৬";  // ASCII, Arabic-Indic, Bengali
console.log(text3.match(digitRE));  // ['42', '١٢٣', '৪৫৬']

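The u flag is required to enable \p{...} in JavaScript regular expressions (the newer v flag enables it as well), and the g flag makes match() return every match rather than only the first. Without u or v, \p is not a property escape: it simply matches a literal p, so /\p{Letter}/ looks for the text "p{Letter}".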

Derived Properties

Beyond the basic UCD properties, Unicode defines a number of derived properties computed from combinations of basic properties. These are the ones you most commonly use in practice:

Alphabetic

A character is Alphabetic if it is a letter or a mark that functions as part of a word. This includes:

  • All Lu, Ll, Lt, Lm, and Lo characters
  • Letter numbers (Nl, such as Roman numerals)
  • Some Mn/Mc combining marks that function as letters in their scripts (the Other_Alphabetic set)
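
Python's standard library has no direct Alphabetic lookup, and str.isalpha() only checks the five letter categories. The sketch below approximates the definition above from the General Category alone. approx_alphabetic is a hypothetical helper, and it deliberately leaves out the Other_Alphabetic marks, which would need PropList.txt data.

```python
import unicodedata


def approx_alphabetic(char):
    """Approximate the Alphabetic property: any L* category or Nl.

    Assumption: combining marks in Other_Alphabetic are NOT covered,
    so this under-reports for scripts such as Devanagari.
    """
    cat = unicodedata.category(char)
    return cat.startswith('L') or cat == 'Nl'


for ch in ['A', '\u2163', '\u0915', '\u093E', '5']:
    print(f'U+{ord(ch):04X}  {unicodedata.category(ch)}  '
          f'isalpha={ch.isalpha()}  approx={approx_alphabetic(ch)}')
```

U+2163 ROMAN NUMERAL FOUR shows the gap between the two checks: it is category Nl, so isalpha() rejects it while the approximation accepts it. U+093E DEVANAGARI VOWEL SIGN AA (category Mc) is rejected by both, which is where a regex engine with real \p{Alphabetic} support is the better tool.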

// \p{Alphabetic} matches letters across all scripts
const alphaRE = /\p{Alphabetic}+/gu;
"café αβγ مرحبا 漢字".match(alphaRE);
// ['café', 'αβγ', 'مرحبا', '漢字']

ID_Start and ID_Continue

These properties define which characters can start an identifier and which can continue one. Programming language lexers use them to decide what counts as a valid variable name.

import unicodedata

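# Approximation of the characters that can start a Python identifier.
# Python's actual rule is XID_Start, which also includes the small
# Other_ID_Start set; this General Category check covers the bulk of it.
def can_start_identifier(c):
    cat = unicodedata.category(c)
    return cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl') or c == '_'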

# Python identifiers can use Unicode letters
# This is valid Python 3:
# π = 3.14159
# 変数 = "variable"
# café = "coffee shop"
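
For a whole-string check there is no need to reimplement the rules: str.isidentifier() applies Python's own identifier definition. A small sketch (is_usable_identifier is a name made up for this example) that also rejects reserved words:

```python
import keyword


def is_usable_identifier(name):
    """True if `name` is a valid Python identifier and not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ['π', '変数', 'café', '_private', '2fast', 'class', 'a-b']:
    print(f'{candidate!r}: {is_usable_identifier(candidate)}')
```

The keyword check is a separate step because isidentifier() only looks at the characters, so it returns True for reserved words such as class.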

Uppercase, Lowercase, Cased

import unicodedata

# unicodedata exposes the General Category; the str methods below do the case checks
print(unicodedata.category('A'))  # Lu (Uppercase Letter)
print(unicodedata.category('a'))  # Ll (Lowercase Letter)

# Check case properties
print('A'.isupper())   # True
print('a'.islower())   # True
print('中'.isupper())  # False
print('中'.islower())  # False — CJK has no case

# Case conversion works across scripts
print('ñoño'.upper())   # 'ÑOÑO'
print('ΑΘΗΝΑ'.lower())  # 'αθηνα' (Greek)
print('МОСКВА'.lower()) # 'москва' (Russian)

White_Space

import unicodedata

# Not all whitespace is U+0020!
spaces = [
    '\u0020',   # SPACE
    '\u00A0',   # NO-BREAK SPACE
    '\u2003',   # EM SPACE
    '\u2009',   # THIN SPACE
    '\u200B',   # ZERO WIDTH SPACE (NOT white_space!)
    '\u3000',   # IDEOGRAPHIC SPACE (fullwidth)
]

for s in spaces:
    print(f'U+{ord(s):04X} {unicodedata.name(s)}: '
          f'category={unicodedata.category(s)}')

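# In Python, str.split() (with no arguments) and str.strip() work on Unicode
# whitespace as defined by str.isspace(), not just the ASCII space. That set
# is close to White_Space but not identical (it also includes U+001C–U+001F).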
"hello\u2003world".split()  # ['hello', 'world']

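Python has no built-in White_Space lookup, so an exact check needs the code point list from PropList.txt. The sketch below hard-codes that list as it stands in recent Unicode versions (25 code points, transcribed by hand for this example, so treat it as an assumption and regenerate it from the UCD for real use). WHITE_SPACE and is_white_space are hypothetical names.

```python
# Code points with White_Space=Yes, hand-copied for illustration.
WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E),   # TAB, LF, VT, FF, CR
    0x0020,                   # SPACE
    0x0085,                   # NEXT LINE
    0x00A0,                   # NO-BREAK SPACE
    0x1680,                   # OGHAM SPACE MARK
    *range(0x2000, 0x200B),   # EN QUAD .. HAIR SPACE
    0x2028, 0x2029,           # LINE / PARAGRAPH SEPARATOR
    0x202F, 0x205F, 0x3000,   # narrow no-break, medium math, ideographic
])


def is_white_space(char):
    """True if `char` has the White_Space property (per the table above)."""
    return ord(char) in WHITE_SPACE


# Compare the property with Python's own notion of whitespace.
for ch in ['\u00A0', '\u200B', '\u3000', '\x1c']:
    print(f'U+{ord(ch):04X}  White_Space={is_white_space(ch)}  '
          f'isspace={ch.isspace()}')
```

U+200B is rejected by both checks, while U+001C (a control character that str.isspace() accepts) is whitespace to Python but does not have the White_Space property.
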
Numeric Value

Characters in categories Nd, Nl, and No carry a numeric value property. This allows cross-script arithmetic:

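import unicodedata

# '\uAA55' is CHAM DIGIT FIVE, written as an escape so the code point is unambiguous
digits = ['5', '٥', '৫', '๕', '\uAA55', '⑤']
for d in digits:
    dec = unicodedata.decimal(d, None)
    dig = unicodedata.digit(d, None)
    num = unicodedata.numeric(d, None)
    print(f'{d}  decimal={dec}  digit={dig}  numeric={num}  '
          f'name={unicodedata.name(d)}')

# 5  decimal=5  digit=5  numeric=5.0  name=DIGIT FIVE
# ٥  decimal=5  digit=5  numeric=5.0  name=ARABIC-INDIC DIGIT FIVE
# ৫  decimal=5  digit=5  numeric=5.0  name=BENGALI DIGIT FIVE
# ๕  decimal=5  digit=5  numeric=5.0  name=THAI DIGIT FIVE
# ꩕  decimal=5  digit=5  numeric=5.0  name=CHAM DIGIT FIVE
# ⑤  decimal=None  digit=5  numeric=5.0  name=CIRCLED DIGIT FIVE
# (the circled digit is category No, not Nd, so it has no decimal value)

Note the distinction:

  • decimal(): Returns a value only for true decimal digits (Nd), the characters that can be used in positional notation. Returns the default for everything else.
  • digit(): Also returns a value for digit-like characters that are not decimal digits, such as circled digits (⑤ = 5) and superscripts (² = 2). Returns the default for Roman numerals and fractions.
  • numeric(): Returns the numeric value for all numeric characters, including fractions (½ = 0.5), Roman numerals (Ⅳ = 4.0), and other number forms.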
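
One practical use of the decimal value is normalizing digits from any script to ASCII before parsing. The sketch below uses unicodedata.decimal() for this; to_ascii_digits is a hypothetical helper, and it intentionally leaves non-decimal number characters such as ⑤ or ½ untouched.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every decimal digit (Nd) with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        # Characters without a decimal value pass through unchanged
        out.append(str(value) if value is not None else ch)
    return ''.join(out)


print(to_ascii_digits('٤٢ / ৪২ / ⑤'))

# int() already accepts decimal digits from any script
print(int('٤٢') + int('৪২'))
```

Restricting the conversion to decimal digits avoids turning a string like "x²" into "x2", which would silently change its meaning.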

Bidi Class

The Bidi Class (Bidirectional Category) determines how a character participates in the Unicode Bidirectional Algorithm. Common values:

Class Description Examples
L Left-to-Right Latin, Greek, Cyrillic, and Han letters; LRM (U+200E)
R Right-to-Left Hebrew letters; RLM (U+200F)
AL Arabic Letter Arabic, Thaana, Syriac
EN European Number ASCII digits 0–9
AN Arabic Number Arabic-Indic digits
NSM Nonspacing Mark Combining marks
WS Whitespace Space
ON Other Neutral Most punctuation
PDF Pop Directional Format U+202C
LRI Left-to-Right Isolate U+2066
RLI Right-to-Left Isolate U+2067
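
Python exposes this property through unicodedata.bidirectional(). A short sketch (bidi_classes and has_rtl are names invented for this example) that lists the class of each character and flags strings containing strongly right-to-left characters:

```python
import unicodedata


def bidi_classes(text):
    """Return the Bidi Class of every character in `text`."""
    return [unicodedata.bidirectional(c) for c in text]


def has_rtl(text):
    """True if any character is strongly right-to-left (class R or AL)."""
    return any(unicodedata.bidirectional(c) in ('R', 'AL') for c in text)


samples = ['A', 'א', 'ع', '5', '٥', '\u0301', ' ', '!', '\u200E', '\u200F']
for ch in samples:
    print(f'U+{ord(ch):04X}  {unicodedata.bidirectional(ch)}')

print(has_rtl('Hello שלום'))
print(has_rtl('Hello 42'))
```

A check like has_rtl is a common first step for deciding whether a string needs bidi-aware layout at all; it does not implement the algorithm itself.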

We explore how these properties drive the Bidirectional Algorithm in Bidirectional Text in Unicode.

Practical Regex with Unicode Properties

Python with the regex Module

Python's built-in re module does not support \p{...} property escapes; its Unicode awareness is limited to shorthand classes such as \w, \d, and \s. The third-party regex module (install with uv add regex or pip install regex) provides full \p{...} support:

import regex

# Any Unicode letter
regex.findall(r'\p{L}+', "Hello 世界 مرحبا")
# ['Hello', '世界', 'مرحبا']

# Any decimal digit in any script
regex.findall(r'\p{Nd}+', "42 ٤٢ ৪২")
# ['42', '٤٢', '৪২']

# Negated: anything that is NOT a letter or digit
regex.findall(r'\P{Alnum}+', "hello, world! 42")
# [', ', '! ']

# Script-specific
regex.findall(r'\p{Script=Hiragana}+', "Hello はじめまして World")
# ['はじめまして']

# Currency symbols
regex.findall(r'\p{Sc}\d+', "Paid $42 and €15 and ¥1000")
# ['$42', '€15', '¥1000']
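
When adding the third-party dependency is not an option, the standard re module can approximate the two most common cases, because \w and \d are Unicode-aware for str patterns. This is a rough sketch, not an equivalent of \p{L}: the character class below also admits non-decimal number characters such as ² and splits words at combining marks. find_letter_runs and find_digit_runs are hypothetical helper names.

```python
import re

# [^\W\d_] = "word characters that are neither decimal digits nor underscore"
LETTER_RUN = re.compile(r'[^\W\d_]+')
# \d matches decimal digits (Nd) in any script for str patterns
DIGIT_RUN = re.compile(r'\d+')


def find_letter_runs(text):
    """Approximate \\p{L}+ using only the standard library."""
    return LETTER_RUN.findall(text)


def find_digit_runs(text):
    """Approximate \\p{Nd}+ using only the standard library."""
    return DIGIT_RUN.findall(text)


print(find_letter_runs('Hello 世界 مرحبا 42'))
print(find_digit_runs('42 ٤٢ ৪২'))
```

For anything beyond these two cases (scripts, binary properties, negated categories), the regex module remains the practical choice.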

JavaScript Unicode Property Escapes

// Available in ES2018+ (Node.js 10+, modern browsers)
// Requires /u flag

// Match any letter
/\p{Letter}+/gu

// Match any number (all numeric categories)
/\p{Number}/gu

// Match characters with the Emoji property
// (this also matches 0–9, # and *, which are keycap emoji bases)
/\p{Emoji}/gu

// Negate with \P
/\P{ASCII}/gu   // Any non-ASCII character

// Binary properties
/\p{White_Space}/gu
/\p{Uppercase}/gu
/\p{Lowercase}/gu
/\p{Alphabetic}/gu
/\p{Ideographic}/gu   // CJK ideographs

// Practical: validate that a string contains only letters and spaces
function isNameValid(name) {
    return /^[\p{Letter}\p{Space_Separator}]+$/u.test(name);
}

isNameValid("María García");    // true
isNameValid("田中太郎");          // true
isNameValid("User123");          // false (contains digits)

Case Folding

Case folding is the process of normalizing case for caseless matching. It is more thorough than simple lowercasing:

# casefold() is Python's case-folding method (more aggressive than lower())
print('ß'.lower())      # 'ß': German sharp s is already lowercase
print('ß'.casefold())   # 'ss': case folding expands it to two characters

# U+01C5 is the titlecase digraph Dž, written as an escape to be unambiguous
print('\u01C5'.lower())      # 'dž' (U+01C6, the lowercase form)
print('\u01C5'.casefold())   # 'dž' (case folding gives the same result here)

# For caseless comparison, use casefold()
def caseless_equal(s1, s2):
    return s1.casefold() == s2.casefold()

caseless_equal('CAFÉ', 'café')  # True
caseless_equal('straße', 'STRASSE')  # True (ß → ss)
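
casefold() alone compares code points, so two strings that differ only in normalization form still compare unequal. The Unicode Standard's canonical caseless matching wraps case folding in NFD normalization. A minimal sketch of that definition (canonical_caseless_equal is a name invented for this example):

```python
import unicodedata


def _nfd(s):
    return unicodedata.normalize('NFD', s)


def canonical_caseless_equal(s1, s2):
    """Compare NFD(casefold(NFD(s))) for both strings."""
    return _nfd(_nfd(s1).casefold()) == _nfd(_nfd(s2).casefold())


decomposed = 'CAFE\u0301'   # E followed by COMBINING ACUTE ACCENT
precomposed = 'caf\u00E9'   # single code point é

print(decomposed.casefold() == precomposed.casefold())
print(canonical_caseless_equal(decomposed, precomposed))
```

The second normalization pass matters because case folding can itself produce text that is no longer in NFD.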

Emoji Properties

Unicode 16.0 defines several emoji-specific properties:

Property Description
Emoji Is an emoji character
Emoji_Presentation Displays as emoji by default
Emoji_Modifier Can modify other emoji (skin tone modifiers)
Emoji_Modifier_Base Can be modified by Emoji_Modifier
Emoji_Component Used in emoji sequences but not standalone emoji
Extended_Pictographic Pictographic characters, including code points reserved for future emoji
RGI_Emoji Recommended for General Interchange (a property of strings, not of single code points)

// Find emoji in text
function extractEmoji(text) {
    return [...text.match(/\p{RGI_Emoji}/gv) || []];
}

// Note: /v flag (Unicode Sets mode) is required for RGI_Emoji
// Available in Node.js 20+, Chrome 112+, Safari 17+
extractEmoji("Hello 😀 World 🌍 café");  // ['😀', '🌍']

Accessing Properties via SymbolFYI

Our Unicode Lookup tool displays all major properties for any character you enter: its code point, official name, General Category, Script, Bidi Class, and more. The Character Counter provides property breakdowns for entire strings, helping you understand the composition of any text.

Summary

Unicode character properties are the foundation of correct internationalized text processing:

  • General Category (Lu, Ll, Nd, Po, etc.) classifies what a character is
  • Script (Latin, Arabic, Han, etc.) identifies which writing system it belongs to
  • Derived properties (Alphabetic, ID_Start, White_Space) provide practical classifications for common use cases
  • Numeric value enables cross-script number handling
  • Bidi Class drives right-to-left text rendering
  • Emoji properties identify and classify emoji characters

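Regex engines that support \p{...} property escapes expose much of this power in pattern matching: JavaScript with the u or v flag, Python with the regex module, and Java's java.util.regex (which accepts \p{L}, \p{IsAlphabetic}, and \p{IsLatin} directly, and uses Pattern.UNICODE_CHARACTER_CLASS to make \w, \d, and \s Unicode-aware as well).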


Next in Series: Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist — Understand the Unicode Bidirectional Algorithm and how Arabic, Hebrew, and other RTL scripts mix with left-to-right text.
