Unicode Properties and Categories: Classifying Every Character

Unicode Deep Dive · Apr 11, 2023

Every Unicode character carries metadata far beyond its visual appearance. Each of the 154,000+ assigned characters has dozens of properties: its general category, which script it belongs to, whether it is a letter or digit, how it behaves in bidirectional text, whether it is case-foldable, and much more. These properties are the engine behind internationalized text processing — the reason a regex like \p{Letter} matches क and ш and ψ and 가 as naturally as it matches A.

The Unicode Character Database

All character properties are defined in the Unicode Character Database (UCD), a set of machine-readable data files maintained by the Unicode Consortium. The UCD is the authoritative source that programming languages, operating systems, and text processing libraries implement.

The most important data files include:
- UnicodeData.txt: core properties for every character (name, General Category, Bidi Class, numeric values, decomposition)
- Scripts.txt: which script each character belongs to
- DerivedCoreProperties.txt: derived properties like "Alphabetic" and "ID_Start"
- PropList.txt: miscellaneous binary properties
- CaseFolding.txt: case folding data
- BidiMirroring.txt and BidiBrackets.txt: mirroring and paired-bracket data for the bidirectional algorithm (the Bidi Class itself is a field in UnicodeData.txt)
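
UnicodeData.txt is plain text with one character per line and 15 semicolon-separated fields. The following is a minimal sketch of reading a single record, assuming that layout; `UcdRecord` and `parse_unicode_data_line` are hypothetical names chosen for illustration, and only a few of the fields are kept.

```python
from typing import NamedTuple


class UcdRecord(NamedTuple):
    """A small subset of the fields on one UnicodeData.txt line."""
    code_point: int
    name: str
    general_category: str
    bidi_class: str
    simple_lowercase: str  # hex code point, or '' when there is no mapping


def parse_unicode_data_line(line: str) -> UcdRecord:
    """Split one UnicodeData.txt line into the fields we care about."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UcdRecord(
        code_point=int(fields[0], 16),   # field 0: code point in hex
        name=fields[1],                  # field 1: character name
        general_category=fields[2],      # field 2: General Category
        bidi_class=fields[4],            # field 4: Bidi Class
        simple_lowercase=fields[13],     # field 13: simple lowercase mapping
    )


# The UnicodeData.txt entry for U+0041.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_unicode_data_line(SAMPLE)
print(record.name, record.general_category, record.bidi_class)
# LATIN CAPITAL LETTER A Lu L
```

In practice you rarely parse the file yourself; the sketch is only meant to show how directly the properties discussed below map onto the raw data.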

The files are published on the Unicode website, and most programming languages expose a subset of the data through their standard library or regex engine.

General Category

The General Category is the most fundamental character property. Every Unicode character is assigned exactly one General Category, which classifies its basic type.

Category Table

Code	Name	Examples
Letter
Lu	Uppercase Letter	A, B, Ñ, Ü, Ж
Ll	Lowercase Letter	a, b, ñ, ü, ж
Lt	Titlecase Letter	Dž, Lj (digraphs)
Lm	Modifier Letter	ʰ, ʲ (phonetic modifiers)
Lo	Other Letter	中, あ, 가, ب (letters without case)
Mark
Mn	Nonspacing Mark	Combining accent ́, combining diaeresis ̈
Mc	Spacing Combining Mark	Devanagari vowel signs
Me	Enclosing Mark	Combining enclosing circle
Number
Nd	Decimal Digit Number	0–9, ٠–٩ (Arabic-Indic), ०–९ (Devanagari)
Nl	Letter Number	Ⅳ, ⅿ (Roman numerals in letter form)
No	Other Number	², ½, ②
Punctuation
Pc	Connector Punctuation	_ (underscore)
Pd	Dash Punctuation	-, –, —
Ps	Open Punctuation	(, [, {
Pe	Close Punctuation	), ], }
Pi	Initial Quote Punctuation	“, ‘, « (opening quotes)
Pf	Final Quote Punctuation	”, ’, » (closing quotes)
Po	Other Punctuation	!, ?, ., ,
Symbol
Sm	Math Symbol	+, =, ×, ∑, ∫
Sc	Currency Symbol	$, €, ¥, £, ₿
Sk	Modifier Symbol	^, `, ¨, ˘ (spacing accents and other modifier symbols)
So	Other Symbol	©, ®, ♠, ✓, 😀
Separator
Zs	Space Separator	Space, non-breaking space, em space
Zl	Line Separator	U+2028
Zp	Paragraph Separator	U+2029
Other
Cc	Control	Tab, newline, null, DEL
Cf	Format	Zero-width joiner, BOM, soft hyphen
Cs	Surrogate	U+D800–U+DFFF
Co	Private Use	U+E000–U+F8FF, planes 15–16
Cn	Unassigned	All unassigned code points
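
The first letter of each two-letter code identifies the major class (Letter, Mark, Number, and so on), so grouping by class needs only that letter. A small sketch using the standard library; `MAJOR_CLASSES` and `major_class` are illustrative names, not part of any API.

```python
import unicodedata

# First letter of the General Category code -> major class from the table.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(char: str) -> str:
    """Return the major class of a single character."""
    return MAJOR_CLASSES[unicodedata.category(char)[0]]


for ch in ["A", "\u0301", "\u0665", "_", "\u20ac", " ", "\t"]:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {major_class(ch)}")
```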

Accessing General Category in Python

import unicodedata

chars = ['A', 'a', '3', '中', '!', '+', '$', '😀', ' ', '\u0301']
for c in chars:
    cat = unicodedata.category(c)
    name = unicodedata.name(c, '(no name)')
    print(f'U+{ord(c):04X}  {cat}  {name}')

# U+0041  Lu  LATIN CAPITAL LETTER A
# U+0061  Ll  LATIN SMALL LETTER A
# U+0033  Nd  DIGIT THREE
# U+4E2D  Lo  CJK UNIFIED IDEOGRAPH-4E2D
# U+0021  Po  EXCLAMATION MARK
# U+002B  Sm  PLUS SIGN
# U+0024  Sc  DOLLAR SIGN
# U+1F600  So  GRINNING FACE
# U+0020  Zs  SPACE
# U+0301  Mn  COMBINING ACUTE ACCENT

Using Categories for Text Processing

import unicodedata

def count_by_category(text):
    """Count characters by their Unicode general category."""
    counts = {}
    for c in text:
        cat = unicodedata.category(c)
        counts[cat] = counts.get(cat, 0) + 1
    return counts

sample = "Hello, 世界! 42 + $3.14 = π"
print(count_by_category(sample))
# {'Lu': 1, 'Ll': 5, 'Po': 3, 'Zs': 6, 'Lo': 2, 'Nd': 5,
#  'Sm': 2, 'Sc': 1}

# Extract only letters (any category starting with 'L')
letters_only = ''.join(c for c in sample
                       if unicodedata.category(c).startswith('L'))
print(letters_only)  # Hello世界π
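
Another common use of categories is removing accents: decompose the text, drop the nonspacing marks (Mn), and recompose. The sketch below assumes that losing the marks is acceptable for your use case, such as building a loose search key. It is lossy, and in scripts where Mn marks carry vowels or other essential information it will damage the text. `strip_marks` is a hypothetical helper name.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove nonspacing combining marks (category Mn) from text."""
    # NFD splits precomposed characters such as 'é' into 'e' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    kept = (c for c in decomposed if unicodedata.category(c) != "Mn")
    # Recompose whatever is left into NFC.
    return unicodedata.normalize("NFC", "".join(kept))


print(strip_marks("caf\u00e9 na\u00efve"))  # cafe naive
```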

The Script Property

While General Category tells you what a character is, the Script property tells you which writing system it belongs to. The Script property is essential for language detection, font selection, and text segmentation.

Common script values include: Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Georgian, Hangul, Hiragana, Katakana, Han (CJK), and roughly 150 more (Unicode 16.0 defines 168 scripts in total).

Some characters have script value Common (punctuation, digits, symbols used across scripts) or Inherited (combining marks that inherit the script of the base character they attach to).

# Python 3.x doesn't expose Script directly from unicodedata
# Use the 'regex' package for script-aware matching
import regex

# Match any Arabic letter
arabic_pattern = regex.compile(r'\p{Script=Arabic}+')
text = "مرحبا بالعالم Hello مرحبا"
matches = arabic_pattern.findall(text)
print(matches)  # ['مرحبا', 'بالعالم', 'مرحبا']

# Match any Han (CJK) character
han_pattern = regex.compile(r'\p{Script=Han}+')
text2 = "Hello 世界 World 日本語"
print(han_pattern.findall(text2))  # ['世界', '日本語']

Script in JavaScript

ECMAScript 2018 introduced Unicode property escapes in regular expressions. The \p{...} syntax gives direct access to Unicode properties:

// Script matching
const arabicRE = /\p{Script=Arabic}+/gu;
const text = "مرحبا بالعالم Hello";
console.log(text.match(arabicRE));  // ['مرحبا', 'بالعالم']

// General category matching
const letterRE = /\p{Letter}+/gu;
const text2 = "Hello 42 世界 3.14";
console.log(text2.match(letterRE));  // ['Hello', '世界']

// Decimal digits in any script
const digitRE = /\p{Decimal_Number}+/gu;
const text3 = "42 ١٢٣ ৪৫৬";  // ASCII, Arabic-Indic, Bengali
console.log(text3.match(digitRE));  // ['42', '١٢٣', '৪৫৬']

The u flag (or the newer v flag) is required for \p{...} to work in JavaScript regex, and the g flag enables global matching. Without u or v, \p is just an escaped literal p, so /\p{Letter}/ would match the text "p{Letter}" instead of a letter.

Derived Properties

Beyond the basic UCD properties, Unicode defines a number of derived properties computed from combinations of basic properties. These are the ones you most commonly use in practice:

Alphabetic

A character is Alphabetic if it is a letter, or a mark or number that behaves like a letter within words. This includes:
- All Lu, Ll, Lt, Lm, and Lo characters
- Letter numbers (Nl), such as Roman numerals
- Characters listed as Other_Alphabetic in PropList.txt, mostly Mn/Mc combining marks that function as part of letters in their scripts

// \p{Alphabetic} matches letters across all scripts
const alphaRE = /\p{Alphabetic}+/gu;
"café αβγ مرحبا 漢字".match(alphaRE);
// ['café', 'αβγ', 'مرحبا', '漢字']

ID_Start and ID_Continue

These properties define which characters can appear at the start or continuation of an identifier — used by programming language lexers to determine valid variable names.

import unicodedata

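# Approximation of which characters can start a Python identifier.
# Python actually uses XID_Start plus underscore; for an exact answer
# on a whole name, use str.isidentifier().
def can_start_identifier(c):
    cat = unicodedata.category(c)
    return cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl') or c == '_'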

# Python identifiers can use Unicode letters
# This is valid Python 3:
# π = 3.14159
# 変数 = "variable"
# café = "coffee shop"
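
To check a whole name rather than a single character, the same idea extends to ID_Continue, which adds marks (Mn, Mc), decimal digits (Nd), and connector punctuation (Pc) to the start set. The sketch below is a category-based approximation only: it ignores the small Other_ID_Start and Other_ID_Continue lists and the XID adjustments. `looks_like_identifier` is a hypothetical name, and Python's built-in `str.isidentifier()` remains the authoritative check.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def looks_like_identifier(name: str) -> bool:
    """Approximate identifier check based on General Category alone."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and unicodedata.category(first) not in START_CATEGORIES:
        return False
    return all(unicodedata.category(c) in CONTINUE_CATEGORIES for c in rest)


for candidate in ["π", "変数", "café", "_private", "x2", "2fast", "a-b"]:
    print(f"{candidate!r}: approx={looks_like_identifier(candidate)} "
          f"builtin={candidate.isidentifier()}")
```

For these samples the approximation and the built-in agree; they can differ on the handful of characters covered only by the Other_ID_* lists.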

Uppercase, Lowercase, Cased

import unicodedata

# unicodedata reports the general category; the case checks are str methods
print(unicodedata.category('A'))  # Lu (Uppercase Letter)
print(unicodedata.category('a'))  # Ll (Lowercase Letter)

# Check case properties
print('A'.isupper())   # True
print('a'.islower())   # True
print('中'.isupper())  # False
print('中'.islower())  # False — CJK has no case

# Case conversion works across scripts
print('ñoño'.upper())   # 'ÑOÑO'
print('ΑΘΗΝΑ'.lower())  # 'αθηνα' (Greek)
print('МОСКВА'.lower()) # 'москва' (Russian)

White_Space

import unicodedata

# Not all whitespace is U+0020!
spaces = [
    '\u0020',   # SPACE
    '\u00A0',   # NO-BREAK SPACE
    '\u2003',   # EM SPACE
    '\u2009',   # THIN SPACE
    '\u200B',   # ZERO WIDTH SPACE (NOT white_space!)
    '\u3000',   # IDEOGRAPHIC SPACE (fullwidth)
]

for s in spaces:
    print(f'U+{ord(s):04X} {unicodedata.name(s)}: '
          f'category={unicodedata.category(s)}')

# In Python, str.split() and str.strip() without arguments use the same
# Unicode-aware test as str.isspace(), which closely tracks White_Space.
# They handle all Unicode spaces, not just ASCII space
"hello\u2003world".split()  # ['hello', 'world']

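As a quick check of which characters Python treats as whitespace, the sketch below calls `str.isspace()` on a few candidates and uses `split()` to collapse runs of Unicode spaces. `collapse_whitespace` is a hypothetical helper. ZERO WIDTH SPACE has category Cf, so it is not treated as whitespace and survives the collapse.

```python
candidates = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "LINE SEPARATOR": "\u2028",
    "IDEOGRAPHIC SPACE": "\u3000",
    "ZERO WIDTH SPACE": "\u200b",
}

for label, ch in candidates.items():
    print(f"U+{ord(ch):04X} {label}: isspace={ch.isspace()}")


def collapse_whitespace(text: str) -> str:
    """Replace every run of Unicode whitespace with one ASCII space."""
    return " ".join(text.split())


print(collapse_whitespace("hello\u2003\u00a0world"))  # hello world
```
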
Numeric Value

Characters in categories Nd, Nl, and No carry a numeric value property. This allows cross-script arithmetic:

import unicodedata

digits = ['5', '٥', '৫', '๕', '\uAA55', '⑤']  # '\uAA55' is CHAM DIGIT FIVE
for d in digits:
    dec = unicodedata.decimal(d, None)
    dig = unicodedata.digit(d, None)
    num = unicodedata.numeric(d, None)
    print(f'{d}  decimal={dec}  digit={dig}  numeric={num}  '
          f'{unicodedata.name(d)}')

# 5  decimal=5  digit=5  numeric=5.0  DIGIT FIVE
# ٥  decimal=5  digit=5  numeric=5.0  ARABIC-INDIC DIGIT FIVE
# ৫  decimal=5  digit=5  numeric=5.0  BENGALI DIGIT FIVE
# ๕  decimal=5  digit=5  numeric=5.0  THAI DIGIT FIVE
# ꩕  decimal=5  digit=5  numeric=5.0  CHAM DIGIT FIVE
# ⑤  decimal=None  digit=5  numeric=5.0  CIRCLED DIGIT FIVE (not a decimal digit)

Note the distinction:
- decimal(): returns a value only for decimal digits (Nd). Returns the default for circled numbers, Roman numerals, and so on.
- digit(): also covers digit-like characters outside Nd, such as superscripts (²) and circled digits (⑤), but not Roman numerals or fractions.
- numeric(): returns a numeric value for all numeric characters, including fractions (½ = 0.5), Roman numerals (Ⅳ = 4.0), and other number forms.
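
As an illustration of cross-script number handling, here is a minimal sketch that converts a run of decimal digits from any script into an int using `unicodedata.decimal()`, which returns a value only for Nd characters. `parse_decimal` is a hypothetical helper; Python's built-in `int()` already accepts Nd digits from any script, so the sketch mainly shows what happens underneath.

```python
import unicodedata


def parse_decimal(text: str) -> int:
    """Convert a string of decimal digits (any script) to an int."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError(f"{ch!r} is not a decimal digit (Nd)")
        value = value * 10 + digit
    return value


print(parse_decimal("42"))            # 42
print(parse_decimal("\u0664\u0662"))  # 42 (Arabic-Indic digits)
print(parse_decimal("\u09ea\u09e8"))  # 42 (Bengali digits)
print(int("\u0664\u0662"))            # 42 (the built-in does the same)
```

Characters such as ⑤ or Ⅳ are rejected here because they have a numeric value but no decimal digit value.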

Bidi Class

The Bidi Class (Bidirectional Category) determines how a character participates in the Unicode Bidirectional Algorithm. Common values:

Class	Description	Examples
L	Left-to-Right	Latin, Greek, Cyrillic, and CJK characters; U+200E LEFT-TO-RIGHT MARK
R	Right-to-Left	Hebrew letters; U+200F RIGHT-TO-LEFT MARK
AL	Arabic Letter	Arabic, Thaana, Syriac
EN	European Number	ASCII digits 0–9
AN	Arabic Number	Arabic-Indic digits
NSM	Nonspacing Mark	Combining marks
WS	Whitespace	Space
ON	Other Neutral	Most punctuation
LRE	Left-to-Right Embedding	U+202A
RLE	Right-to-Left Embedding	U+202B
PDF	Pop Directional Format	U+202C
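
Python exposes this property through `unicodedata.bidirectional()`. The short sketch below prints the class for one representative character per row; `bidi_classes` is a hypothetical helper. The mark characters U+200E and U+200F do not have classes of their own: they report the strong classes L and R.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Bidi Class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


samples = [
    "A",       # Latin letter             -> L
    "\u05d0",  # HEBREW LETTER ALEF       -> R
    "\u0628",  # ARABIC LETTER BEH        -> AL
    "5",       # ASCII digit              -> EN
    "\u0665",  # ARABIC-INDIC DIGIT FIVE  -> AN
    "\u0301",  # COMBINING ACUTE ACCENT   -> NSM
    " ",       # SPACE                    -> WS
    "!",       # EXCLAMATION MARK         -> ON
    "\u202c",  # POP DIRECTIONAL FORMATTING -> PDF
    "\u200e",  # LEFT-TO-RIGHT MARK       -> L
    "\u200f",  # RIGHT-TO-LEFT MARK       -> R
]

for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.bidirectional(ch):<3}  "
          f"{unicodedata.name(ch)}")
```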

We explore how these properties drive the Bidirectional Algorithm in Bidirectional Text in Unicode.

Practical Regex with Unicode Properties

Python with the `regex` Module

Python's built-in re module does not support \p{...} property escapes; its Unicode awareness is limited to classes like \w, \d, and \s. The third-party regex module (install with uv add regex or pip install regex) provides full \p{...} support:

import regex

# Any Unicode letter
regex.findall(r'\p{L}+', "Hello 世界 مرحبا")
# ['Hello', '世界', 'مرحبا']

# Any decimal digit in any script
regex.findall(r'\p{Nd}+', "42 ٤٢ ৪২")
# ['42', '٤٢', '৪২']

# Negated: anything that is NOT a letter or digit
regex.findall(r'\P{Alnum}+', "hello, world! 42")
# [', ', '! ']

# Script-specific
regex.findall(r'\p{Script=Hiragana}+', "Hello はじめまして World")
# ['はじめまして']

# Currency symbols
regex.findall(r'\p{Sc}\d+', "Paid $42 and €15 and ¥1000")
# ['$42', '€15', '¥1000']

JavaScript Unicode Property Escapes

// Available in ES2018+ (Node.js 10+, modern browsers)
// Requires /u flag

// Match any letter
/\p{Letter}+/gu

// Match any number (all numeric categories)
/\p{Number}/gu

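// Match emoji code points (using Emoji property)
// Caveat: \p{Emoji} also matches ASCII digits, # and *;
// \p{Extended_Pictographic} or \p{Emoji_Presentation} is often a better fit
/\p{Emoji}/gu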

// Negate with \P
/\P{ASCII}/gu   // Any non-ASCII character

// Binary properties
/\p{White_Space}/gu
/\p{Uppercase}/gu
/\p{Lowercase}/gu
/\p{Alphabetic}/gu
/\p{Ideographic}/gu   // CJK ideographs

// Practical: validate that a string contains only letters and spaces
function isNameValid(name) {
    return /^[\p{Letter}\p{Space_Separator}]+$/u.test(name);
}

isNameValid("María García");    // true
isNameValid("田中太郎");          // true
isNameValid("User123");          // false (contains digits)

Case Folding

Case folding is the process of normalizing case for caseless matching. It is more thorough than simple lowercasing:

# casefold() is Python's case-folding method (more aggressive than lower())
print('ß'.lower())      # 'ß' (German sharp s is already lowercase)
print('ß'.casefold())   # 'ss' (case-folded to two characters!)

# U+01C5 is the titlecase digraph ǅ
print('\u01C5'.lower())     # 'ǆ' (titlecase → lowercase)
print('\u01C5'.casefold())  # 'ǆ' (folding also maps to the lowercase digraph)

# For caseless comparison, use casefold()
def caseless_equal(s1, s2):
    return s1.casefold() == s2.casefold()

caseless_equal('CAFÉ', 'café')  # True
caseless_equal('straße', 'STRASSE')  # True (ß → ss)
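
Case folding alone does not reconcile different normalization forms, so a precomposed 'É' and a decomposed 'e' + U+0301 still compare unequal after casefold(). The sketch below follows the Unicode Standard's definition of canonical caseless matching, NFD(toCasefold(NFD(X))); the function names are hypothetical.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Build a comparison key: NFD, then case fold, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def canonical_caseless_equal(s1: str, s2: str) -> bool:
    """Compare two strings ignoring case and canonical-equivalent forms."""
    return canonical_caseless_key(s1) == canonical_caseless_key(s2)


precomposed = "CAF\u00c9"   # 'É' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed.casefold() == decomposed.casefold())        # False
print(canonical_caseless_equal(precomposed, decomposed))      # True
print(canonical_caseless_equal("stra\u00dfe", "STRASSE"))     # True
```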

Emoji Properties

Unicode 16.0 defines several emoji-specific properties:

Property	Description
`Emoji`	Is an emoji character
`Emoji_Presentation`	Displays as emoji by default
`Emoji_Modifier`	Can modify other emoji (skin tone modifiers)
`Emoji_Modifier_Base`	Can be modified by Emoji_Modifier
`Emoji_Component`	Used in emoji sequences but not standalone emoji
`Extended_Pictographic`	Any character that could be an emoji
`RGI_Emoji`	Recommended for General Interchange

// Find emoji in text
function extractEmoji(text) {
    // match() returns null when nothing matches, so fall back to []
    return text.match(/\p{RGI_Emoji}/gv) ?? [];
}

// Note: /v flag (Unicode Sets mode) is required for RGI_Emoji
// Available in Node.js 20+, Chrome 112+, Safari 17+
extractEmoji("Hello 😀 World 🌍 café");  // ['😀', '🌍']

Accessing Properties via SymbolFYI

Our Unicode Lookup tool displays all major properties for any character you enter: its code point, official name, General Category, Script, Bidi Class, and more. The Character Counter provides property breakdowns for entire strings, helping you understand the composition of any text.

Summary

Unicode character properties are the foundation of correct internationalized text processing:

- General Category (Lu, Ll, Nd, Po, etc.) classifies what a character is
- Script (Latin, Arabic, Han, etc.) identifies which writing system it belongs to
- Derived properties (Alphabetic, ID_Start, White_Space) provide practical classifications for common use cases
- Numeric value enables cross-script number handling
- Bidi Class drives right-to-left text rendering
- Emoji properties identify and classify emoji characters

Regex engines that support \p{...} property escapes (JavaScript with the u or v flag, Python with the regex module, and Java's java.util.regex, where Pattern.UNICODE_CHARACTER_CLASS additionally makes \w and \d Unicode-aware) expose many of these properties directly in pattern matching.

Next in Series: Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist — Understand the Unicode Bidirectional Algorithm and how Arabic, Hebrew, and other RTL scripts mix with left-to-right text.