SymbolFYI

Unicode Property Escapes (\p{})

Programming & Dev

Định nghĩa

Regex syntax (\p{Script=Greek}, \p{Letter}) that matches characters by Unicode properties. Supported in JS, Java, Python 3.8+.

Unicode Property Escapes

Unicode property escapes are a regular expression syntax that allows matching characters by their Unicode properties rather than by listing individual characters or ranges. Instead of maintaining brittle character class lists, you can write \p{Script=Latin} to match any Latin-script character, regardless of how many such characters exist.

Syntax

\p{Property}         # match character with this property
\p{Property=Value}   # match character where property equals value
\P{Property}         # negation: does NOT have this property

JavaScript (ES2018+)

JavaScript supports Unicode property escapes with the /u flag (ES2018) or the /v flag (ES2024):

// Script property
const latinRegex = /\p{Script=Latin}/u;
console.log(latinRegex.test('A'));   // true
console.log(latinRegex.test('α'));   // false (Greek)
console.log(latinRegex.test('日'));  // false (Han)

// Script_Extensions (character used in multiple scripts)
const hanRegex = /\p{Script_Extensions=Han}/u;

// General category
const letterRegex = /\p{Letter}/u;         // any letter
const uppercaseRegex = /\p{Uppercase_Letter}/u;
const digitRegex = /\p{Decimal_Number}/u;

// Emoji
const emojiRegex = /\p{Emoji}/u;
console.log(emojiRegex.test('😀'));  // true

// Binary properties
const whiteSpaceRegex = /\p{White_Space}/u;
const alphanumericRegex = /\p{Alphanumeric}/u;

// Practical: match all CJK characters
const cjkRegex = /\p{Script=Han}+/gu;
console.log('Hello 世界!'.match(cjkRegex));  // ['世界']

Python (regex module)

Python's built-in re module does not support Unicode property escapes. Use the third-party regex module:

import regex  # pip install regex

# Match Latin script
pattern = regex.compile(r'\p{Script=Latin}+')
print(pattern.findall('Hello, 世界!'))  # ['Hello']

# Match any letter in any script
word_pattern = regex.compile(r'\p{L}+')
print(word_pattern.findall('Hello, 世界, мир!'))  # ['Hello', '世界', 'мир']

# Emoji
emoji_pattern = regex.compile(r'\p{Emoji}')
print(emoji_pattern.findall('Hi 😀 there 🎉'))  # ['😀', '🎉']

# Negation: find non-ASCII characters
non_ascii = regex.compile(r'\P{ASCII}+')
print(non_ascii.findall('Café résumé'))  # ['é', 'é', 'é']

Commonly Used Properties

Property	Example	Matches
`Script=Latin`	`\p{Script=Latin}`	A–Z, a–z, accented Latin letters
`Script=Cyrillic`	`\p{Script=Cyrillic}`	Russian, Bulgarian, etc.
`Script=Han`	`\p{Script=Han}`	Chinese, Japanese kanji
`Letter` / `L`	`\p{L}`	Any letter in any script
`Number` / `N`	`\p{N}`	Any numeric character
`Emoji`	`\p{Emoji}`	Emoji characters
`White_Space`	`\p{White_Space}`	Spaces, tabs, newlines
`Uppercase_Letter`	`\p{Lu}`	Uppercase letters
`Lowercase_Letter`	`\p{Ll}`	Lowercase letters

Why Property Escapes Matter

Before property escapes, matching "any letter" in a multilingual application required enormous character class lists or complex workarounds. With \p{L}, you get correct behavior across all 149,000+ Unicode characters with a single, readable token — and the regex engine automatically stays up to date as new Unicode versions add characters.

Unicode Property Escapes (\p{})

Unicode Property Escapes

Syntax

JavaScript (ES2018+)

Python (regex module)

Commonly Used Properties

Why Property Escapes Matter

Ký hiệu liên quan

Thuật ngữ liên quan

Công cụ liên quan

Hướng dẫn liên quan