SymbolFYI

Unicode Property Escapes (\p{})

Definition

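Regex syntax (\p{Script=Greek}, \p{Letter}) that matches characters by Unicode property. Supported natively in JavaScript (ES2018+) and Java, with some differences in property-name spelling between engines. Python's built-in re module does not support it; the third-party regex module does.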

Unicode Property Escapes

Unicode property escapes are a regular expression syntax that allows matching characters by their Unicode properties rather than by listing individual characters or ranges. Instead of maintaining brittle character class lists, you can write \p{Script=Latin} to match any Latin-script character, regardless of how many such characters exist.
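To make the contrast concrete, here is a minimal sketch comparing a hand-written character class with a property escape. The range in manualLatin and the sample words are illustrative assumptions, not taken from any particular codebase.

```javascript
// Hand-maintained class: ASCII letters plus the Latin-1 letter ranges only.
const manualLatin = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;
// Property escape: every character whose Script property is Latin.
const propertyLatin = /^\p{Script=Latin}+$/u;

const samples = ['Cafe', 'Café', 'Łódź', 'Straße', 'Ελλάδα'];
const results = samples.map((word) => ({
  word,
  manual: manualLatin.test(word),
  property: propertyLatin.test(word),
}));
console.log(results);
// 'Łódź' fails the manual class (Ł and ź lie outside Latin-1) but passes the
// property escape; the Greek word fails both.
```

The manual class would need another range for every block of Latin letters it misses, while the property escape already covers them.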

Syntax

\p{Property}         # match character with this property
\p{Property=Value}   # match character where property equals value
\P{Property}         # negation: does NOT have this property
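As a quick illustration of the three forms, the sketch below uses JavaScript with the u flag; the variable names are arbitrary. It also shows that JavaScript accepts both long and short spellings of a General_Category value, and that the escape is only recognised when the u (or v) flag is present.

```javascript
// \p{Property=Value}: characters whose Script is Greek.
const greek = /\p{Script=Greek}/u;
// \p{Property}: a binary property or General_Category value, here Letter.
const letter = /\p{Letter}/u;
// \P{...}: any character that is NOT a letter.
const nonLetter = /\P{Letter}/u;

console.log(greek.test('α'));      // true
console.log(greek.test('a'));      // false
console.log(letter.test('ж'));     // true
console.log(nonLetter.test('7'));  // true
console.log(nonLetter.test('ж'));  // false

// Long and short spellings of a General_Category value are interchangeable.
const forms = [/\p{L}/u, /\p{Letter}/u, /\p{gc=L}/u, /\p{General_Category=Letter}/u];
console.log(forms.every((re) => re.test('é')));  // true

// Without the u (or v) flag, \p is not a property escape in JavaScript.
const noFlagMatchesLetter = /\p{L}/.test('a');      // false
const noFlagMatchesLiteral = /\p{L}/.test('p{L}');  // true: matched literally
console.log(noFlagMatchesLetter, noFlagMatchesLiteral);
```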

JavaScript (ES2018+)

JavaScript supports Unicode property escapes with the /u flag (ES2018) or the /v flag (ES2024):

// Script property
const latinRegex = /\p{Script=Latin}/u;
console.log(latinRegex.test('A'));   // true
console.log(latinRegex.test('α'));   // false (Greek)
console.log(latinRegex.test('日'));  // false (Han)

// Script_Extensions (character used in multiple scripts)
const hanRegex = /\p{Script_Extensions=Han}/u;

// General category
const letterRegex = /\p{Letter}/u;         // any letter
const uppercaseRegex = /\p{Uppercase_Letter}/u;
const digitRegex = /\p{Decimal_Number}/u;

// Emoji
const emojiRegex = /\p{Emoji}/u;
console.log(emojiRegex.test('😀'));  // true

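// Binary properties
const whiteSpaceRegex = /\p{White_Space}/u;
// Note: \p{Alphanumeric} is not a valid property name in JavaScript and throws a SyntaxError.
const alphabeticRegex = /\p{Alphabetic}/u;
const alphanumericRegex = /[\p{Alphabetic}\p{Decimal_Number}]/u;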

// Practical: match runs of Han ideographs
// (kana and hangul have their own Script values and are not matched)
const cjkRegex = /\p{Script=Han}+/gu;
console.log('Hello 世界!'.match(cjkRegex));  // ['世界']
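Two of the properties used above behave in ways that are easy to misread, so a short sketch may help. It assumes the runtime's Unicode data lists Han in the Script_Extensions of U+3001, which has been the case in recent Unicode versions; the variable names are illustrative.

```javascript
// U+3001 IDEOGRAPHIC COMMA: its Script is Common, but its Script_Extensions
// include Han (among other East Asian scripts).
const ideographicComma = '\u3001';
const byScript = /\p{Script=Han}/u.test(ideographicComma);
const byExtensions = /\p{Script_Extensions=Han}/u.test(ideographicComma);
console.log(byScript, byExtensions);  // false true

// \p{Emoji} also covers ASCII digits, '#' and '*', because they can start
// keycap sequences. Emoji_Presentation and Extended_Pictographic are narrower.
const digitIsEmoji = /\p{Emoji}/u.test('7');
const digitIsPresentation = /\p{Emoji_Presentation}/u.test('7');
const faceIsPresentation = /\p{Emoji_Presentation}/u.test('😀');
const faceIsPictographic = /\p{Extended_Pictographic}/u.test('😀');
console.log(digitIsEmoji, digitIsPresentation);      // true false
console.log(faceIsPresentation, faceIsPictographic); // true true
```

For filtering emoji out of user input, one of the narrower properties is usually closer to what is intended than \p{Emoji} on its own.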

Python (regex module)

Python's built-in re module does not support Unicode property escapes. Use the third-party regex module:

import regex  # pip install regex

# Match Latin script
pattern = regex.compile(r'\p{Script=Latin}+')
print(pattern.findall('Hello, 世界!'))  # ['Hello']

# Match any letter in any script
word_pattern = regex.compile(r'\p{L}+')
print(word_pattern.findall('Hello, 世界, мир!'))  # ['Hello', '世界', 'мир']

# Emoji
emoji_pattern = regex.compile(r'\p{Emoji}')
print(emoji_pattern.findall('Hi 😀 there 🎉'))  # ['😀', '🎉']

# Negation: find non-ASCII characters
non_ascii = regex.compile(r'\P{ASCII}+')
print(non_ascii.findall('Café résumé'))  # ['é', 'é', 'é']

Commonly Used Properties

Property                Example             Matches
Script=Latin            \p{Script=Latin}    A–Z, a–z, accented Latin letters
Script=Cyrillic         \p{Script=Cyrillic} Letters used for Russian, Bulgarian, etc.
Script=Han              \p{Script=Han}      Han ideographs (Chinese hanzi, Japanese kanji, Korean hanja)
Letter / L              \p{L}               Any letter in any script
Number / N              \p{N}               Any numeric character
Emoji                   \p{Emoji}           Emoji characters (also the digits 0–9, # and *)
White_Space             \p{White_Space}     Spaces, tabs, newlines
Uppercase_Letter / Lu   \p{Lu}              Uppercase letters
Lowercase_Letter / Ll   \p{Ll}              Lowercase letters
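The properties in the table can be combined into a small classifier. The following is a sketch only: classify is a hypothetical helper, and the labels and order of checks are choices made for this example.

```javascript
// Hypothetical helper: label a single character using properties from the table.
function classify(ch) {
  if (/^\p{Lu}$/u.test(ch)) return 'uppercase letter';
  if (/^\p{Ll}$/u.test(ch)) return 'lowercase letter';
  if (/^\p{L}$/u.test(ch)) return 'other letter';   // e.g. ideographs (Lo)
  if (/^\p{N}$/u.test(ch)) return 'number';
  if (/^\p{White_Space}$/u.test(ch)) return 'white space';
  return 'other';
}

// Array.from iterates by code point, so each element is one full character.
const labels = Array.from('Aé世½ !', (ch) => classify(ch));
console.log(labels);
// ['uppercase letter', 'lowercase letter', 'other letter',
//  'number', 'white space', 'other']
```

The specific categories (Lu, Ll) are tested before the general one (L) because every uppercase or lowercase letter also matches \p{L}.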

Why Property Escapes Matter

Before property escapes, matching "any letter" in a multilingual application required enormous character class lists or complex workarounds. With \p{L}, a single readable token covers every letter among the 150,000+ characters defined in recent Unicode versions. The pattern itself never needs editing when a new Unicode version adds characters: coverage grows whenever the regex engine or runtime updates its Unicode data.
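As a sketch of the multilingual case, the example below extracts words from mixed-script text in JavaScript. The sample sentence is made up for illustration. It also shows one practical caveat: scripts such as Devanagari attach combining marks to letters, so adding \p{M} (Mark) to the class keeps those words in one piece.

```javascript
const text = 'Hello, 世界, мир, नमस्ते!';

// Letters only: the Devanagari word is split at its combining marks.
const lettersOnly = text.match(/\p{L}+/gu);
console.log(lettersOnly.length);  // 5

// Letters plus combining marks: each word stays whole.
const withMarks = text.match(/[\p{L}\p{M}]+/gu);
console.log(withMarks.length);    // 4
console.log(withMarks);
```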
