SymbolFYI

Unicode Property Escapes (\p{})

Definition

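Regex syntax (\p{Script=Greek}, \p{Letter}) that matches characters by Unicode properties. Supported in JavaScript (ES2018+) and Java; in Python it requires the third-party regex module, because the built-in re module does not support it.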

Unicode Property Escapes

Unicode property escapes are a regular expression syntax that allows matching characters by their Unicode properties rather than by listing individual characters or ranges. Instead of maintaining brittle character class lists, you can write \p{Script=Latin} to match any Latin-script character, regardless of how many such characters exist.

Syntax

\p{Property}         # match character with this property
\p{Property=Value}   # match character where property equals value
\P{Property}         # negation: does NOT have this property
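
As a minimal sketch of the three forms above, the following JavaScript snippet tests a few single characters against each one. The variable names (hasLetter, isGreek, notLetter, samples) are illustrative only.

```javascript
// The three syntax forms, each applied to single characters.
const hasLetter = /\p{Letter}/u;      // \p{Property}
const isGreek = /\p{Script=Greek}/u;  // \p{Property=Value}
const notLetter = /\P{Letter}/u;      // \P{Property}: negation

const samples = ['a', 'λ', '7', '!'];
const results = samples.map((ch) => ({
  ch,
  letter: hasLetter.test(ch),
  greek: isGreek.test(ch),
  nonLetter: notLetter.test(ch),
}));

console.log(results);
// 'a' -> letter: true,  greek: false, nonLetter: false
// 'λ' -> letter: true,  greek: true,  nonLetter: false
// '7' -> letter: false, greek: false, nonLetter: true
// '!' -> letter: false, greek: false, nonLetter: true
```

Note that for a single character, \P{Letter} is always the exact opposite of \p{Letter}.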

JavaScript (ES2018+)

JavaScript supports Unicode property escapes with the /u flag (ES2018) or the /v flag (ES2024):

// Script property
const latinRegex = /\p{Script=Latin}/u;
console.log(latinRegex.test('A'));   // true
console.log(latinRegex.test('α'));   // false (Greek)
console.log(latinRegex.test('日'));  // false (Han)

// Script_Extensions (character used in multiple scripts)
const hanRegex = /\p{Script_Extensions=Han}/u;

// General category
const letterRegex = /\p{Letter}/u;         // any letter
const uppercaseRegex = /\p{Uppercase_Letter}/u;
const digitRegex = /\p{Decimal_Number}/u;

// Emoji (note: \p{Emoji} also matches ASCII digits, '#', and '*')
const emojiRegex = /\p{Emoji}/u;
console.log(emojiRegex.test('😀'));  // true

// Binary properties
const whiteSpaceRegex = /\p{White_Space}/u;
const alphabeticRegex = /\p{Alphabetic}/u;

// Practical: match runs of Han ideographs (kana and Hangul belong to other scripts)
const cjkRegex = /\p{Script=Han}+/gu;
console.log('Hello 世界!'.match(cjkRegex));  // ['世界']
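
The flag requirement is easy to overlook, so here is a small sketch of what happens without it, plus a way to check whether a property name is accepted by the engine. The helper name supportsProperty is hypothetical, not a built-in API.

```javascript
// Without the u (or v) flag, \p is just an escaped letter "p",
// so the pattern searches for the literal text "p{L}".
const withoutFlag = /\p{L}/;
const withFlag = /\p{L}/u;

console.log(withoutFlag.test('é'));     // false
console.log(withoutFlag.test('p{L}'));  // true
console.log(withFlag.test('é'));        // true

// Hypothetical helper: with the u flag, an unknown property name is a
// SyntaxError, which makes it possible to probe for support at runtime.
function supportsProperty(name) {
  try {
    new RegExp(`\\p{${name}}`, 'u');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(supportsProperty('Alphabetic'));    // true
console.log(supportsProperty('Alphanumeric'));  // false (not a Unicode property)
```

Because a missing flag fails silently rather than throwing, a pattern that never matches non-ASCII input is worth checking for a forgotten /u.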

Python (regex module)

Python's built-in re module does not support Unicode property escapes. Use the third-party regex module:

import regex  # pip install regex

# Match Latin script
pattern = regex.compile(r'\p{Script=Latin}+')
print(pattern.findall('Hello, 世界!'))  # ['Hello']

# Match any letter in any script
word_pattern = regex.compile(r'\p{L}+')
print(word_pattern.findall('Hello, 世界, мир!'))  # ['Hello', '世界', 'мир']

# Emoji
emoji_pattern = regex.compile(r'\p{Emoji}')
print(emoji_pattern.findall('Hi 😀 there 🎉'))  # ['😀', '🎉']

# Negation: find non-ASCII characters
non_ascii = regex.compile(r'\P{ASCII}+')
print(non_ascii.findall('Café résumé'))  # ['é', 'é', 'é']

Commonly Used Properties

Property               Example              Matches
Script=Latin           \p{Script=Latin}     A–Z, a–z, accented Latin letters
Script=Cyrillic        \p{Script=Cyrillic}  Russian, Bulgarian, etc.
Script=Han             \p{Script=Han}       Chinese, Japanese kanji
Letter / L             \p{L}                Any letter in any script
Number / N             \p{N}                Any numeric character
Emoji                  \p{Emoji}            Emoji characters
White_Space            \p{White_Space}      Spaces, tabs, newlines
Uppercase_Letter / Lu  \p{Lu}               Uppercase letters
Lowercase_Letter / Ll  \p{Ll}               Lowercase letters
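
To show several of these properties working together, the following sketch tallies the characters of a string by category using the short aliases. The names categories and countByCategory are illustrative assumptions, not part of any library.

```javascript
// Short aliases from the table, checked in order for each character.
const categories = {
  uppercase: /\p{Lu}/u,
  lowercase: /\p{Ll}/u,
  number: /\p{N}/u,
  whitespace: /\p{White_Space}/u,
};

function countByCategory(text) {
  const counts = { uppercase: 0, lowercase: 0, number: 0, whitespace: 0, other: 0 };
  // for...of walks the string by code point, not by UTF-16 code unit.
  for (const ch of text) {
    const match = Object.keys(categories).find((name) => categories[name].test(ch));
    counts[match || 'other'] += 1;
  }
  return counts;
}

console.log(countByCategory('Привет World 42!'));
// { uppercase: 2, lowercase: 9, number: 2, whitespace: 2, other: 1 }
```

The Cyrillic and Latin letters land in the same buckets because Lu and Ll are defined across scripts.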

Why Property Escapes Matter

Before property escapes, matching "any letter" in a multilingual application required enormous hand-maintained character class lists or complex workarounds. With \p{L}, a single readable token covers the letters among the more than 149,000 characters Unicode defines. Newly added characters are picked up when the regex engine updates its Unicode data, with no change to the pattern itself.
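
The difference is easiest to see side by side. This sketch extracts words from a mixed-script string, first with an ASCII character class and then with \p{L}; the sample text and variable names are made up for illustration.

```javascript
// Normalize to NFC so accented letters are single code points; in decomposed
// text the combining marks are \p{M}, which \p{L} does not match.
const text = 'naïve café, Москва, 東京'.normalize('NFC');

const asciiWords = text.match(/[A-Za-z]+/g);
const unicodeWords = text.match(/\p{L}+/gu);

console.log(asciiWords);    // [ 'na', 've', 'caf' ]
console.log(unicodeWords);  // [ 'naïve', 'café', 'Москва', '東京' ]
```

The ASCII class splits words at every accented letter and drops the Cyrillic and Han text entirely, while the property escape keeps each word intact.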
