Unicode Property Escapes
Unicode property escapes are a regular expression syntax that allows matching characters by their Unicode properties rather than by listing individual characters or ranges. Instead of maintaining brittle character class lists, you can write \p{Script=Latin} to match any Latin-script character, regardless of how many such characters exist.
Syntax
\p{Property} # match character with this property
\p{Property=Value} # match character where property equals value
\P{Property} # negation: does NOT have this property
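As a rough illustration, the sketch below writes the three forms as JavaScript regex literals, using the u flag covered in the next section. The variable names are illustrative only.

```javascript
// The three forms as JavaScript regex literals with the u flag.
const isLowercase = /\p{Lowercase}/u;    // \p{Property}: binary property
const isGreek = /\p{Script=Greek}/u;     // \p{Property=Value}
const isNotGreek = /\P{Script=Greek}/u;  // \P{...}: negated form

console.log(isLowercase.test('a')); // true
console.log(isGreek.test('α'));     // true
console.log(isNotGreek.test('α'));  // false
console.log(isNotGreek.test('a'));  // true

// Without the u (or v) flag, \p is not a property escape:
// the pattern matches the literal text "p{Script=Greek}".
const noFlag = /\p{Script=Greek}/;
console.log(noFlag.test('α'));               // false
console.log(noFlag.test('p{Script=Greek}')); // true
```

The last two lines show why a forgotten flag is easy to miss: the pattern still compiles, it just matches something else.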
JavaScript (ES2018+)
JavaScript supports Unicode property escapes with the /u flag (ES2018) or the /v flag (ES2024):
// Script property
const latinRegex = /\p{Script=Latin}/u;
console.log(latinRegex.test('A')); // true
console.log(latinRegex.test('α')); // false (Greek)
console.log(latinRegex.test('日')); // false (Han)
// Script_Extensions (also matches characters shared between several scripts)
const hanRegex = /\p{Script_Extensions=Han}/u;
// General category
const letterRegex = /\p{Letter}/u; // any letter
const uppercaseRegex = /\p{Uppercase_Letter}/u;
const digitRegex = /\p{Decimal_Number}/u;
// Emoji (note: \p{Emoji} also matches the ASCII digits, # and *)
const emojiRegex = /\p{Emoji}/u;
console.log(emojiRegex.test('😀')); // true
// Binary properties
const whiteSpaceRegex = /\p{White_Space}/u;
const alphabeticRegex = /\p{Alphabetic}/u; // \p{Alphanumeric} is not a valid property and throws a SyntaxError
// Practical: match runs of Han characters (kana and hangul belong to other scripts)
const cjkRegex = /\p{Script=Han}+/gu;
console.log('Hello 世界!'.match(cjkRegex)); // ['世界']
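Two of the properties above behave in ways that are easy to misread. The sketch below is a minimal check of both: the test characters and variable names are chosen here for illustration and are not part of any API.

```javascript
// U+3001 IDEOGRAPHIC COMMA has Script=Common, but its Script_Extensions
// list includes Han, so only the second pattern matches it.
const ideographicComma = '\u3001';
const commaIsHanScript = /\p{Script=Han}/u.test(ideographicComma);
const commaIsHanExtension = /\p{Script_Extensions=Han}/u.test(ideographicComma);
console.log(commaIsHanScript);    // false
console.log(commaIsHanExtension); // true

// \p{Emoji} also covers ASCII digits, '#' and '*', because they can start
// keycap emoji sequences. \p{Extended_Pictographic} excludes them.
const digitIsEmoji = /\p{Emoji}/u.test('7');
const digitIsPictographic = /\p{Extended_Pictographic}/u.test('7');
const faceIsPictographic = /\p{Extended_Pictographic}/u.test('😀');
console.log(digitIsEmoji);        // true
console.log(digitIsPictographic); // false
console.log(faceIsPictographic);  // true
```

When the goal is to find pictographic emoji in ordinary text, \p{Extended_Pictographic} is usually the safer starting point, since \p{Emoji} would also flag every digit.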
Python (regex module)
Python's built-in re module does not support Unicode property escapes. Use the third-party regex module:
import regex # pip install regex
# Match Latin script
pattern = regex.compile(r'\p{Script=Latin}+')
print(pattern.findall('Hello, 世界!')) # ['Hello']
# Match any letter in any script
word_pattern = regex.compile(r'\p{L}+')
print(word_pattern.findall('Hello, 世界, мир!')) # ['Hello', '世界', 'мир']
# Emoji
emoji_pattern = regex.compile(r'\p{Emoji}')
print(emoji_pattern.findall('Hi 😀 there 🎉')) # ['😀', '🎉']
# Negation: find non-ASCII characters
non_ascii = regex.compile(r'\P{ASCII}+')
print(non_ascii.findall('Café résumé')) # ['é', 'é', 'é']
Commonly Used Properties
| Property | Example | Matches |
|---|---|---|
| Script=Latin | \p{Script=Latin} | A–Z, a–z, accented Latin letters |
| Script=Cyrillic | \p{Script=Cyrillic} | Russian, Bulgarian, etc. |
| Script=Han | \p{Script=Han} | Han ideographs (Chinese hanzi, Japanese kanji, Korean hanja) |
| Letter / L | \p{L} | Any letter in any script |
| Number / N | \p{N} | Any numeric character |
| Emoji | \p{Emoji} | Emoji characters (also the ASCII digits, # and *) |
| White_Space | \p{White_Space} | Spaces, tabs, newlines |
| Uppercase_Letter | \p{Lu} | Uppercase letters |
| Lowercase_Letter | \p{Ll} | Lowercase letters |
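To show how several of these properties can be combined, here is a minimal JavaScript sketch that labels each character of a string. The `classify` helper and the `classifiers` list are hypothetical names introduced for this example, and the order of the list decides which label wins when more than one property applies.

```javascript
// Hypothetical helper: label a single character using properties from the table.
const classifiers = [
  ['Latin', /^\p{Script=Latin}$/u],
  ['Cyrillic', /^\p{Script=Cyrillic}$/u],
  ['Han', /^\p{Script=Han}$/u],
  ['Number', /^\p{N}$/u],
  ['White_Space', /^\p{White_Space}$/u],
];

function classify(char) {
  for (const [label, pattern] of classifiers) {
    if (pattern.test(char)) return label;
  }
  return 'Other';
}

// Spreading a string iterates by code point, so astral characters stay intact.
const labels = [...'Aж世5 !'].map(classify);
console.log(labels);
// ['Latin', 'Cyrillic', 'Han', 'Number', 'White_Space', 'Other']
```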
Why Property Escapes Matter
Before property escapes, matching "any letter" in a multilingual application required enormous character class lists or complex workarounds. With \p{L}, a single readable token covers the letters among the more than 149,000 characters in Unicode 15. The pattern itself never needs editing when new characters are added: coverage grows whenever the regex engine or runtime updates its Unicode data.
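The difference is easy to see side by side. The following JavaScript sketch compares a hand-written ASCII class with \p{L} on a short mixed-script string, where both the sample text and the variable names are made up for illustration.

```javascript
const text = 'Straße φως мир 世界';

// Hand-written ASCII range: splits "Straße" at ß and misses the other scripts.
const asciiWords = text.match(/[A-Za-z]+/g);
console.log(asciiWords); // ['Stra', 'e']

// Property escape: one token covers letters in every script.
const unicodeWords = text.match(/\p{L}+/gu);
console.log(unicodeWords); // ['Straße', 'φως', 'мир', '世界']
```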