SymbolFYI

Unicode-Aware Regex: Property Escapes and Multilingual Patterns

Most regex tutorials teach patterns that work fine for ASCII and silently break for everything else. When your input includes Arabic names, emoji, East Asian characters, or mathematical symbols, you need Unicode-aware regex. The good news: modern JavaScript and Python both have solid Unicode support. The bad news: they require explicit opt-in and have different APIs.

The Core Problem: Bytes vs. Characters

A naive regex like [a-zA-Z]+ matches ASCII letters only. It misses é, ñ, ø, Ü, α, and thousands of other letters that Unicode defines. In internationalized applications this silently rejects or truncates valid input.

The underlying issue is that regex engines historically operated on bytes or ASCII. Unicode-aware mode makes the engine treat patterns as operating on Unicode code points and character properties.
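
To make the difference concrete, here is a minimal sketch that runs an ASCII-only class and a property-based class over the same input. The sample string and variable names are illustrative only, and the sketch assumes a runtime with ES2018 property escapes.

```javascript
// Illustrative input: one ASCII word and three words with non-ASCII letters.
// Escapes are used for é and Ü so the precomposed forms are guaranteed.
const text = "naive caf\u00E9 \u00DCnal αβγ";

// ASCII-only class: every non-ASCII letter ends the current run.
const asciiRuns = text.match(/[a-zA-Z]+/g);

// Unicode-aware class: letters plus combining marks from any script.
const unicodeRuns = text.match(/[\p{L}\p{M}]+/gu);

console.log(asciiRuns);   // [ 'naive', 'caf', 'nal' ]
console.log(unicodeRuns); // [ 'naive', 'café', 'Ünal', 'αβγ' ]
```

The ASCII pattern does not fail loudly; it returns fragments that look plausible, which is why this class of bug tends to go unnoticed.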

JavaScript: The u and v Flags

The u flag (ES2015+)

The u flag enables proper Unicode mode in JavaScript regex:

// Without u: \w matches only [a-zA-Z0-9_]
/^\w+$/.test("café")        // false — é is not matched
/^\w+$/.test("名前")         // false — CJK not matched

// With u: enables Unicode code point mode
// \w still means [a-zA-Z0-9_] — same as without u
// But escapes and code points work correctly:
/^\u{1F600}$/u.test("😀")    // true — \u{} works only with u flag
/^.$/u.test("😀")            // true — . matches full emoji code point

// Without u, . matches only one UTF-16 code unit:
/^.$/.test("😀")             // false: emoji is 2 code units (surrogate pair)
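
The same code-unit behavior also affects character classes and quantifiers, not just the dot. The sketch below builds the patterns with the RegExp constructor so the same source can be compiled with and without u; the variable names are illustrative.

```javascript
const grin = "\u{1F600}";   // 😀, stored as the surrogate pair D83D DE00
const loneHigh = "\uD83D";  // first half of that pair, not a valid character

// Character class: without u, [😀] means "either surrogate half".
const classNoU = new RegExp("^[" + grin + "]$").test(loneHigh);       // true
const classU   = new RegExp("^[" + grin + "]$", "u").test(loneHigh);  // false

// Quantifier: without u, {2} repeats only the trailing surrogate.
const twiceNoU = new RegExp("^" + grin + "{2}$").test(grin + grin);      // false
const twiceU   = new RegExp("^" + grin + "{2}$", "u").test(grin + grin); // true

console.log({ classNoU, classU, twiceNoU, twiceU });
```

Without u, a class containing an emoji can match half of a character, and a quantifier after an emoji applies to its second code unit only.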

The u flag also enables Unicode property escapes.

Unicode property escapes (ES2018+)

\p{Property=Value} and \P{Property=Value} (negation) match characters by their Unicode properties:

// General category: Letter
const hasLetter = /\p{L}/u;
hasLetter.test("hello")    // true
hasLetter.test("مرحبا")    // true — Arabic letters
hasLetter.test("12345")    // false — only digits

// Specific category: Uppercase Letter
/\p{Lu}/u.test("A")        // true
/\p{Lu}/u.test("a")        // false

// Script property
/\p{Script=Latin}/u.test("hello")     // true
/\p{Script=Arabic}/u.test("مرحبا")   // true
/\p{Script=Han}/u.test("你好")        // true
/\p{Script=Cyrillic}/u.test("привет") // true

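// Script_Extensions also covers shared characters whose Script is Common or Inherited.
// U+30FC (prolonged sound mark) has Script=Common but is used with Hiragana and Katakana:
/\p{Script=Hiragana}/u.test("\u30FC")            // false
/\p{Script_Extensions=Hiragana}/u.test("\u30FC") // true
// ASCII digits are used everywhere, so they stay Common under both properties:
/\p{Script_Extensions=Latin}/u.test("0")         // false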

// Match any Unicode letter (not just ASCII)
const unicodeLetter = /^[\p{L}\p{M}]+$/u;
unicodeLetter.test("café")     // true
unicodeLetter.test("résumé")   // true
unicodeLetter.test("名前")      // true
unicodeLetter.test("123")      // false
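
Property escapes also work inside negated classes and in the negated \P{...} form. A small sketch, with hypothetical helper names (keepWordChars, firstNonLetter) that are not part of any library:

```javascript
// Keep letters, marks, decimal digits and spaces; drop everything else.
const keepWordChars = (s) => s.replace(/[^\p{L}\p{M}\p{Nd} ]/gu, "");

// \P{L} is the negated form: "any code point that is not a letter".
const firstNonLetter = (s) => {
  const m = s.match(/\P{L}/u);
  return m ? m[0] : null;
};

console.log(keepWordChars("Привет, мир! 你好 #42")); // "Привет мир 你好 42"
console.log(firstNonLetter("na\u00EFve!"));          // "!"
console.log(firstNonLetter("你好"));                  // null
```

Note that a decomposed accent (base letter plus combining mark) would be reported by firstNonLetter, because combining marks are category M, not L.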

Emoji matching

Emoji require special handling because many emoji are sequences of multiple code points:

// Basic emoji: single code point
/\p{Emoji}/u.test("😀")        // true
/\p{Emoji}/u.test("1")         // true (!): ASCII digits, # and * have Emoji=Yes (keycap bases)
/\p{Emoji}/u.test("A")         // false

// Better: Emoji_Presentation matches characters rendered as emoji by default
/\p{Emoji_Presentation}/u.test("😀")   // true
/\p{Emoji_Presentation}/u.test("1")    // false

// Emoji sequences (flag emoji, family emoji, skin tone variants)
// These require matching the full sequence:
const flagEmoji = /\p{Regional_Indicator}\p{Regional_Indicator}/u;
flagEmoji.test("🇺🇸")  // true — U+1F1FA + U+1F1F8

// ZWJ sequences (e.g., 👨‍💻 = man + ZWJ + laptop)
// Use the 'v' flag or handle sequences explicitly:
const emojiSeq = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}|\p{Emoji_Presentation}(?:\u{FE0F}\u{20E3}?|[\u{1F3FB}-\u{1F3FF}])?(?:\u{200D}(?:\p{Emoji_Presentation}|\p{Emoji_Modifier_Base}\p{Emoji_Modifier}|\p{Regional_Indicator}{2}|\p{Emoji_Presentation}\u{FE0F}?))*\u{FE0F}?/gu;

For reliable emoji sequence matching, use the Intl.Segmenter API (discussed in article 4) or a dedicated library like graphemer rather than building your own regex.
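
As a rough sketch of the Intl.Segmenter approach, assuming a runtime that ships it (current browsers and recent Node.js versions do); graphemes is a hypothetical helper name:

```javascript
// Split a string into user-perceived characters (grapheme clusters).
function graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const technologist = "\u{1F468}\u200D\u{1F4BB}"; // man + ZWJ + laptop
const flag = "\u{1F1FA}\u{1F1F8}";               // two regional indicators

console.log(technologist.length);                       // 5 UTF-16 code units
console.log([...technologist].length);                  // 3 code points
console.log(graphemes(technologist).length);            // 1 grapheme cluster
console.log(graphemes("a" + flag + "e\u0301").length);  // 3
```

The segmenter applies the Unicode grapheme cluster rules, so ZWJ sequences, flags, and base-plus-combining-mark pairs each come back as a single element.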

The v flag (ES2024+)

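The v flag (unicodeSets) is the successor to u. It adds set notation inside character classes (intersection, subtraction, nested classes) and properties of strings such as \p{RGI_Emoji}. The set operations look like this: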

// Set intersection with v flag
/[\p{Letter}&&\p{ASCII}]/v  // ASCII letters only

// Set subtraction
/[\p{Letter}--\p{ASCII}]/v  // non-ASCII letters only

// Nested classes
/[\p{Script=Latin}&&[A-Z]]/v  // uppercase Latin letters

The v flag is incompatible with u — use one or the other, not both.
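
Because older engines reject the v flag with a SyntaxError at parse time, one option is to feature-detect it through the RegExp constructor. The sketch below is only an illustration: supportsVFlag and compiles are hypothetical helper names, and the \p{RGI_Emoji} check runs only where v is available.

```javascript
// Returns true when source/flags compile, false on SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch {
    return false;
  }
}

// Feature-detect the v flag without using a regex literal,
// which would fail to parse on engines that lack it.
const supportsVFlag = () => compiles("", "v");

if (supportsVFlag()) {
  // \p{RGI_Emoji} is a property of strings: it can match multi-code-point emoji.
  const emoji = new RegExp("^\\p{RGI_Emoji}$", "v");
  console.log(emoji.test("\u{1F468}\u200D\u{1F4BB}")); // true: the whole ZWJ sequence
  console.log(emoji.test("A"));                        // false
}

// Combining u and v never compiles, whether or not v is supported.
console.log(compiles(".", "uv")); // false
```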

Python: re and regex Modules

The re module with Unicode

Python 3's re module treats string patterns as Unicode by default. \w, \d, and \s match Unicode equivalents:

import re

# Python 3: \w matches Unicode letters, digits, and underscore
re.match(r'^\w+$', 'café')    # matches — é is a word character
re.match(r'^\w+$', '名前')     # matches — CJK are word characters
re.match(r'^\d+$', '١٢٣')    # matches — Arabic-Indic digits

# re.UNICODE is already the default for str patterns, so passing it is redundant.
# It cannot be used with bytes patterns (ValueError); use re.ASCII to opt out for str.
re.match(r'^\w+$', 'hello', re.UNICODE)

However, Python's re does not support Unicode property escapes (\p{}). For those, you need the third-party regex module.

The regex module

# Install from PyPI first (shell command): pip install regex
import regex

# Unicode property escapes
regex.match(r'^\p{L}+$', 'café')    # matches
regex.match(r'^\p{L}+$', '名前')     # matches
regex.match(r'^\p{L}+$', '123')     # no match

# Script matching
regex.match(r'^\p{Script=Latin}+$', 'hello')     # matches
regex.match(r'^\p{Script=Arabic}+$', 'مرحبا')    # matches
regex.match(r'^\p{Script=Han}+$', '你好')         # matches

# Unicode categories
regex.findall(r'\p{Lu}', 'Hello World')   # ['H', 'W'] — uppercase letters
regex.findall(r'\p{Nd}', 'abc123٤٥٦')    # ['1','2','3','٤','٥','٦'] — decimal digits

# Grapheme clusters with \X
regex.findall(r'\X', 'café')  # ['c', 'a', 'f', 'é'] — proper grapheme splitting
regex.findall(r'\X', '👨‍💻')   # ['👨‍💻'] — full ZWJ sequence as one unit

The \X metacharacter in regex is invaluable — it matches one full grapheme cluster (what a user perceives as one character), handling combining marks, ZWJ sequences, and regional indicators correctly.

Unicode Property Categories Reference

The most useful general categories and binary properties for validation:

Property               Shorthand  Description
Letter                 L          Any letter (Lu + Ll + Lt + Lm + Lo)
Uppercase_Letter       Lu         Uppercase letters: A, Ö, Ш
Lowercase_Letter       Ll         Lowercase letters: a, ö, ш
Decimal_Number         Nd         Decimal digits in any script
Connector_Punctuation  Pc         Underscore and similar
Mark                   M          Combining marks (diacritics)
Separator              Z          Spaces and separators
Other                  C          Control, format, surrogate, private-use, and unassigned code points
Emoji_Presentation     EPres      Binary property: characters rendered as emoji by default
White_Space            space      Binary property: all Unicode whitespace
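
To see which of these categories a given character falls into, a lookup can be sketched with one precompiled pattern per category. generalCategory is a hypothetical helper that checks only the categories listed in CATEGORIES and reports "other" for anything else (for example Lt, Lm, or punctuation).

```javascript
// Categories to probe, most specific first.
const CATEGORIES = ["Lu", "Ll", "Lo", "Nd", "Pc", "M", "Z", "C"];

// One anchored, precompiled pattern per category.
const TESTERS = CATEGORIES.map((gc) => [gc, new RegExp(`^\\p{${gc}}$`, "u")]);

// Return the first listed category that matches a single code point.
function generalCategory(ch) {
  const hit = TESTERS.find(([, re]) => re.test(ch));
  return hit ? hit[0] : "other";
}

console.log(generalCategory("A"));      // "Lu"
console.log(generalCategory("你"));     // "Lo"
console.log(generalCategory("٣"));      // "Nd"
console.log(generalCategory("_"));      // "Pc"
console.log(generalCategory("\u0301")); // "M"
console.log(generalCategory("!"));      // "other"
```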

Common Validation Patterns

Name validation (any language)

// Allow letters and marks from any script, plus whitespace, hyphens, and apostrophes
const namePattern = /^[\p{L}\p{M}][\p{L}\p{M}\s\-']{0,99}$/u;

namePattern.test("Alice")           // true
namePattern.test("María José")      // true
namePattern.test("李明")             // true
namePattern.test("محمد")            // true
namePattern.test("O'Brien")         // true
namePattern.test("Müller-Schmidt")  // true
namePattern.test("123")             // false
namePattern.test("<script>")        // false

Numeric input (any script's digits)

// Match digits from any Unicode script
const anyDigits = /^\p{Nd}+$/u;

anyDigits.test("12345")    // true — ASCII
anyDigits.test("١٢٣٤٥")   // true — Arabic-Indic
anyDigits.test("१२३")      // true — Devanagari
anyDigits.test("abc")      // false

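// Convert non-ASCII digits to ASCII before numeric parsing.
// Nd digits are encoded in contiguous runs of 0-9, so walk back to the run's zero digit:
function normalizeDigits(str) {
  return str.replace(/\p{Nd}/gu, d => {
    const cp = d.codePointAt(0);
    let zero = cp;
    while (/\p{Nd}/u.test(String.fromCodePoint(zero - 1))) zero--;
    return String((cp - zero) % 10);
  });
}
normalizeDigits("١٢٣")  // "123"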

Detecting script mixing (homograph detection)

// Check if a string mixes scripts — potential homograph attack indicator
function detectScripts(str) {
  const scripts = new Set();
  for (const char of str) {
    for (const script of ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Han']) {
      if (new RegExp(`\\p{Script=${script}}`, 'u').test(char)) {
        scripts.add(script);
        break;
      }
    }
  }
  return [...scripts];
}

detectScripts("example.com")  // ["Latin"]
detectScripts("еxample.com")  // ["Cyrillic", "Latin"] — first е is Cyrillic!
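
A variation on the same idea, sketched under the assumption that only a fixed list of scripts matters: compile each script pattern once and test the whole label instead of each character. SCRIPTS and isMixedScript are illustrative names, and the script list is not exhaustive.

```javascript
// Scripts to check; extend this list for your own audience.
const SCRIPTS = ["Latin", "Cyrillic", "Greek", "Arabic", "Han"];
const SCRIPT_TESTS = SCRIPTS.map((name) => [name, new RegExp(`\\p{Script=${name}}`, "u")]);

// True when one label (for example a single domain label) uses more than one listed script.
function isMixedScript(label) {
  const seen = SCRIPT_TESTS.filter(([, re]) => re.test(label));
  return seen.length > 1;
}

console.log(isMixedScript("example"));      // false
console.log(isMixedScript("\u0435xample")); // true: U+0435 is Cyrillic е
console.log(isMixedScript("пример"));       // false: all Cyrillic
console.log(isMixedScript("example-123"));  // false: digits and hyphen are Script=Common
```

Characters with Script=Common, such as digits and punctuation, never count toward the total, so ordinary labels with numbers are not flagged.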

Common Pitfalls

Pitfall 1: Forgetting the u flag in JavaScript

// This silently gives wrong results for non-BMP characters:
/^.$/.test("😀")    // false — without u, . matches one UTF-16 unit
/^.$/u.test("😀")   // true — with u, . matches one code point

Pitfall 2: Length vs. visual character count

// String.length counts UTF-16 code units, not code points, not graphemes:
"😀".length   // 2 (surrogate pair)
"é".length    // can be 1 or 2 depending on normalization form

// For character counting: use Intl.Segmenter or spread operator:
[..."😀"].length   // 1 — code points

Pitfall 3: \w does not match all Unicode letters in JavaScript

// In JavaScript, \w is ALWAYS [a-zA-Z0-9_] even with the u flag:
/\w/u.test("é")   // false — \w doesn't expand to Unicode letters
/\p{L}/u.test("é")  // true — use property escapes instead

# Python's \w DOES match Unicode letters:
re.match(r'\w', 'é')   # matches: Python's \w is Unicode-aware for str patterns

Pitfall 4: Combining characters and string length

// "é" can be one code point (precomposed) or two (base + combining accent)
const precomposed = "é";      // U+00E9
const decomposed  = "e\u0301"; // U+0065 + U+0301

precomposed.length  // 1
decomposed.length   // 2
precomposed === decomposed  // false

// They look identical but are different strings — normalize before comparing:
precomposed.normalize('NFC') === decomposed.normalize('NFC')  // true
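
The same mismatch affects regex matching, since a pattern written with a precomposed character will not match decomposed input. A minimal sketch, with illustrative names:

```javascript
const input = "cafe\u0301";    // e + COMBINING ACUTE ACCENT (decomposed)
const pattern = /caf\u00E9/u;  // precomposed é in the pattern

const rawMatch = pattern.test(input);                   // false
const nfcMatch = pattern.test(input.normalize("NFC"));  // true

// Accent-insensitive comparison: decompose, then drop combining marks.
const stripMarks = (s) => s.normalize("NFD").replace(/\p{M}/gu, "");

console.log(rawMatch, nfcMatch);               // false true
console.log(stripMarks("r\u00E9sum\u00E9"));   // "resume"
```

Normalizing input once at the boundary (NFC is the usual choice) keeps every later pattern and comparison consistent.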

Pitfall 5: re.DOTALL and re.MULTILINE in Python

import re

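# Without re.DOTALL, . matches any character except \n
re.match(r'.+', 'line1\nline2')   # matches only 'line1'

# Python's . excludes only \n: it still matches \r, \u2028 and \u2029
re.match(r'.+', 'a\rb\u2028c')    # matches the whole string

# re.DOTALL makes . match any character, including \n:
re.match(r'.+', 'line1\nline2', re.DOTALL)  # matches all

# re.MULTILINE does not affect . at all; it only lets ^ and $ match at each \n
re.findall(r'^\w+$', 'line1\nline2', re.MULTILINE)  # ['line1', 'line2']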
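
JavaScript has the same pitfall with slightly different rules: the counterpart of re.DOTALL is the s flag, and by default the dot excludes four line terminators rather than one. A short sketch with illustrative variable names:

```javascript
const text = "line1\nline2\u2028line3";

// Default: . stops at \n, \r, U+2028 and U+2029.
const firstLine = /^.+/.exec(text)[0];    // "line1"
const beforeCR  = /^.+/.exec("a\rb")[0];  // "a"

// s (dotAll) flag: . matches every character.
const everything = /^.+/s.exec(text)[0];  // the whole string

// m (multiline) flag: only ^ and $ change; they match at each line terminator.
const lines = text.match(/^line\d$/gm);   // ["line1", "line2", "line3"]

console.log(firstLine, beforeCR, everything === text, lines);
```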

Putting It Together: Internationalized Slug Generator

// Generate URL slugs from any Unicode text
function slugify(text) {
  return text
    .normalize('NFC')                          // normalize combining chars
    .toLowerCase()
    .replace(/[\p{P}\p{S}]/gu, ' ')            // punctuation/symbols become spaces
    .trim()
    .replace(/\s+/g, '-')                      // spaces to hyphens
    .replace(/[^\p{L}\p{M}\p{N}\-]/gu, '')     // keep letters, marks, numbers, hyphens
    .replace(/-+/g, '-')                       // collapse multiple hyphens
    .replace(/^-|-$/g, '');                    // strip leading/trailing hyphens
}

slugify("Hello, World!")         // "hello-world"
slugify("Ça va? Très bien!")     // "ça-va-très-bien"
slugify("日本語のタイトル")          // "日本語のタイトル"
slugify("مرحبا بالعالم")         // "مرحبا-بالعالم"

Password and Input Validation Patterns

Unicode-aware regex opens up more precise validation rules:

// Password: require at least one Unicode letter and one digit
function isStrongPassword(password) {
  const hasLetter = /\p{L}/u.test(password);
  const hasDigit  = /\p{Nd}/u.test(password);
  const longEnough = password.length >= 8;
  return hasLetter && hasDigit && longEnough;
}

// Username: letters, numbers, dots, hyphens, underscores — any script
function isValidUsername(username) {
  return /^[\p{L}\p{N}][\p{L}\p{N}._\-]{1,29}$/u.test(username);
}
isValidUsername("alice")        // true
isValidUsername("山田_太郎")     // true
isValidUsername("123abc")       // true (a digit is allowed as the first character)
isValidUsername("_alice")       // false (must start with a letter or number)

// Phone number: allow digits from any numeral system
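function normalizePhone(input) {
  // Replace any Unicode decimal digit with its ASCII equivalent.
  // Nd digits are encoded in contiguous runs of 0-9, so walk back to the run's zero digit.
  const digits = input.replace(/\p{Nd}/gu, d => {
    const cp = d.codePointAt(0);
    let zero = cp;
    while (/\p{Nd}/u.test(String.fromCodePoint(zero - 1))) zero--;
    return String((cp - zero) % 10);
  });
  return digits.replace(/\D/g, '');  // strip non-digits
}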
normalizePhone("+1 (555) 123-٤٥٦٧")  // "15551234567"

Whitespace Detection

Unicode defines many more whitespace characters than ASCII:

// \s in JavaScript (with or without the u flag) matches:
// U+0009 TAB, U+000A LF, U+000B VT, U+000C FF, U+000D CR,
// U+0020 SPACE, U+00A0 NBSP, U+FEFF BOM, U+2028 LS, U+2029 PS,
// and the other Unicode Zs category spaces

// Check for any Unicode whitespace (including non-breaking, em space, etc.)
function hasUnicodeWhitespace(text) {
  return /[\p{Z}\t\r\n]/u.test(text);
}

// Strip all Unicode whitespace (not just ASCII)
function stripAllWhitespace(text) {
  return text.replace(/[\p{Z}\t\r\n]+/gu, '');
}

stripAllWhitespace("hello\u2003world\u00A0!")  // "helloworld!"
// \u2003 = EM SPACE, \u00A0 = NO-BREAK SPACE
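
JavaScript's \s and the Unicode White_Space property are close but not identical, which matters when trimming input from mixed sources. The sketch below shows the two characters where they disagree; EDGE_SPACE and trimUnicode are hypothetical names for a trim that accepts either definition.

```javascript
const NEL = "\u0085"; // NEXT LINE (C1 control): White_Space=Yes, not matched by \s
const BOM = "\uFEFF"; // ZERO WIDTH NO-BREAK SPACE: matched by \s, White_Space=No

console.log(/\s/u.test(NEL), /\p{White_Space}/u.test(NEL)); // false true
console.log(/\s/u.test(BOM), /\p{White_Space}/u.test(BOM)); // true false

// Trim anything that either definition treats as whitespace.
const EDGE_SPACE = /^[\s\p{White_Space}]+|[\s\p{White_Space}]+$/gu;
const trimUnicode = (text) => text.replace(EDGE_SPACE, "");

console.log(JSON.stringify(trimUnicode("\u0085\u2003 hello world\u00A0\uFEFF"))); // "hello world"
```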

Use the SymbolFYI Character Counter to inspect the Unicode properties of any character you are trying to match with regex — the property panel shows the General Category, Script, and Block for any input character.


Next in Series: JavaScript and Unicode: Strings, Code Points, and Grapheme Clusters — deep-diving into JavaScript's UTF-16 string model, surrogate pairs, and how to handle string operations correctly for all Unicode characters.
