SymbolFYI

Script

Unicode Standard
Definition

A Unicode property indicating which writing system a character belongs to (e.g., Latin, Greek, Common, Inherited).

What Is a Unicode Script?

The Script property in Unicode identifies which writing system or script a character belongs to. Unlike a Unicode block (which is defined purely by code point range), the Script property is a semantic assignment based on linguistic heritage. A single Unicode block may contain characters from multiple scripts, and a script's characters may be spread across multiple blocks.
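To make the block/script distinction concrete, the sketch below checks two characters that sit in the same block (Greek and Coptic, U+0370 to U+03FF) but carry different Script values, and then three Latin letters drawn from three different blocks. The helper `inGreekAndCopticBlock` is a hypothetical name introduced for this example; JavaScript regular expressions expose no block property, so the range is compared by hand.

```javascript
// Hypothetical helper: true if the first code point of `ch` lies in the
// Greek and Coptic block (U+0370..U+03FF).
function inGreekAndCopticBlock(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x0370 && cp <= 0x03ff;
}

const alpha = '\u03B1'; // GREEK SMALL LETTER ALPHA
const shei = '\u03E2';  // COPTIC CAPITAL LETTER SHEI

// Same block, different scripts.
const sameBlock = inGreekAndCopticBlock(alpha) && inGreekAndCopticBlock(shei);
const alphaIsGreek = /\p{Script=Greek}/u.test(alpha);
const sheiIsCoptic = /\p{Script=Coptic}/u.test(shei);
const sheiIsGreek = /\p{Script=Greek}/u.test(shei);

// One script, several blocks:
// Basic Latin, Latin-1 Supplement, Latin Extended-A.
const latinSamples = ['A', '\u00E9', '\u0100'];
const allLatin = latinSamples.every((ch) => /^\p{Script=Latin}$/u.test(ch));

console.log({ sameBlock, alphaIsGreek, sheiIsCoptic, sheiIsGreek, allLatin });
// { sameBlock: true, alphaIsGreek: true, sheiIsCoptic: true,
//   sheiIsGreek: false, allLatin: true }
```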

As of Unicode 16.0, 168 scripts are defined, ranging from widely used scripts like Latin, Arabic, and Han to historic scripts like Linear B, Phoenician, and Cuneiform.

Common Scripts

| Script | Description | Example Characters |
|---|---|---|
| Latin | Used by European and many world languages | A–Z, Ñ, Ü |
| Cyrillic | Russian, Bulgarian, Serbian, etc. | А, Б, В |
| Arabic | Arabic, Persian, Urdu (RTL) | ا, ب, ت |
| Han | Chinese, Japanese (Kanji), Korean (Hanja) | 中, 日, 韓 |
| Hangul | Korean syllables | 가, 나, 다 |
| Hiragana / Katakana | Japanese phonetic syllabaries | あ, ア |
| Devanagari | Hindi, Sanskrit, Marathi | अ, आ |
| Greek | Greek alphabet | α, β, Ω |
| Hebrew | Hebrew, Yiddish (RTL) | א, ב |
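As a rough illustration, the scripts listed above can be probed one by one to classify a single character. This is a minimal sketch: `CANDIDATE_SCRIPTS`, `SCRIPT_TESTERS`, and `scriptOf` are names invented for the example, and the candidate list is an assumption that covers only the scripts in the table, not all of Unicode.

```javascript
// Scripts to probe, in order. This short list is an assumption for the
// example; a real classifier would cover every script it cares about.
const CANDIDATE_SCRIPTS = [
  'Latin', 'Cyrillic', 'Arabic', 'Han', 'Hangul',
  'Hiragana', 'Katakana', 'Devanagari', 'Greek', 'Hebrew',
];

// Build one anchored regex per script up front.
const SCRIPT_TESTERS = CANDIDATE_SCRIPTS.map((name) => ({
  name,
  regex: new RegExp(`^\\p{Script=${name}}$`, 'u'),
}));

// Hypothetical helper: returns the script name of a single character,
// or null if the character is in none of the candidate scripts.
function scriptOf(ch) {
  const hit = SCRIPT_TESTERS.find(({ regex }) => regex.test(ch));
  return hit ? hit.name : null;
}

console.log(scriptOf('\u4E2D')); // 中 → "Han"
console.log(scriptOf('\uAC00')); // 가 → "Hangul"
console.log(scriptOf('\u3042')); // あ → "Hiragana"
console.log(scriptOf('\u30A2')); // ア → "Katakana"
console.log(scriptOf('5'));      // digit is Script=Common → null
```

Probing scripts one at a time is simple but slow for long inputs; it is meant to show the property values, not to serve as a production classifier.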

Common vs. Inherited Scripts

Two special script values deserve attention:

  • Common (Zyyy): Characters shared across scripts, such as digits 0–9, punctuation, spaces, and most symbols. These characters do not belong to any single script.
  • Inherited (Zinh): Characters that inherit their script from the preceding character — primarily combining marks and diacritics.
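These two values matter in practice because a strict single-script pattern rejects ordinary text that contains spaces, digits, punctuation, or combining marks. The sketch below shows the effect and one possible workaround, which is to accept Common and Inherited alongside the target script. The variable names are illustrative only.

```javascript
const isCommon = /^\p{Script=Common}$/u;
const isInherited = /^\p{Script=Inherited}$/u;

const digitIsCommon = isCommon.test('7');
const spaceIsCommon = isCommon.test(' ');
const acuteIsInherited = isInherited.test('\u0301'); // COMBINING ACUTE ACCENT

// Strict matcher: every code point must have Script=Latin.
const strictLatin = /^\p{Script=Latin}+$/u;

// Lenient matcher: also accept Common and Inherited code points.
const latinText = /^[\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]+$/u;

// "Café, 2 cups!" with the é written as e + combining acute accent.
const sample = 'Cafe\u0301, 2 cups!';

const strictResult = strictLatin.test(sample);
const lenientResult = latinText.test(sample);

console.log(digitIsCommon, spaceIsCommon, acuteIsInherited); // true true true
console.log(strictResult);  // false: accent, comma, space, digit, "!" are not Latin
console.log(lenientResult); // true
```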

Using Script Properties in Code

// ES2018+ Unicode property escapes allow script-based matching
const isLatin = /^[\p{Script=Latin}]+$/u;
const isCyrillic = /^[\p{Script=Cyrillic}]+$/u;
const isHan = /^[\p{Script=Han}]+$/u;
const isArabic = /^[\p{Script=Arabic}]+$/u;

console.log(isLatin.test('Hello'));    // true
console.log(isCyrillic.test('Привет')); // true
console.log(isHan.test('你好'));        // true
console.log(isArabic.test('مرحبا'));    // true

// Script_Extensions: some characters are used in multiple scripts
// Example: U+0951 (Devanagari stress sign udatta) is also used with other Indic scripts
// Matches any character whose Script_Extensions set includes Latin
const hasLatinExt = /\p{Script_Extensions=Latin}/u;

# Python's regex module (not re) supports Unicode script properties
import regex  # pip install regex

pattern = regex.compile(r'^\p{Script=Latin}+$')
print(pattern.match('Hello'))    # match
print(pattern.match('Привет'))   # None

# Check script of a single character
print(regex.match(r'\p{Script=Han}', '中'))  # match

Script Detection and Security

Script detection is critical for security applications. IDN homograph attacks exploit visually similar characters from different scripts to create misleading domain names, for example by replacing the Latin a (U+0061) with the Cyrillic а (Cyrillic small letter a, U+0430). Modern browsers apply script-mixing rules to such domains and typically display suspicious ones in their Punycode (xn--) form instead of rendering the deceptive characters.
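A very simplified version of such a mixing check can be sketched with property escapes. The names `SCRIPTS`, `scriptsIn`, and `isMixedScript` are hypothetical, and the three-script candidate list is an assumption chosen for the example; this is not the algorithm any particular browser uses.

```javascript
// Scripts checked by this sketch; the list is an assumption, not a
// complete inventory.
const SCRIPTS = ['Latin', 'Cyrillic', 'Greek'];
const TESTERS = SCRIPTS.map((name) => [
  name,
  new RegExp(`^\\p{Script=${name}}$`, 'u'),
]);

// Hypothetical helper: returns the set of candidate scripts found in `label`.
// Characters outside the candidate list (including Common and Inherited,
// such as digits and hyphens) are ignored.
function scriptsIn(label) {
  const found = new Set();
  for (const ch of label) { // for...of iterates by code point
    for (const [name, regex] of TESTERS) {
      if (regex.test(ch)) found.add(name);
    }
  }
  return found;
}

// A label is flagged when it draws letters from more than one script.
function isMixedScript(label) {
  return scriptsIn(label).size > 1;
}

const genuine = 'paypal';
const spoofed = 'p\u0430ypal'; // second letter is CYRILLIC SMALL LETTER A

console.log(isMixedScript(genuine)); // false
console.log(isMixedScript(spoofed)); // true
console.log([...scriptsIn(spoofed)]); // [ 'Latin', 'Cyrillic' ]
```

A check like this only catches mixed-script labels. A label written entirely in look-alike characters from one script passes it, so it should be treated as one signal among several rather than a complete defence.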

Unicode's Recommended Scripts list (UAX #31) and the Identifier_Type data (UTS #39) help developers build safe identifier validators that detect suspicious script mixing.

Script Extensions

The Script_Extensions property was added to handle characters that are legitimately used in multiple scripts. For example, U+0964 (Devanagari Danda, a sentence-ending punctuation mark) has the Script value Common but is used with many Indic scripts, including Devanagari, Bengali, and Tamil. Script_Extensions lists all scripts that conventionally use a given character, which allows more accurate script detection than the single-valued Script property.
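The difference between the two properties can be seen directly on the danda. The following sketch compares `Script` and `Script_Extensions` for U+0964 and then for a short Devanagari sentence that ends with it; the variable names are illustrative.

```javascript
const danda = '\u0964'; // DEVANAGARI DANDA

// The single-valued Script property says Common, not Devanagari.
const scriptIsCommon = /\p{Script=Common}/u.test(danda);
const scriptIsDevanagari = /\p{Script=Devanagari}/u.test(danda);

// Script_Extensions lists the scripts that actually use the character.
const scxHasDevanagari = /\p{Script_Extensions=Devanagari}/u.test(danda);
const scxHasBengali = /\p{Script_Extensions=Bengali}/u.test(danda);
const scxHasLatin = /\p{Script_Extensions=Latin}/u.test(danda);

console.log(scriptIsCommon, scriptIsDevanagari);            // true false
console.log(scxHasDevanagari, scxHasBengali, scxHasLatin);  // true true false

// "नमस्ते।" : a Devanagari word followed by a danda.
const sentence = '\u0928\u092E\u0938\u094D\u0924\u0947\u0964';

const matchesByScript = /^\p{Script=Devanagari}+$/u.test(sentence);
const matchesByScx = /^\p{Script_Extensions=Devanagari}+$/u.test(sentence);

console.log(matchesByScript); // false: the danda has Script=Common
console.log(matchesByScx);    // true
```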
