What Is a Unicode Script?
The Script property in Unicode identifies which writing system or script a character belongs to. Unlike a Unicode block (which is defined purely by code point range), the Script property is a semantic assignment based on linguistic heritage. A single Unicode block may contain characters from multiple scripts, and a script's characters may be spread across multiple blocks.
As of Unicode 16.0, over 160 scripts are defined, ranging from widely used scripts like Latin, Arabic, and Han to historic scripts like Linear B, Phoenician, and Cuneiform.
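The difference between a block and a script can be checked directly. The sketch below tallies the Script property of every printable ASCII character (U+0020 to U+007E), all of which sit in the Basic Latin block. The helper name `countScriptsInRange` is made up for illustration and is not part of any standard API.

```javascript
// Hypothetical helper: count how many code points in [start, end]
// belong to each of the named scripts (first matching name wins).
function countScriptsInRange(start, end, scriptNames) {
  const matchers = scriptNames.map(
    (name) => [name, new RegExp(`\\p{Script=${name}}`, 'u')]
  );
  const counts = {};
  for (let cp = start; cp <= end; cp++) {
    const ch = String.fromCodePoint(cp);
    const hit = matchers.find(([, re]) => re.test(ch));
    const key = hit ? hit[0] : 'Other';
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// One block, two Script values: letters are Latin, everything else is Common.
const asciiCounts = countScriptsInRange(0x20, 0x7e, ['Latin', 'Common']);
console.log(asciiCounts); // { Common: 43, Latin: 52 }

// Conversely, one script spans several blocks:
// U+0151 (ő) sits in the Latin Extended-A block but is still Script=Latin.
console.log(/\p{Script=Latin}/u.test('\u0151')); // true
```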
Common Scripts
| Script | Description | Example Characters |
|---|---|---|
| Latin | Used by European and many world languages | A–Z, Ñ, Ü |
| Cyrillic | Russian, Bulgarian, Serbian, etc. | А, Б, В |
| Arabic | Arabic, Persian, Urdu (RTL) | ا, ب, ت |
| Han | Chinese, Japanese (Kanji), Korean (Hanja) | 中, 日, 韓 |
| Hangul | Korean syllables | 가, 나, 다 |
| Hiragana / Katakana | Japanese phonetic syllabaries | あ, ア |
| Devanagari | Hindi, Sanskrit, Marathi | अ, आ |
| Greek | Greek alphabet | α, β, Ω |
| Hebrew | Hebrew, Yiddish (RTL) | א, ב |
Common vs. Inherited Scripts
Two special script values deserve attention:
- Common (Zyyy): Characters shared across scripts, such as digits 0–9, punctuation, spaces, and most symbols. These characters do not belong to any single script.
- Inherited (Zinh): Characters that inherit their script from the preceding character, primarily combining marks and diacritics.
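These two values matter in practice because a strict single-script pattern rejects ordinary text that contains spaces, digits, punctuation, or combining marks. A minimal sketch, assuming a validator that wants "Latin text" rather than "Latin letters only" (the names `strictLatin` and `latinText` are illustrative):

```javascript
const isCommon = /^\p{Script=Common}$/u;
const isInherited = /^\p{Script=Inherited}$/u;

console.log(isCommon.test('7'));         // true: digit
console.log(isCommon.test(' '));         // true: space
console.log(isCommon.test('!'));         // true: punctuation
console.log(isInherited.test('\u0301')); // true: U+0301 COMBINING ACUTE ACCENT

// Letters only, versus letters plus Common and Inherited characters.
const strictLatin = /^\p{Script=Latin}+$/u;
const latinText = /^[\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]+$/u;

// "Café" written with a combining accent, plus spaces, a digit, and punctuation.
const sample = 'Cafe\u0301 au lait, 2 cups!';
console.log(strictLatin.test(sample));   // false
console.log(latinText.test(sample));     // true
console.log(latinText.test('Привет!'));  // false: Cyrillic letters
```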
Using Script Properties in Code
```javascript
// ES2018+ Unicode property escapes allow script-based matching (the u flag is required)
const isLatin = /^\p{Script=Latin}+$/u;
const isCyrillic = /^\p{Script=Cyrillic}+$/u;
const isHan = /^\p{Script=Han}+$/u;
const isArabic = /^\p{Script=Arabic}+$/u;

console.log(isLatin.test('Hello')); // true
console.log(isCyrillic.test('Привет')); // true
console.log(isHan.test('你好')); // true
console.log(isArabic.test('مرحبا')); // true

// Script_Extensions: some characters are used in multiple scripts
// Example: U+0951 (Devanagari stress sign) also used in other Indic scripts
const hasLatinExt = /\p{Script_Extensions=Latin}/u;
```
```python
# Python's regex module (not re) supports Unicode script properties
import regex  # pip install regex

pattern = regex.compile(r'^\p{Script=Latin}+$')
print(pattern.match('Hello'))   # a Match object
print(pattern.match('Привет'))  # None

# Check script of a single character
print(regex.match(r'\p{Script=Han}', '中'))  # a Match object
```
Script Detection and Security
Script detection is critical for security applications. IDN homograph attacks exploit visually similar characters from different scripts to create misleading domain names, for example by replacing the Latin a (U+0061) with the Cyrillic а (Cyrillic small letter a, U+0430). Modern browsers use script mixing rules to warn about or block such domains, often by showing the raw Punycode (xn--) form of the name instead of its Unicode form.
Unicode's Recommended Scripts list (UAX #31) and Identifier_Type data (UTS #39) help developers build safe identifier validators that detect suspicious script mixing.
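A very simple form of mixed-script detection can be sketched as follows. This is an illustration under stated assumptions, not a browser's actual policy: the `SCRIPTS` list is a small hand-picked subset, and `scriptsIn` and `isMixedScript` are hypothetical helper names.

```javascript
// Illustrative subset of scripts to test against; a real validator
// would cover every script it intends to support.
const SCRIPTS = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Hebrew', 'Han',
  'Hiragana', 'Katakana', 'Hangul', 'Devanagari'];
const SCRIPT_MATCHERS = SCRIPTS.map(
  (name) => [name, new RegExp(`^\\p{Script=${name}}$`, 'u')]
);
// Common and Inherited characters carry no script of their own, so skip them.
const NEUTRAL = /^[\p{Script=Common}\p{Script=Inherited}]$/u;

// Collect the set of scripts used by the characters of a label.
function scriptsIn(label) {
  const found = new Set();
  for (const ch of label) { // for...of iterates by code point
    if (NEUTRAL.test(ch)) continue;
    const hit = SCRIPT_MATCHERS.find(([, re]) => re.test(ch));
    found.add(hit ? hit[0] : 'Unknown');
  }
  return found;
}

function isMixedScript(label) {
  return scriptsIn(label).size > 1;
}

console.log(isMixedScript('paypal'));       // false
console.log(isMixedScript('p\u0430ypal'));  // true: U+0430 is Cyrillic
console.log([...scriptsIn('p\u0430ypal')]); // [ 'Latin', 'Cyrillic' ]
```

A check this blunt has known gaps. It flags legitimate combinations such as Japanese text, which mixes Han, Hiragana, and Katakana, and it misses spoofs written entirely in one look-alike script. The restriction levels described in UTS #39 are designed to handle cases like these.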
Script Extensions
The Script_Extensions property was added to handle characters that are legitimately used in multiple scripts. For example, U+0964 (Devanagari Danda, a phrase-ending mark comparable to a period) is used in many Indic scripts, so its Script value is Common while its Script_Extensions value lists Devanagari, Bengali, and the other scripts that use it. Because Script_Extensions lists all scripts that conventionally use a given character, it supports more accurate script detection than the single-value Script property.
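The danda example can be checked with property escapes. The sketch below compares the two properties for U+0964; the helper names `hasScript` and `usedByScript` are made up for illustration.

```javascript
// Hypothetical helpers: test the single-value Script property and the
// multi-value Script_Extensions property for one character.
function hasScript(ch, script) {
  return new RegExp(`^\\p{Script=${script}}$`, 'u').test(ch);
}
function usedByScript(ch, script) {
  return new RegExp(`^\\p{Script_Extensions=${script}}$`, 'u').test(ch);
}

const danda = '\u0964'; // U+0964 DEVANAGARI DANDA

console.log(hasScript(danda, 'Devanagari'));    // false: its Script is Common
console.log(hasScript(danda, 'Common'));        // true
console.log(usedByScript(danda, 'Devanagari')); // true
console.log(usedByScript(danda, 'Bengali'));    // true
console.log(usedByScript(danda, 'Latin'));      // false

// A character with no explicit Script_Extensions entry falls back
// to its Script value.
console.log(usedByScript('A', 'Latin'));        // true
```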