SymbolFYI

Unicode Block

Definition

A contiguous range of code points defined by the Unicode standard, grouping related characters (e.g., 'Arrows' block: U+2190–U+21FF).

What Is a Unicode Block?

A Unicode block is a named, contiguous, non-overlapping range of code points within the Unicode Standard. Blocks serve as an organizational tool, grouping characters that belong to the same script, symbol category, or historical/technical purpose. Every assigned code point belongs to exactly one block, and block boundaries are aligned to multiples of 16 (0x10): each block starts at a code point of the form U+nnn0 and ends at one of the form U+nnnF.

For example:

- Basic Latin spans U+0000 to U+007F (128 code points): the ASCII-compatible range.
- CJK Unified Ideographs spans U+4E00 to U+9FFF (20,992 code points): the core Han ideograph block.
- Emoticons spans U+1F600 to U+1F64F (80 code points): face emoji.
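The sizes and the alignment rule can be checked with a little arithmetic. The sketch below is only illustrative: `block_size` and `is_aligned` are hypothetical helper names, and the ranges are the three quoted in the examples.

```python
# Block ranges quoted in the examples: (name, first, last), both ends inclusive.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Emoticons", 0x1F600, 0x1F64F),
]


def block_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def is_aligned(first: int, last: int) -> bool:
    """True if the range starts on a multiple of 16 and ends just before one."""
    return first % 16 == 0 and (last + 1) % 16 == 0


for name, first, last in BLOCKS:
    size = block_size(first, last)
    print(f"{name}: {size:,} code points, aligned={is_aligned(first, last)}")
```

Each of the three ranges reports `aligned=True`, with sizes of 128, 20,992, and 80 code points.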

As of Unicode 16.0, there are 338 named blocks.

Block Structure and Naming

Blocks are defined in the Blocks.txt data file of the Unicode Character Database (UCD). The naming convention is descriptive and generally reflects the script or character category. Some blocks are densely populated (nearly every code point is assigned), while others are sparse: a code point that falls inside a block is not guaranteed to be assigned to a character.
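To see the difference between "inside a block" and "assigned", one rough approach is to count the code points whose general category is not `Cn` (unassigned). This is a sketch using only the standard library; `assigned_count` is a hypothetical helper, the Greek and Coptic range (U+0370-03FF) is an example not taken from the text above, and the counts depend on the Unicode version bundled with the running Python.

```python
import unicodedata


def assigned_count(first: int, last: int) -> int:
    """Count code points in [first, last] whose category is not 'Cn'.

    'Cn' is the general category reported for unassigned code points.
    """
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) != "Cn"
    )


# Greek and Coptic (U+0370-03FF) has reserved gaps, for example U+0378.
print(unicodedata.category("\u0378"))  # 'Cn'
print(assigned_count(0x0370, 0x03FF), "of", 0x03FF - 0x0370 + 1)

# Basic Latin is fully assigned; control characters have category 'Cc'.
print(assigned_count(0x0000, 0x007F))  # 128
```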

import unicodedata

# Python's unicodedata module does not expose block names directly,
# but you can query character properties
char = 'A'
print(unicodedata.name(char))      # 'LATIN CAPITAL LETTER A'
print(unicodedata.category(char))  # 'Lu' (Letter, uppercase)

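# Using the 'unicodeblock' third-party package:
# pip install unicodeblock
# (the exact format of the returned label may vary by package version)
import unicodeblock.blocks
print(unicodeblock.blocks.of('A'))           # e.g. 'BASIC LATIN'
print(unicodeblock.blocks.of('\U0001F600'))  # U+1F600 GRINNING FACE; e.g. 'EMOTICONS'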

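// JavaScript has no built-in block lookup. Regex Unicode property escapes
// (ES2018+) cover Script and General_Category, but not Block.
const isLatinScript = /^\p{Script=Latin}$/u;
console.log(isLatinScript.test('A')); // true (a script test, not a block test)
// For a block test, compare the code point against the block's range.
const isBasicLatin = (ch) => ch.codePointAt(0) <= 0x7f;
console.log(isBasicLatin('A')); // true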
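
Since neither standard library exposes blocks, another option is to build the lookup from Blocks.txt itself, whose data lines have the form `Start..End; Block Name`. The following is a minimal Python sketch under stated assumptions: it parses a short inline excerpt rather than the full file, and `parse_blocks` and `block_of` are hypothetical helper names. Code points outside every listed range are reported as `No_Block`, the value the UCD uses for that case.

```python
import bisect
import re

# A few lines in the Blocks.txt format; the real file lists every block.
SAMPLE_BLOCKS_TXT = """\
# Blocks.txt excerpt (format: Start..End; Block Name)
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
2190..21FF; Arrows
4E00..9FFF; CJK Unified Ideographs
1F600..1F64F; Emoticons
"""

LINE_RE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+);\s*(.+)$")


def parse_blocks(text):
    """Return a sorted list of (first, last, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = LINE_RE.match(line)
        if m:
            first, last = int(m.group(1), 16), int(m.group(2), 16)
            blocks.append((first, last, m.group(3).strip()))
    return sorted(blocks)


def block_of(ch, blocks):
    """Binary-search the block containing ch; 'No_Block' if none matches."""
    cp = ord(ch)
    starts = [first for first, _, _ in blocks]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return "No_Block"


BLOCKS = parse_blocks(SAMPLE_BLOCKS_TXT)
print(block_of("A", BLOCKS))           # Basic Latin
print(block_of("\u2192", BLOCKS))      # Arrows (U+2192 RIGHTWARDS ARROW)
print(block_of("\U0001F600", BLOCKS))  # Emoticons
print(block_of("\u0436", BLOCKS))      # No_Block (Cyrillic is not in this excerpt)
```

Because blocks never overlap, sorting by start and bisecting is enough; no interval tree is needed.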

Important Blocks for Web Developers

Text and Punctuation

Basic Latin (U+0000-007F): ASCII; the backbone of most web content.
Latin Extended-A/B (U+0100-024F): Accented and extended Latin letters.
General Punctuation (U+2000-206F): Em dashes, smart quotes, ellipsis, etc.

Symbols

Miscellaneous Symbols and Pictographs (U+1F300-1F5FF): Weather, nature, objects.
Supplemental Symbols and Pictographs (U+1F900-1F9FF): Newer emoji additions.
Mathematical Operators (U+2200-22FF): Math symbols like ∑, ∞, ≠.

CJK

CJK Unified Ideographs (U+4E00-9FFF): The core Han ideographs. The original repertoire was 20,902 characters; the block spans 20,992 code points.
CJK Unified Ideographs Extension A through I: Additional ideographs added in later Unicode versions.

Blocks vs. Scripts

Blocks and scripts are related but distinct Unicode properties. A block is defined purely by code point range and is a static organizational division. A script is a property assigned to each individual code point based on the writing system it belongs to. Multiple scripts can appear within the same block, and a single script (like Latin) can span multiple blocks. When doing language detection or text analysis, scripts are generally more useful than blocks.
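
The distinction can be illustrated roughly with the standard library. Python's `unicodedata` has no script lookup, so the sketch below uses a stand-in that is an assumption, not the real Script property: it treats a character as Latin if its name starts with `LATIN `. The helper name `latin_named` is hypothetical, and the split between Latin Extended-A and Latin Extended-B at U+0180 is a detail not stated above.

```python
import unicodedata

# Three blocks that all hold Latin letters.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin Extended-A": (0x0100, 0x017F),
    "Latin Extended-B": (0x0180, 0x024F),
}


def latin_named(first, last):
    """Count code points whose character name starts with 'LATIN '.

    Heuristic stand-in for Script=Latin; not the real Script property.
    """
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.name(chr(cp), "").startswith("LATIN ")
    )


for name, (first, last) in BLOCKS.items():
    total = last - first + 1
    print(f"{name}: {latin_named(first, last)} of {total} look Latin")
```

Basic Latin reports 52 of 128 (the letters A-Z and a-z), because digits, punctuation, and control characters share the block without being Latin letters, while Latin letters continue into the two extended blocks.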

Related Terms