SymbolFYI

CJK

Unicode Standard

Definition

Abbreviation for Chinese, Japanese, and Korean — refers to the unified set of ideographic characters shared across these writing systems.

What Is CJK?

CJK stands for Chinese, Japanese, and Korean — the three major East Asian writing systems that share a substantial set of logographic characters derived from classical Chinese writing. In Unicode, these shared characters are handled through a process called Han Unification, which merges characters that have the same semantic meaning and similar enough shapes into single code points, even though the exact glyph shapes may differ between Chinese, Japanese, and Korean typographic traditions.

The primary CJK character block in Unicode is CJK Unified Ideographs (U+4E00–U+9FFF). The block spans 20,992 code points; the 20,902 characters originally encoded at U+4E00–U+9FA5 are already enough to cover the ideographs needed for everyday literacy in all three languages.

Han Unification

Han unification was a deliberate and somewhat controversial design decision in Unicode 1.0. Rather than encoding the same character three times for Chinese, Japanese, and Korean, Unicode encodes it once. The correct regional glyph variant is then selected by the rendering system based on language tags, locale settings, or the font in use.

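For example, the character meaning "straight" (直) is encoded once at U+76F4, although its conventional printed form differs between mainland Chinese and Japanese typography. Unification does not merge simplified and traditional forms: simplified 国 (U+56FD, also the modern Japanese form) and traditional 國 (U+570B) have separate code points. Well-designed fonts select the regional glyph through OpenType language features such as `locl` (localized forms).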
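Whether two related characters share a code point can be checked directly. The following is a minimal sketch using only the Python standard library; the helper name `describe` and the sample characters are chosen here for illustration.

```python
import unicodedata

def describe(char):
    """Return the code point and formal Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

# Sample characters chosen for illustration.
for ch in ("国", "國", "直"):
    print(ch, describe(ch))

# Simplified and traditional "country" occupy different code points,
# so plain text can distinguish them without any language tag.
same_code_point = ord("国") == ord("國")
print(same_code_point)
```

A character such as 直, by contrast, has a single code point for all regions, so any regional difference in its shape has to come from the font and language metadata rather than from the text itself.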

CJK Blocks in Unicode

Block	Range	Count	Notes
CJK Unified Ideographs	`U+4E00-9FFF`	20,902	Core block
CJK Extension A	`U+3400-4DBF`	6,592	Rare ideographs
CJK Extension B	`U+20000-2A6DF`	42,718	Supplementary plane
CJK Extension C–H	Various	50,000+	Rare/historic
CJK Compatibility Ideographs	`U+F900-FAFF`	512	Legacy compatibility
Kangxi Radicals	`U+2F00-2FDF`	214	Radical index characters
Bopomofo	`U+3100-312F`		Phonetic notation for Mandarin
Hiragana / Katakana	`U+3040-30FF`		Japanese syllabaries
Hangul Syllables	`U+AC00-D7A3`	11,172	Precomposed Korean
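The ranges in the table can be turned into a simple block lookup. The sketch below is one possible approach, not a complete classifier: Extensions C and later are left out, the names `BLOCKS`, `block_of`, and `block_size` are hypothetical, and the combined kana row is split at the actual block boundary (Hiragana U+3040–309F, Katakana U+30A0–30FF). Note that `block_size` counts code points in a range, which can be larger than the number of assigned characters.

```python
from bisect import bisect_right

# (start, end, name) for the ranges listed in the table; Extensions C and
# later are omitted in this sketch.
BLOCKS = sorted([
    (0x2F00, 0x2FDF, "Kangxi Radicals"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x3400, 0x4DBF, "CJK Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B"),
])
STARTS = [start for start, _, _ in BLOCKS]

def block_of(char):
    """Return the block name for char, or None if it is outside the table."""
    cp = ord(char)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None

def block_size(start, end):
    """Number of code points in an inclusive range (assigned or not)."""
    return end - start + 1

for ch in "中あア한A":
    print(ch, block_of(ch))
```

For instance, `block_size(0xAC00, 0xD7A3)` reproduces the 11,172 precomposed Hangul syllables listed in the table.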

Working with CJK in Code

import unicodedata

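def is_cjk_unified_ideograph(char):
    """Return True if char is in the core block, Extension A, or Extension B.

    Extensions C and later are not checked here.
    """
    cp = ord(char)
    return (0x4E00 <= cp <= 0x9FFF or    # CJK Unified Ideographs
            0x3400 <= cp <= 0x4DBF or    # Extension A
            0x20000 <= cp <= 0x2A6DF)    # Extension B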

print(is_cjk_unified_ideograph('中'))  # True (U+4E2D)
print(is_cjk_unified_ideograph('A'))   # False

# Character name reveals its Unicode designation
print(unicodedata.name('中'))  # 'CJK UNIFIED IDEOGRAPH-4E2D'
print(unicodedata.name('日'))  # 'CJK UNIFIED IDEOGRAPH-65E5'

// Match Han-script characters using a Unicode property escape (requires the u flag)
const cjkRegex = /\p{Script=Han}/gu;
const text = '中文Japanese한국어 mixed text';
const cjkChars = text.match(cjkRegex);
console.log(cjkChars);  // ['中', '文'] (Hangul is a separate script and is not matched)

CJK Text Rendering Considerations

Font Selection

Correct CJK rendering requires locale-aware font selection. The same Unicode code point may need a different glyph variant depending on whether the content is Simplified Chinese, Traditional Chinese, Japanese, or Korean. On the web, the HTML lang attribute declares the content language, and CSS can act on it through the :lang() selector and the font-language-override property.
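Because the code point alone does not say which regional glyph is wanted, the language has to travel with the text. As a minimal sketch, the hypothetical helper `tag_language` below wraps a string in an HTML span carrying a BCP 47 language tag, which a browser can then use when choosing fonts and glyph variants.

```python
from html import escape

def tag_language(text, lang):
    """Wrap text in a <span> carrying a BCP 47 language tag."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

# The same code point, tagged for four different locales.
for lang in ("zh-Hans", "zh-Hant", "ja", "ko"):
    print(tag_language("直", lang))
```

The four spans contain identical text; only the lang value differs, and that is the information the rendering system needs to pick a regional glyph.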

Text Segmentation

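Chinese and Japanese are written without spaces between words, making word segmentation non-trivial. Korean does use spaces, but they separate multi-morpheme units rather than individual morphemes, so Korean text usually still needs morphological analysis. Libraries such as jieba (Python) for Chinese, MeCab for Japanese, and KoNLPy (Python) for Korean provide segmentation or morphological analysis for word-boundary detection, which is essential for search indexing and text processing.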
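To illustrate why segmentation needs a dictionary or a model, here is a toy forward maximum matching segmenter. It is a deliberately simple stand-in and not how jieba, MeCab, or KoNLPy work internally; the function name, the `max_len` parameter, and the tiny vocabulary are all assumptions made for this sketch.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedily split text into the longest vocabulary words, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy vocabulary, for illustration only.
TOY_VOCAB = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
print(forward_max_match("我们喜欢自然语言处理", TOY_VOCAB))
```

Greedy matching like this fails on ambiguous strings where the longest match is not the intended word, which is one reason production segmenters rely on statistical models and much larger dictionaries.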
