SymbolFYI

CJK

Unicode Standard
Definition

Abbreviation for Chinese, Japanese, and Korean. The term refers to the unified set of ideographic characters shared across these writing systems.

What Is CJK?

CJK stands for Chinese, Japanese, and Korean — the three major East Asian writing systems that share a substantial set of logographic characters derived from classical Chinese writing. In Unicode, these shared characters are handled through a process called Han Unification, which merges characters that have the same semantic meaning and similar enough shapes into single code points, even though the exact glyph shapes may differ between Chinese, Japanese, and Korean typographic traditions.

The primary CJK character block in Unicode is CJK Unified Ideographs (U+4E00–U+9FFF). Its original repertoire of 20,902 characters (U+4E00–U+9FA5) is enough to cover everyday literacy needs in all three languages; later Unicode versions added more characters to the block.

Han Unification

Han unification was a deliberate and somewhat controversial design decision in Unicode 1.0. Rather than encoding the same character three times for Chinese, Japanese, and Korean, Unicode encodes it once. The correct regional glyph variant is then selected by the rendering system based on language tags, locale settings, or the font in use.

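For example, characters such as 直 (U+76F4) and 骨 (U+9AA8) are each encoded once, even though their printed shapes differ subtly between Chinese and Japanese typographic conventions. Simplified and traditional forms, by contrast, are generally not unified: the character for "country" is U+56FD (国) in Simplified Chinese and Japanese but U+570B (國) in Traditional Chinese. Well-designed fonts handle regional glyph variants through OpenType language features such as locl.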
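
A quick way to see what is and is not unified is to print code points with Python's standard library. The following is a minimal sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Simplified/Japanese 国 and traditional 國 are separate code points
print(describe("国"))  # U+56FD CJK UNIFIED IDEOGRAPH-56FD
print(describe("國"))  # U+570B CJK UNIFIED IDEOGRAPH-570B

# 直 has one code point; which regional glyph appears is up to the font
print(describe("直"))  # U+76F4 CJK UNIFIED IDEOGRAPH-76F4
```

The code point alone carries no language information, which is why language tags or font choice are needed to pick the regional glyph.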

CJK Blocks in Unicode

| Block | Range | Count | Notes |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00-9FFF | 20,992 | Core block (20,902 in the original repertoire) |
| CJK Extension A | U+3400-4DBF | 6,592 | Rare ideographs |
| CJK Extension B | U+20000-2A6DF | 42,720 | Supplementary plane |
| CJK Extension C–H | Various | 26,000+ | Rare/historic |
| CJK Compatibility Ideographs | U+F900-FAFF | 472 | Legacy compatibility (512 code points in the block) |
| Kangxi Radicals | U+2F00-2FDF | 214 | Radical index characters |
| Bopomofo | U+3100-312F | | Phonetic notation for Mandarin |
| Hiragana / Katakana | U+3040-30FF | | Japanese syllabaries |
| Hangul Syllables | U+AC00-D7A3 | 11,172 | Precomposed Korean |
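
The ranges in the table above can be turned into a simple block lookup. This is a sketch only: `CJK_BLOCKS` and `cjk_block` are illustrative names, Extensions C–H are left out because the table does not list their ranges, and the Hiragana/Katakana row is split at U+309F/U+30A0, which is where the two Unicode blocks divide.

```python
from typing import Optional

# (first, last, block name), sorted by starting code point
CJK_BLOCKS = [
    (0x2F00, 0x2FDF, "Kangxi Radicals"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x3400, 0x4DBF, "CJK Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B"),
]

def cjk_block(char: str) -> Optional[str]:
    """Return the block name for a single character, or None if not listed."""
    cp = ord(char)
    for first, last, name in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(cjk_block("中"))  # CJK Unified Ideographs
print(cjk_block("あ"))  # Hiragana
print(cjk_block("한"))  # Hangul Syllables
print(cjk_block("A"))   # None
```

A linear scan is fine for a handful of ranges; for a full block list, a binary search over the sorted start points would be the usual choice.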

Working with CJK in Code

```python
import unicodedata

def is_cjk_unified_ideograph(char):
    """Return True for the core block and Extensions A and B (C-H are not checked)."""
    cp = ord(char)
    return (0x4E00 <= cp <= 0x9FFF or
            0x3400 <= cp <= 0x4DBF or
            0x20000 <= cp <= 0x2A6DF)

print(is_cjk_unified_ideograph('中'))  # True (U+4E2D)
print(is_cjk_unified_ideograph('A'))   # False

# Character name reveals its Unicode designation
print(unicodedata.name('中'))  # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.name('日'))  # CJK UNIFIED IDEOGRAPH-65E5
```

```javascript
// Match any Han-script character using a Unicode property escape
const cjkRegex = /\p{Script=Han}/gu;
const text = '中文Japanese한국어 mixed text';
const cjkChars = text.match(cjkRegex);
// Only the Han characters match; Latin letters and Hangul do not
console.log(cjkChars);  // ['中', '文']
```

CJK Text Rendering Considerations

Font Selection

Correct CJK rendering requires locale-aware font selection. The same Unicode code point may need a different glyph variant depending on whether the content is Simplified Chinese, Traditional Chinese, Japanese, or Korean. On the web, the HTML lang attribute (for example lang="ja" or lang="zh-Hans") tells the browser which variant to prefer, and CSS offers the :lang() selector and the font-language-override property for finer control.

Text Segmentation

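Chinese and Japanese text is written without spaces between words, making word segmentation non-trivial. Korean does use spaces, but they separate phrase-level units rather than individual morphemes, so morphological analysis is still needed. Libraries like jieba (Python) for Chinese, MeCab for Japanese, and KoNLPy for Korean provide segmentation and morphological analysis, which is essential for search indexing and text processing.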
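
To illustrate why segmentation is hard, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. It is not how jieba, MeCab, or KoNLPy work internally; the function name, the `max_len` parameter, and the toy vocabulary are all made up for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedily take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

vocab = {"北京", "大学", "北京大学", "学生"}
print(forward_max_match("北京大学生", vocab))  # ['北京大学', '生']
```

The greedy match picks 北京大学 ("Peking University") and strands 生, although 北京 / 大学生 ("Beijing college student") is another plausible reading. Ambiguities like this are why production segmenters rely on statistical models rather than a greedy dictionary lookup alone.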
