What Is CJK?
CJK stands for Chinese, Japanese, and Korean — the three major East Asian writing systems that share a substantial set of logographic characters derived from classical Chinese writing. In Unicode, these shared characters are handled through a process called Han Unification, which merges characters that have the same semantic meaning and similar enough shapes into single code points, even though the exact glyph shapes may differ between Chinese, Japanese, and Korean typographic traditions.
The primary CJK character block in Unicode is CJK Unified Ideographs (U+4E00–U+9FFF). Its original repertoire held 20,902 characters, and later versions have filled the block to 20,992. It covers the ideographs needed for everyday literacy in all three languages.
Han Unification
Han unification was a deliberate and somewhat controversial design decision in Unicode 1.0. Rather than encoding the same character three times for Chinese, Japanese, and Korean, Unicode encodes it once. The correct regional glyph variant is then selected by the rendering system based on language tags, locale settings, or the font in use.
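For example, the character for "straight", 直, is encoded once at U+76F4, although Chinese and Japanese fonts conventionally draw it with subtly different stroke details. Well-designed fonts handle this through OpenType language features. Unification does not merge simplified and traditional forms, however: "country" is 国 (U+56FD) in Simplified Chinese and Japanese but 國 (U+570B) in Traditional Chinese, and these are two separate code points.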
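As a quick check, the sketch below prints code points for a few characters. It is only an illustration: the helper name `code_points` and the language-tagged sample list are made up for this example. It shows that simplified 国 and traditional 國 are distinct code points, while a unified character such as 直 has the same code point whatever language the surrounding text is in, so the language has to be carried separately from the text.

```python
def code_points(text):
    """Return the code point of every character in text, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Simplified and traditional "country" are two different code points.
print(code_points("国國"))  # ['U+56FD', 'U+570B']

# Hypothetical (language tag, text) pairs: the text itself is identical,
# so only the tag can tell a renderer which regional glyph to prefer.
samples = [("zh-Hans", "直"), ("ja", "直")]
for lang, text in samples:
    print(lang, code_points(text))  # both lines show ['U+76F4']
```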
CJK Blocks in Unicode
| Block | Range | Count | Notes |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00-9FFF | 20,992 | Core block (20,902 in the original repertoire) |
| CJK Extension A | U+3400-4DBF | 6,592 | Rare ideographs |
| CJK Extension B | U+20000-2A6DF | 42,720 | Supplementary plane |
| CJK Extension C–H | Various | 26,000+ | Rare/historic |
| CJK Compatibility Ideographs | U+F900-FAFF | 472 | Legacy compatibility (the block spans 512 code points) |
| Kangxi Radicals | U+2F00-2FDF | 214 | Radical index characters |
| Bopomofo | U+3100-312F | | Phonetic notation for Mandarin |
| Hiragana / Katakana | U+3040-30FF | | Japanese syllabaries |
| Hangul Syllables | U+AC00-D7A3 | 11,172 | Precomposed Korean |
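
The ranges in the table above translate directly into a lookup. The following is a minimal sketch, not an exhaustive classifier: the names `CJK_BLOCKS` and `cjk_block` are invented for this example, Extensions C and later are left out, and the combined kana row is split into Hiragana (U+3040-309F) and Katakana (U+30A0-30FF).

```python
# (first, last, block name) for the ranges listed in the table.
# Extensions C and later are omitted to keep the sketch short.
CJK_BLOCKS = [
    (0x2F00, 0x2FDF, "Kangxi Radicals"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x3400, 0x4DBF, "CJK Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B"),
]

def cjk_block(char):
    """Return the block name for a single character, or None if it is not listed."""
    cp = ord(char)
    for first, last, name in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None

for ch in "中あカ한A":
    print(ch, cjk_block(ch))
```

A linear scan is fine for nine ranges; a longer table would be better served by a sorted list and `bisect`.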
Working with CJK in Code
```python
import unicodedata

def is_cjk_unified_ideograph(char):
    """Check the core block and Extensions A and B (C and later are not covered)."""
    cp = ord(char)
    return (0x4E00 <= cp <= 0x9FFF or
            0x3400 <= cp <= 0x4DBF or
            0x20000 <= cp <= 0x2A6DF)

print(is_cjk_unified_ideograph('中'))  # True (U+4E2D)
print(is_cjk_unified_ideograph('A'))   # False

# Character name reveals its Unicode designation
print(unicodedata.name('中'))  # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.name('日'))  # CJK UNIFIED IDEOGRAPH-65E5
```
```javascript
// Match any character in the Han script using a Unicode property escape
const cjkRegex = /\p{Script=Han}/gu;
const text = '中文Japanese한국어 mixed text';
const cjkChars = text.match(cjkRegex);
console.log(cjkChars); // ['中', '文'] (Hangul and Latin letters are not Han)
```
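Python's built-in `re` module does not support `\p{Script=Han}` (the third-party `regex` package does). A rough standard-library equivalent, sketched here with explicit ranges for the core block and Extensions A and B only, pulls out runs of ideographs. The name `HAN_RUN` is just for this example, and the ranges are narrower than the full Han script.

```python
import re

# Runs of ideographs from Extension A, the core block, and Extension B.
# This is an approximation: later extensions and compatibility ideographs are not included.
HAN_RUN = re.compile("[\u3400-\u4DBF\u4E00-\u9FFF\U00020000-\U0002A6DF]+")

text = "中文Japanese한국어 mixed text"
print(HAN_RUN.findall(text))  # ['中文']
```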
CJK Text Rendering Considerations
Font Selection
Correct CJK rendering requires locale-aware font selection. The same Unicode code point may need a different glyph variant depending on whether the content is Simplified Chinese, Traditional Chinese, Japanese, or Korean. On the web, the HTML lang attribute declares the content language, and CSS can act on it through the :lang() selector and the font-language-override property.
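When markup is generated programmatically, one way to preserve the language information is to wrap each run of text in an element that carries a lang attribute. The helper below is a hypothetical sketch (`tag_lang` is not part of any library) that uses only the standard `html` module. Whether the browser then picks the expected glyphs still depends on the fonts available.

```python
from html import escape

def tag_lang(text, lang):
    """Wrap text in a <span> whose lang attribute holds a language tag such as 'ja'."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

# The same code point, tagged for two different typographic traditions.
print(tag_lang("直", "ja"))       # <span lang="ja">直</span>
print(tag_lang("直", "zh-Hans"))  # <span lang="zh-Hans">直</span>
```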
Text Segmentation
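Chinese and Japanese are written without spaces between words, which makes word segmentation non-trivial. Korean does use spaces, but each spaced unit can bundle a stem with particles or endings, so it also benefits from morphological analysis. Libraries such as jieba (Python) for Chinese, MeCab for Japanese, and KoNLPy for Korean provide segmentation and morphological analysis, which is essential for search indexing and text processing.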
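To make the problem concrete, here is a toy segmenter based on forward maximum matching: at each position it takes the longest dictionary word that fits and falls back to a single character. This is a deliberately simple sketch with a made-up five-word vocabulary. It is not how jieba, MeCab, or KoNLPy work internally, and real systems rely on far larger dictionaries and statistical models.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching over a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical vocabulary, for illustration only.
VOCAB = {"我们", "喜欢", "中文", "文本", "处理"}
print(segment("我们喜欢中文文本处理", VOCAB))
# ['我们', '喜欢', '中文', '文本', '处理']
```

Greedy matching picks the wrong boundary whenever a longer dictionary word overlaps the intended split, which is one reason production segmenters score alternative segmentations instead of committing to the first match.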