CJK Unification: How Unicode Handles Chinese, Japanese, and Korean
- ○ 1. What Is Unicode? The Universal Character Standard Explained
- ○ 2. Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized
- ○ 3. Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared
- ○ 4. Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained
- ○ 5. Unicode Properties and Categories: Classifying Every Character
- ○ 6. Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist
- ○ 7. How Emoji Work in Unicode: From Code Points to Skin Tones
- ● 8. CJK Unification: How Unicode Handles Chinese, Japanese, and Korean
- ○ 9. Unicode Version History: From 1.0 to 16.0 and Beyond
- ○ 10. Unicode CLDR: The Database Behind Every Localized App
The single most controversial decision in Unicode's history is Han Unification — the choice to assign a single code point to ideographic characters that are semantically identical but may have different visual forms across Chinese, Japanese, and Korean writing systems. Understanding why this decision was made, what it means for developers, and how to work with it correctly is essential for anyone building multilingual applications that serve East Asian users.
Background: What Are CJK Ideographs?
Chinese, Japanese, and Korean writing systems all use logographic characters descended from ancient Chinese script. A single character represents a morpheme (a unit of meaning), not a sound.
- Chinese (Mandarin, Cantonese, etc.) uses ideographs as its primary script — Simplified Chinese (mainland China) or Traditional Chinese (Taiwan, Hong Kong, Macau)
- Japanese uses three scripts: Hiragana (syllabic, for native Japanese words), Katakana (syllabic, for loanwords), and Kanji (ideographs, borrowed from Chinese)
- Korean primarily uses Hangul (a phonetic alphabet), but still uses some Hanja (ideographs) in formal, scholarly, or older writing
These three writing systems share an enormous number of ideographs with common historical origins, the same meaning, and similar (but not always identical) visual appearance. As of Unicode 16.0, the Unihan database contains 97,058 unified CJK characters.
What Is Han Unification?
The core principle of Han Unification is: if two characters across different writing systems have the same semantic meaning and derivable common origin, they are assigned a single Unicode code point, even if their visual representation differs slightly by region or historical usage.
This is analogous to saying that the English letter "A", the serif "A", and the italic "A" are all the same character — they differ in presentation, but not in identity. Unicode assigns them all U+0041.
Consider the character meaning "grass/herbs":
| Language | Glyph | Local Standard | Unicode |
|---|---|---|---|
| Simplified Chinese | 草 | GB 18030 | U+8349 |
| Traditional Chinese | 草 | Big5 | U+8349 |
| Japanese | 草 | JIS | U+8349 |
| Korean | 草 | KS C 5601 | U+8349 |
All four representations are unified to U+8349. In most fonts and display contexts, they look nearly identical.
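This is easy to verify directly: however the character is labeled, the underlying code point — and therefore its UTF-8 encoding — is identical. Plain text carries no locale information at all:

```python
# The "grass" character is one code point regardless of source standard
zh_s = zh_t = ja = ko = '草'  # all four rows of the table above

assert zh_s == ja == ko           # one and the same string
assert ord(zh_s) == 0x8349        # single unified code point
# The UTF-8 bytes are likewise identical across locales
assert '草'.encode('utf-8') == b'\xe8\x8d\x89'
print(hex(ord('草')))  # 0x8349
```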
Where CJK Lives in Unicode
CJK characters are distributed across several Unicode blocks:
| Block | Range | Count | Notes |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,992 | Core block (BMP) |
| CJK Extension A | U+3400–U+4DBF | 6,592 | Rare characters (BMP) |
| CJK Compatibility Ideographs | U+F900–U+FAFF | 472 | Compatibility ideographs (BMP) |
| CJK Extension B | U+20000–U+2A6DF | 42,720 | Plane 2 (SIP) |
| CJK Extension C | U+2A700–U+2B73F | 4,149 | Plane 2 |
| CJK Extension D | U+2B740–U+2B81F | 222 | Plane 2 |
| CJK Extension E | U+2B820–U+2CEAF | 5,762 | Plane 2 |
| CJK Extension F | U+2CEB0–U+2EBEF | 7,473 | Plane 2 |
| CJK Extension G | U+30000–U+3134F | 4,939 | Plane 3 (TIP) |
| CJK Extension H | U+31350–U+323AF | 4,192 | Plane 3 |
| CJK Extension I | U+2EBF0–U+2EE5F | 622 | Plane 2 (Unicode 15.1) |
The core block (U+4E00–U+9FFF) contains the 20,992 characters needed for everyday Chinese, Japanese, and Korean text. The extension blocks add increasingly rare characters used in classical texts, historical documents, and personal names.
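The table above can be turned into a small lookup. This sketch hard-codes the ranges from the table (covering only the unified blocks listed, not the compatibility block) and reports which one a character falls in:

```python
# (start, end, name) triples taken from the block table above
CJK_BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x3400, 0x4DBF, "Extension A"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0x2A700, 0x2B73F, "Extension C"),
    (0x2B740, 0x2B81F, "Extension D"),
    (0x2B820, 0x2CEAF, "Extension E"),
    (0x2CEB0, 0x2EBEF, "Extension F"),
    (0x2EBF0, 0x2EE5F, "Extension I"),
    (0x30000, 0x3134F, "Extension G"),
    (0x31350, 0x323AF, "Extension H"),
]

def cjk_block(char):
    """Return the name of the CJK block containing char, or None."""
    cp = ord(char)
    for start, end, name in CJK_BLOCKS:
        if start <= cp <= end:
            return name
    return None

print(cjk_block('草'))          # CJK Unified Ideographs
print(cjk_block('\U00020000'))  # Extension B
print(cjk_block('A'))           # None — not a CJK ideograph
```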
The Unihan Database
The Unihan Database (Unicode Han Database) is the comprehensive repository of metadata for all unified CJK characters. It maps each code point to its properties in multiple East Asian standards:
- Mandarin pronunciation (Pinyin)
- Cantonese pronunciation (Jyutping)
- Japanese On-reading and Kun-reading
- Korean pronunciation
- Vietnamese pronunciation (Chữ Nôm)
- Semantic definition
- Stroke count and radical
- Frequency and usage data
- Source mappings (GB 18030, JIS, Big5, KS, etc.)
```python
# Access Unihan data via the 'unihan-etl' package
# pip install unihan-etl
# Or use the unicodedata module for basic properties
import unicodedata

char = '草'  # U+8349
print(unicodedata.name(char))      # CJK UNIFIED IDEOGRAPH-8349
print(unicodedata.category(char))  # Lo (Other Letter)
print(ord(char))                   # 33609
print(hex(ord(char)))              # 0x8349
```
For full Unihan data access:
The full property set (kMandarin, kJapaneseOn, and so on) lives in the raw Unihan data files, available at https://unicode.org/Public/UCD/latest/ucd/Unihan.zip; the unihan-etl package can parse them into structured form. A hand-rolled lookup over one of those files is also straightforward — the helper below is an illustrative sketch, not a library API:

```python
# Each line of a Unihan data file maps one code point to one property,
# tab-separated:  U+8349<TAB>kMandarin<TAB>cǎo
def unihan_fields(path, char):
    """Collect every Unihan property for one character from a data file."""
    cp = f"U+{ord(char):04X}"
    fields = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(cp + "\t"):
                _, key, value = line.rstrip("\n").split("\t", 2)
                fields[key] = value
    return fields

# fields = unihan_fields("Unihan_Readings.txt", '草')
# fields["kMandarin"]    # 'cǎo'
# fields["kJapaneseOn"]  # On-reading, e.g. SOU
```
Use our Unicode Lookup tool to look up any CJK character and see its Unihan properties directly.
The Controversy
Han Unification is deeply controversial, and the criticisms are legitimate:
Glyph Differences Are Significant
Some "unified" characters have visually distinct forms in different national standards. The character 骨 (bone) has one shape in traditional Chinese forms and a slightly different stroke arrangement in the standard Japanese rendering. To a typographer or calligrapher, these are different characters that merely happen to share the same Unicode code point.
Critics argue that Unicode traded precision for storage efficiency — sacrificing the ability to reliably reproduce locale-correct typography in plain text.
The Japanese User Experience
Japanese users are particularly affected. Japanese text typically uses a mix of Hiragana, Katakana, and Kanji. When a document renders Kanji with a Chinese-locale font (because of incorrect font fallback or a missing locale declaration), Japanese readers see characters in a Chinese style that looks "wrong" — similar to how using an Italian handwriting font for English text would look unusual even if technically readable.
Simplified vs. Traditional Chinese
Even within Chinese, Simplified and Traditional forms are distinct enough that the unification decision is debatable. The character 爱 (Simplified, "love") and 愛 (Traditional) are not unified — they are different enough to warrant different code points. But many other pairs that are arguably similarly distinct were unified.
The Solution: Language Tags and Font Selection
Unicode's answer to the controversy is that glyph selection is the job of the rendering layer, not the encoding layer. The code point identifies the character abstractly; the font and locale determine how it is rendered.
HTML lang Attribute
The most important tool is the HTML lang attribute:
```html
<!-- Chinese (Simplified) — uses simplified/mainland glyph forms -->
<html lang="zh-Hans">
<p>草 骨 角 直</p>
</html>

<!-- Chinese (Traditional) — uses traditional/HK/TW glyph forms -->
<html lang="zh-Hant">
<p>草 骨 角 直</p>
</html>

<!-- Japanese — uses Japanese glyph forms -->
<html lang="ja">
<p>草 骨 角 直</p>
</html>

<!-- Korean — uses Korean glyph forms (Hanja) -->
<html lang="ko">
<p>草 骨 角 直</p>
</html>

<!-- Mixed content: override locale for specific spans -->
<p lang="ja">
  The Japanese word
  <span lang="zh-Hans">直</span> <!-- Display in Chinese form -->
  versus the Japanese form 直.
</p>
```
CSS font-family and CJK Fonts
Browsers use the lang attribute to select appropriate font fallbacks. When you specify generic font families, the browser maps them to locale-appropriate fonts:
```css
/* Generic sans-serif on different platforms/locales:
   - Chinese: Noto Sans CJK SC (Simplified) or TC (Traditional)
   - Japanese: Hiragino Kaku Gothic (macOS), Meiryo (Windows)
   - Korean: Apple SD Gothic Neo (macOS), Malgun Gothic (Windows)
*/
body {
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI",
               "Hiragino Sans", "Noto Sans CJK JP", sans-serif;
}

.zh-hans { font-family: "Noto Sans CJK SC", "PingFang SC", sans-serif; }
.zh-hant { font-family: "Noto Sans CJK TC", "PingFang TC", sans-serif; }
.ja { font-family: "Noto Sans CJK JP", "Hiragino Kaku Gothic ProN", sans-serif; }
.ko { font-family: "Noto Sans CJK KR", "Apple SD Gothic Neo", sans-serif; }
```
Unicode Variation Sequences for CJK
Unicode provides a mechanism for specifying specific glyph variants of CJK characters using Ideographic Variation Sequences (IVS). An IVS consists of a CJK character followed by a Variation Selector from the range U+E0100–U+E01EF.
The Ideographic Variation Database (IVD) registers collections that define which variants are available:
- Adobe-Japan1: Variants for Japanese printing and publishing
- Moji_Joho: Japanese government character standardization
- Hanyo-Denshi: Japanese electronic distribution variants
```python
# Variation Selector example (conceptual)
# 辻 (U+8FBB) has two forms in Japanese, differing in the 辶 (shinnyō)
# radical: a one-dot form and a two-dot form
base = '\u8FBB'        # 辻 base character
vs17 = '\U000E0100'    # Variation Selector 17 (start of the IVD range)
standard = base        # Standard form
variant = base + vs17  # Variant form (requires a supporting font)
print(len(standard))   # 1
print(len(variant))    # 2 (base + variation selector)
```
IVS support requires specialized fonts and is used primarily in publishing workflows, not general web development.
CJK Radicals and Stroke Data
CJK characters are traditionally organized by radicals — root components used in dictionary indexing. The Unihan database maps each character to its Kangxi radical (there are 214 of them) and its stroke count within that radical.
Unicode also includes dedicated blocks for:
- Kangxi Radicals (U+2F00–U+2FDF): The 214 traditional dictionary radicals
- CJK Radicals Supplement (U+2E80–U+2EFF): Additional radical forms
- CJK Strokes (U+31C0–U+31EF): Individual stroke components
These are primarily used in dictionary applications and character learning tools, not in normal text rendering.
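One observable property of the Kangxi Radicals block: its characters are compatibility-equivalent to the corresponding unified ideographs, so NFKC normalization folds them together — which is why radical characters should not be used as substitutes for ideographs in searchable text:

```python
import unicodedata

radical = '\u2F00'    # ⼀ KANGXI RADICAL ONE
ideograph = '\u4E00'  # 一 CJK UNIFIED IDEOGRAPH-4E00

# Visually near-identical, but distinct code points...
assert radical != ideograph
# ...until NFKC applies the compatibility decomposition
assert unicodedata.normalize('NFKC', radical) == ideograph
print(unicodedata.name(radical))  # KANGXI RADICAL ONE
```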
Searching and Collation Across CJK
One of the practical challenges Han Unification creates is collation (sorting). The "natural" order of CJK characters depends heavily on context:
- Stroke count order: Traditional Chinese dictionaries sort by number of strokes, then by radical
- Pinyin order: Simplified Chinese often sorts by pronunciation (Pinyin romanization)
- Reading order: Japanese sorts by Kana reading (On or Kun)
- Unicode code point order: Only useful for consistent machine sorting; carries no linguistic meaning
```python
# Python's locale-aware sorting for Chinese
# Requires the appropriate locale to be installed on the system
import locale

words_zh = ['北京', '上海', '广州', '成都', '重庆']

# Stroke-count based sorting would require Unihan data;
# in practice, use ICU (via PyICU) for correct CJK collation
try:
    locale.setlocale(locale.LC_ALL, 'zh_CN.UTF-8')
    sorted_words = sorted(words_zh, key=locale.strxfrm)
    print(sorted_words)
except locale.Error:
    print("Locale not available — use PyICU for CJK collation")
```
For robust CJK collation in production, use:
- Python: pyicu (Python bindings to the ICU library)
- JavaScript: Intl.Collator with the appropriate locale
- Java: java.text.Collator with Chinese/Japanese locale
```javascript
// Correct CJK collation with Intl.Collator
const words = ['北京', '上海', '广州', '成都'];

// Pinyin order (Simplified Chinese)
const pinyin = new Intl.Collator('zh-Hans', { sensitivity: 'base' });
console.log(words.sort((a, b) => pinyin.compare(a, b)));
// ['北京', '成都', '广州', '上海'] (approximate Pinyin order)

// Japanese kanji by reading
const kanji = ['東京', '大阪', '京都', '名古屋'];
const japanese = new Intl.Collator('ja');
console.log(kanji.sort((a, b) => japanese.compare(a, b)));
```
Japanese-Specific Considerations
Japanese text processing has additional complexities beyond Han Unification:
Three Scripts in One Text
Modern Japanese text freely mixes all three scripts:
私はJavaScriptが大好きです。
(I love JavaScript very much.)
- 私 — Kanji (I)
- は — Hiragana (topic marker particle)
- JavaScript — Latin script
- が — Hiragana (subject marker particle)
- 大好き — Kanji + Hiragana (love/like)
- です — Hiragana (polite copula)
- 。 — Japanese period (U+3002)
Detecting word boundaries in Japanese requires morphological analysis because there are no spaces. Libraries like MeCab (Python: mecab-python3) perform this segmentation.
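Full segmentation needs a morphological analyzer, but the lighter task of tagging each character's script can be done with the standard library alone. This sketch classifies by code-point range — the ranges cover only the common Hiragana, Katakana, and unified-ideograph blocks, an assumption that ignores rarer blocks:

```python
def script_of(ch):
    """Rough script classification for mixed Japanese text."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return 'Hiragana'
    if 0x30A0 <= cp <= 0x30FF:
        return 'Katakana'
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return 'Kanji'
    if ch.isascii() and ch.isalpha():
        return 'Latin'
    return 'Other'   # punctuation like 。, digits, etc.

for ch in '私はJavaScriptが大好きです。':
    print(ch, script_of(ch))
# 私 → Kanji, は → Hiragana, J/a/v/a… → Latin, 。 → Other
```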
Halfwidth and Fullwidth Forms
Japanese typography distinguishes between:
- Fullwidth (全角): Characters in a square em box — Ａ (U+FF21)
- Halfwidth (半角): Narrower variants — ｦ (U+FF66, halfwidth Katakana)
The Halfwidth and Fullwidth Forms block (U+FF00–U+FFEF) contains compatibility forms. NFKC normalization maps fullwidth ASCII (Ａ) to standard ASCII (A). See Unicode Normalization for details.
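This folding is easy to see with unicodedata: NFKC turns fullwidth Latin into ASCII, and halfwidth Katakana into its standard fullwidth form:

```python
import unicodedata

fullwidth_a = '\uFF21'   # Ａ FULLWIDTH LATIN CAPITAL LETTER A
halfwidth_wo = '\uFF66'  # ｦ HALFWIDTH KATAKANA LETTER WO

# Fullwidth Latin folds down to plain ASCII
assert unicodedata.normalize('NFKC', fullwidth_a) == 'A'
# Halfwidth Katakana folds up to the standard fullwidth form
assert unicodedata.normalize('NFKC', halfwidth_wo) == '\u30F2'  # ヲ

print(unicodedata.normalize('NFKC', 'ＡＢＣ１２３'))  # ABC123
```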
Chinese Text: Simplified vs. Traditional
For web applications serving both mainland Chinese and Taiwanese/Hong Kong users, the Simplified/Traditional distinction is critical:
```html
<!-- Specify precisely to ensure correct font and glyph selection -->
<html lang="zh-Hans"> <!-- Simplified Chinese -->
<html lang="zh-Hant"> <!-- Traditional Chinese -->
<html lang="zh-HK">   <!-- Chinese as used in Hong Kong -->
<html lang="zh-TW">   <!-- Chinese as used in Taiwan -->
```
Many characters are distinct between Simplified and Traditional (encoded as separate Unicode code points), while others share a code point and rely on the lang attribute for correct font rendering.
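Both situations are visible in code: 爱/愛 ("love") are separate code points, while 草 is a single unified code point whose appearance depends entirely on the font chosen:

```python
# "love": Simplified and Traditional are encoded separately
assert ord('爱') == 0x7231  # Simplified
assert ord('愛') == 0x611B  # Traditional
assert '爱' != '愛'          # plain-text comparison sees two characters

# "grass": one unified code point for every locale —
# rendering differences are font-level only
assert ord('草') == 0x8349
```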
Converting between Simplified and Traditional requires more than character substitution — some words differ entirely, and ambiguous cases require context. Libraries like opencc (Open Chinese Convert) handle this:
```python
# pip install opencc-python-reimplemented
import opencc

converter_s2t = opencc.OpenCC('s2t')  # Simplified to Traditional
converter_t2s = opencc.OpenCC('t2s')  # Traditional to Simplified

simplified = "我爱你"
traditional = converter_s2t.convert(simplified)
print(traditional)  # 我愛你
```
Summary
CJK Unification is one of the most consequential and contested decisions in the Unicode standard. The key points for developers:
- 97,058 CJK ideographs are unified in Unicode across Chinese, Japanese, and Korean
- One code point, multiple glyphs: The same U+XXXX may display differently in Chinese vs. Japanese fonts
- Language tags are essential: Always set lang="zh-Hans", lang="ja", or lang="ko" on HTML elements containing CJK text
- Font selection follows locale: Browsers use the lang attribute to choose appropriate CJK fonts
- Collation is locale-specific: Use Intl.Collator (JavaScript) or ICU (Python/Java) for CJK sorting
- Simplified vs. Traditional are distinct: Many characters have separate code points; others share one and rely on font/locale
The Unihan database, accessible via our Unicode Lookup tool, provides comprehensive data on all CJK characters including their properties in each national standard.
Next in Series: Unicode Version History: From 1.0 to 16.0 and Beyond — A complete timeline of Unicode's growth from 7,161 characters in 1991 to over 154,000 today.