SymbolFYI

Unihan Database

Unicode Standard

Definition

A comprehensive database of CJK ideographs with readings, meanings, and variant information maintained by the Unicode Consortium.

What Is the Unihan Database?

Unihan (short for Unicode Han) is the Unicode Consortium's comprehensive database of information about the CJK (Chinese, Japanese, Korean) unified ideographs encoded in Unicode. The database is formally known as the Unicode Han Database and is published as part of the Unicode Character Database (UCD). It provides detailed linguistic, phonetic, semantic, and bibliographic data about each of the approximately 100,000 Han ideographs currently in Unicode.

The Unihan database is one of the largest and most complex components of the UCD, reflecting the enormous depth and historical breadth of Chinese character scholarship.

Database Structure

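Unihan data is distributed as a set of tab-separated .txt files, collectively available as Unihan.zip from the Unicode website. Each data line holds three tab-separated fields: the code point (for example U+4E2D), the property name, and the value; lines beginning with # are comments. As of Unicode 16.0, the database contains on the order of a hundred property types organized into several thematic files: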

File	Contents
`Unihan_Readings.txt`	Pronunciations in Mandarin (Pinyin), Cantonese, Japanese On/Kun, Korean, Vietnamese; English definitions (kDefinition)
`Unihan_IRGSources.txt`	Source references for Han unification (G=China, J=Japan, K=Korea, etc.); also kRSUnicode and kTotalStrokes
`Unihan_RadicalStrokeCounts.txt`	Additional radical-stroke indices from specific sources
`Unihan_DictionaryLikeData.txt`	Frequency bands, grade levels, and lookup codes such as Cangjie and four-corner
`Unihan_DictionaryIndices.txt`	Positions of characters in printed dictionaries
`Unihan_NumericValues.txt`	Numeric values of ideographs used as numerals
`Unihan_OtherMappings.txt`	Mappings to GB2312, Big5, JIS, KS X 1001
`Unihan_Variants.txt`	Semantic, simplified/traditional, and other variant relationships
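Because every file shares the same three-column layout, the whole archive can be merged into a single lookup table. The following is a minimal sketch, assuming the archive members are named Unihan_*.txt as in the table above; the function name load_unihan is hypothetical and not part of any Unicode tooling.

```python
import io
import zipfile
from collections import defaultdict

def load_unihan(archive):
    """Merge every Unihan_*.txt member of a zip archive into one dict.

    `archive` may be a filesystem path or a binary file-like object.
    Returns {character: {property: value}}.
    """
    data = defaultdict(dict)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Only the data files are of interest (assumed naming scheme)
            if not (name.startswith("Unihan_") and name.endswith(".txt")):
                continue
            with zf.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.rstrip("\n")
                    # Skip blank lines and comments
                    if not line or line.startswith("#"):
                        continue
                    # Layout: U+XXXX <tab> property <tab> value
                    code, prop, value = line.split("\t", 2)
                    data[chr(int(code[2:], 16))][prop] = value
    return dict(data)
```

With a downloaded archive, load_unihan("Unihan.zip") would return one dictionary per character, so properties from different files can be read side by side.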

Key Properties

Readings

kMandarin: Pinyin pronunciation (e.g., zhōng for 中)
kCantonese: Jyutping pronunciation
kJapaneseOn: Japanese on'yomi (Chinese-derived reading)
kJapaneseKun: Japanese kun'yomi (native Japanese reading)
kKorean: Korean reading in Yale romanization
kVietnamese: Vietnamese reading
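A reading field can hold more than one pronunciation, separated by spaces. A minimal sketch of splitting them into lists follows; the helper name split_readings and the set of fields treated as multi-valued are assumptions made for this example, and the sample values are illustrative.

```python
# Fields assumed to hold space-separated alternatives (an assumption of this sketch)
MULTI_VALUED = {
    "kMandarin", "kCantonese", "kJapaneseOn",
    "kJapaneseKun", "kKorean", "kVietnamese",
}

def split_readings(props):
    """Return {property: [reading, ...]} for the reading fields present in props."""
    return {
        prop: value.split()
        for prop, value in props.items()
        if prop in MULTI_VALUED
    }

# Illustrative property dictionary for one character
sample = {
    "kJapaneseKun": "NAKA UCHI ATARU",
    "kMandarin": "zhōng",
    "kDefinition": "central; center, middle",
}
print(split_readings(sample))
```

Free-text fields such as kDefinition are deliberately left out, since splitting them on spaces would break the gloss apart.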

Structure

kTotalStrokes: Total stroke count
kRSUnicode: KangXi radical + residual stroke count
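kRSKangXi: Classical radical-stroke index from the KangXi dictionary (found in older Unihan releases; recent versions may no longer include it)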
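A radical-stroke value such as 2.3 packs two numbers into one string: the KangXi radical number before the dot and the residual stroke count after it. The sketch below splits such values; the pattern (optional apostrophes after the radical to mark a variant radical form, an optional minus sign on the stroke count) and the name parse_rs_unicode are assumptions for illustration, so check them against the Unihan documentation for your version.

```python
import re

# Assumed shape: radical number, optional apostrophes, ".", residual strokes
_RS_PATTERN = re.compile(r"^(\d{1,3})('*)\.(-?\d{1,2})$")

def parse_rs_unicode(value):
    """Split a kRSUnicode value into (radical, variant_marks, residual_strokes) tuples.

    A field may list several space-separated radical-stroke pairs.
    """
    results = []
    for item in value.split():
        match = _RS_PATTERN.match(item)
        if match is None:
            raise ValueError(f"unrecognized radical-stroke value: {item!r}")
        radical = int(match.group(1))
        variant_marks = len(match.group(2))  # number of apostrophes
        residual = int(match.group(3))
        results.append((radical, variant_marks, residual))
    return results

print(parse_rs_unicode("2.3"))     # [(2, 0, 3)]
print(parse_rs_unicode("120'.3"))  # [(120, 1, 3)]
```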

Meaning

kDefinition: English gloss/definition
kFrequency: Rough frequency band from 1 (most common) to 5 (least common), based on corpus analysis
kGradeLevel: Hong Kong primary school grade by which students are expected to know the character
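Numeric properties like kFrequency are stored as strings, so they need converting before comparison. A minimal sketch, assuming a {character: {property: value}} dictionary like the one built in the parsing example; the function name most_common and the sample values are hypothetical.

```python
def most_common(data, max_band=1):
    """Return characters whose kFrequency value is <= max_band, sorted by code point.

    Lower kFrequency values mean more frequent characters.
    Characters without a kFrequency entry are skipped.
    """
    selected = []
    for char, props in data.items():
        band = props.get("kFrequency")
        if band is not None and int(band) <= max_band:
            selected.append(char)
    return sorted(selected)

# Illustrative data only; real values come from the Unihan files
sample_data = {
    "中": {"kFrequency": "1", "kGradeLevel": "1"},
    "丙": {"kFrequency": "4"},
    "丼": {},
}
print(most_common(sample_data))              # ['中']
print(most_common(sample_data, max_band=5))  # ['中', '丙'], sorted by code point
```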

Using Unihan Data in Code

# Download Unihan.zip from https://www.unicode.org/Public/UCD/latest/ucd/
# and extract Unihan_Readings.txt into the working directory.
unihan_readings = {}
with open('Unihan_Readings.txt', encoding='utf-8') as f:
    for line in f:
        # Skip comment lines and blank lines
        if line.startswith('#') or not line.strip():
            continue
        # Each data line is: U+XXXX <tab> property <tab> value
        code_str, prop, value = line.rstrip('\n').split('\t', 2)
        char = chr(int(code_str[2:], 16))  # Strip 'U+' and convert from hex
        unihan_readings.setdefault(char, {})[prop] = value

# Look up readings for 中 (U+4E2D)
char = '中'
if char in unihan_readings:
    print(unihan_readings[char].get('kMandarin'))   # e.g. 'zhōng'
    print(unihan_readings[char].get('kJapaneseOn')) # e.g. 'CHUU'
    print(unihan_readings[char].get('kKorean'))     # e.g. 'CWUNG'

Unihan and Han Unification

The Unihan database is also the technical record of the Han Unification process: the controversial but pragmatic decision to encode semantically equivalent characters from Chinese, Japanese, and Korean as single Unicode code points. The IRG source properties (kIRG_GSource, kIRG_JSource, kIRG_KSource, and so on) trace each character back to its source standard (CNS 11643, GB 2312, JIS X 0208, KS X 1001, etc.), documenting the source glyph that justified each code point's inclusion.
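Source references are stored in one field per contributing region, with names of the form kIRG_GSource, kIRG_JSource, and so on. The sketch below gathers them for a single character from a property dictionary like the one built in the parsing example; the function name irg_sources is hypothetical and the sample values are illustrative.

```python
def irg_sources(props):
    """Return {source_letter: reference} for every kIRG_*Source field in props."""
    prefix, suffix = "kIRG_", "Source"
    sources = {}
    for key, value in props.items():
        if key.startswith(prefix) and key.endswith(suffix):
            # "kIRG_GSource" -> "G", "kIRG_JSource" -> "J"
            sources[key[len(prefix):-len(suffix)]] = value
    return sources

# Illustrative property dictionary for one character
sample_props = {
    "kIRG_GSource": "G0-5650",
    "kIRG_JSource": "J0-4366",
    "kMandarin": "zhōng",
}
print(irg_sources(sample_props))  # {'G': 'G0-5650', 'J': 'J0-4366'}
```

A character that appears under several source letters was contributed by more than one region and unified to the same code point.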
