Unihan Database

Unicode Standard
Definition

A comprehensive database of CJK ideographs with readings, meanings, and variant information maintained by the Unicode Consortium.

What Is the Unihan Database?

Unihan (short for Unicode Han) is the Unicode Consortium's comprehensive database of information about the CJK (Chinese, Japanese, Korean) unified ideographs encoded in Unicode. The database is formally known as the Unicode Han Database and is published as part of the Unicode Character Database (UCD). It provides detailed linguistic, phonetic, semantic, and bibliographic data about each of the approximately 100,000 Han ideographs currently in Unicode.

The Unihan database is one of the largest and most complex components of the UCD, reflecting the enormous depth and historical breadth of Chinese character scholarship.

Database Structure

Unihan data is distributed as a set of tab-separated .txt files, collectively available as Unihan.zip from the Unicode website. As of Unicode 16.0, the database contains over 100 property types organized into several thematic files:

File                               Contents
Unihan_Readings.txt                Pronunciations in Mandarin (Pinyin), Cantonese, Japanese On/Kun, Korean, Vietnamese; English definitions (kDefinition)
Unihan_IRGSources.txt              Source references for Han unification (G=China, J=Japan, K=Korea, etc.), plus kRSUnicode and kTotalStrokes
Unihan_RadicalStrokeCounts.txt     Supplementary radical-stroke indexes (radical number plus residual stroke count)
Unihan_DictionaryIndices.txt       Positions of characters in printed dictionaries (e.g., kKangXi, kHanYu)
Unihan_DictionaryLikeData.txt      Frequency bands, grade levels, and lookup codes such as Cangjie and four-corner codes
Unihan_NumericValues.txt           Numeric values of characters used as numerals
Unihan_OtherMappings.txt           Mappings to GB2312, Big5, JIS, KS X 1001
Unihan_Variants.txt                Simplified, traditional, semantic, and other variant relationships
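
Because every file uses the same three-column line format, records can be read straight out of the archive without unpacking it. The following is a minimal sketch, assuming Unihan.zip has already been downloaded; the function name iter_unihan_records is hypothetical and not part of any Unicode tooling.

```python
import io
import zipfile


def iter_unihan_records(zip_path, member):
    """Yield (code point, property, value) tuples from one file in Unihan.zip.

    `zip_path` is a path or file-like object; `member` is a file name inside
    the archive, such as 'Unihan_Readings.txt'. Both names are illustrative.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                # Skip comment lines and blank lines.
                if line.startswith("#") or not line.strip():
                    continue
                # Each record is: U+XXXX <tab> property <tab> value
                code, prop, value = line.rstrip("\r\n").split("\t", 2)
                yield int(code[2:], 16), prop, value
```

A call such as iter_unihan_records('Unihan.zip', 'Unihan_Readings.txt') would then stream the readings file one record at a time, which keeps memory use low for the larger files.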

Key Properties

Readings

  • kMandarin: Pinyin pronunciation (e.g., zhōng for 中)
  • kCantonese: Jyutping pronunciation
  • kJapaneseOn: Japanese on'yomi (Chinese-derived reading)
  • kJapaneseKun: Japanese kun'yomi (native Japanese reading)
  • kKorean: Korean reading in Yale romanization (e.g., CWUNG for 中)
  • kVietnamese: Vietnamese reading
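
A reading field can hold several space-separated readings, so a reverse lookup from reading to characters has to split each value first. The sketch below assumes records shaped as (character, property, value) tuples; the helper name build_reading_index and the sample values are illustrative, not taken verbatim from a specific Unihan release.

```python
from collections import defaultdict


def build_reading_index(records, prop):
    """Map each reading stored under `prop` to the set of characters that carry it."""
    index = defaultdict(set)
    for char, name, value in records:
        if name != prop:
            continue
        # A single field may list several readings separated by spaces.
        for reading in value.split():
            index[reading].add(char)
    return index


# Illustrative sample records (assumed values, check the real file).
sample = [
    ("中", "kCantonese", "zung1 zung3"),
    ("忠", "kCantonese", "zung1"),
    ("中", "kMandarin", "zhōng"),
]
cantonese = build_reading_index(sample, "kCantonese")
print(sorted(cantonese["zung1"]))
```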

Structure

  • kTotalStrokes: Total stroke count
  • kRSUnicode: KangXi radical + residual stroke count
  • kRSKangXi: Classical radical-stroke index
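
Radical-stroke values pack two numbers into one string: the radical number, then a period, then the residual stroke count, as in 2.3 for 中 (radical 2 plus three strokes). The sketch below parses that shape under the assumption that apostrophes after the radical number mark a variant form of the radical, such as a simplified one; the names RadicalStroke and parse_rs are hypothetical.

```python
import re
from typing import NamedTuple


class RadicalStroke(NamedTuple):
    radical: int   # KangXi radical number
    marks: int     # count of apostrophes after the radical number
    residual: int  # strokes in addition to the radical (may be negative)


# Digits, optional apostrophes, a period, then an optionally negative count.
_RS_PATTERN = re.compile(r"^(\d{1,3})('*)\.(-?\d{1,2})$")


def parse_rs(value):
    """Parse one radical-stroke value such as '2.3' or "120'.3"."""
    match = _RS_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized radical-stroke value: {value!r}")
    radical, marks, residual = match.groups()
    return RadicalStroke(int(radical), len(marks), int(residual))


def parse_rs_field(field):
    """Parse a whole field, which may list several space-separated values."""
    return [parse_rs(part) for part in field.split()]
```

Keeping the apostrophe count as a plain integer avoids hard-coding an interpretation of each marker, which is best checked against the property description for the Unihan version in use.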

Meaning and Usage

  • kDefinition: English gloss/definition
  • kFrequency: Rough frequency band from 1 (most common) to 5 (least common), based on corpus analysis of traditional Chinese text
  • kGradeLevel: Hong Kong primary school grade by which a student is expected to know the character

Using Unihan Data in Code

# Download Unihan.zip from https://www.unicode.org/Public/UCD/latest/ucd/,
# extract it, and parse Unihan_Readings.txt.
from collections import defaultdict

# Maps each character to a dict of {property name: value}.
unihan_readings = defaultdict(dict)
with open('Unihan_Readings.txt', encoding='utf-8') as f:
    for line in f:
        # Skip comment lines and blank lines.
        if line.startswith('#') or not line.strip():
            continue
        # Each record is: U+XXXX <tab> property <tab> value
        code_str, prop, value = line.rstrip('\n').split('\t', 2)
        char = chr(int(code_str[2:], 16))  # Strip the 'U+' prefix
        unihan_readings[char][prop] = value

# Look up readings for 中 (exact values depend on the Unihan version)
char = '中'
if char in unihan_readings:
    print(unihan_readings[char].get('kMandarin'))   # e.g. zhōng
    print(unihan_readings[char].get('kJapaneseOn')) # e.g. CHUU
    print(unihan_readings[char].get('kKorean'))     # e.g. CWUNG

Unihan and Han Unification

The Unihan database is also the technical record of Han unification, the controversial but pragmatic decision to encode semantically equivalent characters from Chinese, Japanese, and Korean as single Unicode code points. The kIRG_* source properties in Unihan_IRGSources.txt (kIRG_GSource, kIRG_JSource, kIRG_KSource, and so on) trace each character back to its source standard (CNS 11643, GB 2312, JIS X 0208, KS X 1001, etc.), documenting the source glyph that justified each code point's inclusion.
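
Unihan_IRGSources.txt stores one property per contributing source, named kIRG_GSource, kIRG_JSource, kIRG_KSource, and so on. A rough sketch of collecting them for a single character follows; it assumes records shaped as (character, property, value) tuples, and both the helper name irg_sources and the sample source references are illustrative and should be checked against the real file.

```python
def irg_sources(records, char):
    """Return {source letter(s): source reference} for one character."""
    prefix, suffix = "kIRG_", "Source"
    sources = {}
    for c, prop, value in records:
        if c == char and prop.startswith(prefix) and prop.endswith(suffix):
            # 'kIRG_GSource' -> 'G', 'kIRG_JSource' -> 'J', ...
            sources[prop[len(prefix):-len(suffix)]] = value
    return sources


# Illustrative sample records (assumed values, not copied from a release).
sample = [
    ("中", "kIRG_GSource", "G0-5650"),
    ("中", "kIRG_JSource", "J0-4366"),
    ("中", "kTotalStrokes", "4"),
]
print(irg_sources(sample, "中"))
```

Properties in the same file that are not source references, such as kTotalStrokes, are filtered out by the prefix and suffix check.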
