What Is the Unihan Database?
Unihan (short for Unicode Han) is the Unicode Consortium's comprehensive database of information about the CJK (Chinese, Japanese, Korean) unified ideographs encoded in Unicode. The database is formally known as the Unicode Han Database and is published as part of the Unicode Character Database (UCD). It provides detailed linguistic, phonetic, semantic, and bibliographic data about each of the approximately 100,000 Han ideographs currently in Unicode.
The Unihan database is one of the largest and most complex components of the UCD, reflecting the enormous depth and historical breadth of Chinese character scholarship.
Database Structure
Unihan data is distributed as a set of tab-separated .txt files, collectively available as Unihan.zip from the Unicode website. As of Unicode 16.0, the database defines roughly 100 properties, organized into several thematic files:
| File | Contents |
|---|---|
| Unihan_Readings.txt | Pronunciations in Mandarin (Pinyin), Cantonese, Japanese On/Kun, Korean, and Vietnamese, plus English definitions (kDefinition) |
| Unihan_IRGSources.txt | Source references for Han unification (G = China, J = Japan, K = Korea, etc.) |
| Unihan_RadicalStrokeCounts.txt | KangXi radical number and additional stroke count |
| Unihan_DictionaryLikeData.txt | Frequency ranks, grade levels, stroke order |
| Unihan_OtherMappings.txt | Mappings to legacy standards such as GB 2312, Big5, JIS X 0208, and KS X 1001 |
| Unihan_Variants.txt | Semantic and compatibility variant relationships |
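Every data file shares the same three-column, tab-separated layout: a `U+`-prefixed code point, a property name, and a value. As a small sketch, a helper (the name `to_ucode` is hypothetical) can build the code-point key used in those files:

```python
# Each Unihan line is: code point <TAB> property <TAB> value, e.g.
#   U+4E2D	kMandarin	zhōng
# Hypothetical helper: build the U+-prefixed key for a character.
def to_ucode(char: str) -> str:
    # Unihan zero-pads code points to at least four hex digits
    return f"U+{ord(char):04X}"

print(to_ucode('中'))   # U+4E2D
print(to_ucode('𠀀'))  # U+20000 (supplementary plane)
```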
Key Properties
Readings
- kMandarin: Pinyin pronunciation (e.g., zhōng for 中)
- kCantonese: Jyutping pronunciation
- kJapaneseOn: Japanese on'yomi (Chinese-derived reading)
- kJapaneseKun: Japanese kun'yomi (native Japanese reading)
- kKorean: Korean romanization
- kVietnamese: Vietnamese reading
Structure
- kTotalStrokes: Total stroke count
- kRSUnicode: KangXi radical plus residual stroke count
- kRSKangXi: Classical radical-stroke index
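kRSUnicode values encode the radical and residual strokes as `radical.residual`, with an apostrophe after the radical number marking a simplified form of the radical. A minimal sketch parser (the name `parse_rs` is an assumption, not part of Unihan):

```python
def parse_rs(value: str):
    # kRSUnicode looks like "2.3" (中: radical 2, 3 residual strokes)
    # or "120'.3"; the apostrophe flags a simplified radical form.
    radical, residual = value.split('.')
    simplified = radical.endswith("'")
    return int(radical.rstrip("'")), int(residual), simplified

print(parse_rs('2.3'))     # (2, 3, False)
print(parse_rs("120'.3"))  # (120, 3, True)
```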
Meaning
- kDefinition: English gloss/definition
- kFrequency: Rough frequency measure (1 = most frequent) based on corpus analysis
- kGradeLevel: Hong Kong school grade at which the character is introduced
Using Unihan Data in Code
```python
# Download Unihan.zip from https://www.unicode.org/Public/UCD/latest/ucd/
# and parse Unihan_Readings.txt
unihan_readings = {}

with open('Unihan_Readings.txt', encoding='utf-8') as f:
    for line in f:
        # Skip comments and blank lines
        if line.startswith('#') or not line.strip():
            continue
        code_str, prop, value = line.rstrip('\n').split('\t', 2)
        cp = int(code_str[2:], 16)  # Strip the 'U+' prefix
        char = chr(cp)
        unihan_readings.setdefault(char, {})[prop] = value

# Look up readings for 中
char = '中'
if char in unihan_readings:
    print(unihan_readings[char].get('kMandarin'))    # 'zhōng'
    print(unihan_readings[char].get('kJapaneseOn'))  # 'CHUU'
    print(unihan_readings[char].get('kKorean'))      # 'CWUNG'
```
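Once the file is parsed into a per-character dict like this, building reverse indexes is straightforward. A sketch inverting kMandarin readings into a pinyin-to-characters map, using a tiny hand-made sample dict in place of the full parsed database:

```python
from collections import defaultdict

# Tiny hand-made sample standing in for the full parsed database
sample = {
    '中': {'kMandarin': 'zhōng'},
    '终': {'kMandarin': 'zhōng'},
    '国': {'kMandarin': 'guó'},
}

# Invert: pinyin reading -> list of characters.
# kMandarin may hold space-separated values, hence the split().
by_pinyin = defaultdict(list)
for ch, props in sample.items():
    for reading in props.get('kMandarin', '').split():
        by_pinyin[reading].append(ch)

print(by_pinyin['zhōng'])  # ['中', '终']
```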
Unihan and Han Unification
The Unihan database is also the technical record of the Han Unification process — the controversial but pragmatic decision to encode semantically equivalent characters from Chinese, Japanese, and Korean as single Unicode code points. The kIRG_*Source properties (kIRG_GSource, kIRG_JSource, kIRG_KSource, etc.) trace each character back to its source standard (CNS 11643, GB 2312, JIS X 0208, KS X 1001, etc.), documenting the source glyph that justified each code point's inclusion.
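These source properties live in Unihan_IRGSources.txt and use the same three-column format as the other files, so they can be collected with the same parsing approach. A sketch, assuming a locally downloaded copy (the helper name `irg_sources` and the exact property list shown are illustrative, not exhaustive):

```python
# Sketch: gather IRG source references for one character from a local
# copy of Unihan_IRGSources.txt.
IRG_PROPS = {'kIRG_GSource', 'kIRG_JSource', 'kIRG_KSource',
             'kIRG_TSource', 'kIRG_VSource'}

def irg_sources(path: str, char: str) -> dict:
    target = f"U+{ord(char):04X}"
    sources = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Skip comments and blank lines
            if line.startswith('#') or not line.strip():
                continue
            code, prop, value = line.rstrip('\n').split('\t', 2)
            if code == target and prop in IRG_PROPS:
                sources[prop] = value
    return sources
```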