
Python unicodedata Module

Definition

Python standard library module for looking up Unicode character names, categories, and properties.

Unicode in Python

Python 3 made Unicode strings the default string type, eliminating Python 2's str-versus-unicode confusion. The str type represents a sequence of Unicode code points, and the language provides comprehensive support for encoding, decoding, and introspecting Unicode text through the standard library.

str Is Unicode

In Python 3, string literals are Unicode by default. The source file is interpreted as UTF-8 (PEP 3120):

# All valid Python 3 string literals
text = 'hello'
text = 'café'
text = '日本語'
text = '\u2603'       # SNOWMAN by escape
text = '\U0001F600'   # 😀 by full code point escape
text = '\N{SNOWMAN}'  # By Unicode character name

Encoding and Decoding

Conversion between str (Unicode) and bytes is explicit:

# str → bytes: encode
'café'.encode('utf-8')    # b'caf\xc3\xa9'
'café'.encode('latin-1')  # b'caf\xe9'
'日本語'.encode('utf-8')   # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

# bytes → str: decode
b'caf\xc3\xa9'.decode('utf-8')    # 'café'
b'caf\xe9'.decode('latin-1')      # 'café'

# Handling errors
'café'.encode('ascii')                         # UnicodeEncodeError
'café'.encode('ascii', errors='ignore')        # b'caf'
'café'.encode('ascii', errors='replace')       # b'caf?'
'café'.encode('ascii', errors='xmlcharrefreplace')  # b'caf&#233;'
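Error handlers work on the decode side as well. A short sketch, reusing the Latin-1 bytes from above, including the surrogateescape handler that round-trips undecodable bytes losslessly:

```python
# Decoding bytes that are not valid UTF-8
raw = b'caf\xe9'  # Latin-1 bytes; \xe9 is invalid as UTF-8 here

raw.decode('utf-8', errors='replace')  # 'caf\ufffd' (U+FFFD REPLACEMENT CHARACTER)
raw.decode('utf-8', errors='ignore')   # 'caf'

# surrogateescape smuggles unknown bytes through as lone surrogates,
# so encoding with the same handler restores the original bytes
text = raw.decode('utf-8', errors='surrogateescape')  # 'caf\udce9'
text.encode('utf-8', errors='surrogateescape')        # b'caf\xe9'
```

surrogateescape is what Python itself uses for OS data (filenames, environment variables) that may not be valid in the filesystem encoding.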

The unicodedata Module

import unicodedata

# Character name
unicodedata.name('☃')          # 'SNOWMAN'
unicodedata.name('é')          # 'LATIN SMALL LETTER E WITH ACUTE'

# Category (two-letter code)
unicodedata.category('A')      # 'Lu' (Uppercase letter)
unicodedata.category('a')      # 'Ll' (Lowercase letter)
unicodedata.category('1')      # 'Nd' (Decimal number)
unicodedata.category(' ')      # 'Zs' (Space separator)
unicodedata.category('\u200B') # 'Cf' (Format character)

# Code point
ord('☃')                       # 9731 (decimal)
hex(ord('☃'))                  # '0x2603'

# Character from code point
chr(9731)                      # '☃'
chr(0x2603)                    # '☃'

# Numeric value
unicodedata.numeric('½')       # 0.5
unicodedata.digit('7')         # 7

# Normalization
unicodedata.normalize('NFC', 'e\u0301')   # 'é' (precomposed)
unicodedata.normalize('NFD', 'é')          # 'e\u0301' (decomposed)
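Normalization matters for comparison: the precomposed and decomposed forms render identically but compare unequal until both are normalized to the same form.

```python
import unicodedata

precomposed = '\u00e9'   # 'é' as a single code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

precomposed == decomposed              # False: different code point sequences
len(precomposed), len(decomposed)      # (1, 2)

# Normalize both sides before comparing
nfc = unicodedata.normalize('NFC', decomposed)
precomposed == nfc                     # True
```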

File I/O with Encoding

# Always specify encoding for text files
with open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Write with explicit encoding
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write('café ☃ 日本語')

# Handle BOMs in Windows files
with open('windows_file.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()  # BOM stripped automatically
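When a file's encoding is uncertain, passing an error handler to open() keeps the read from raising. A sketch using a temporary file for illustration, with Latin-1 bytes read back as UTF-8:

```python
import os
import tempfile

# Write raw Latin-1 bytes (hypothetical legacy file)
path = os.path.join(tempfile.mkdtemp(), 'legacy.txt')
with open(path, 'wb') as f:
    f.write('café'.encode('latin-1'))  # b'caf\xe9'

# Tolerant read: the stray byte becomes U+FFFD instead of raising
with open(path, 'r', encoding='utf-8', errors='replace') as f:
    text = f.read()  # 'caf\ufffd'
```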

Working with Emoji and Supplementary Characters

emoji = '😀'  # U+1F600
len(emoji)              # 1 (one code point)
ord(emoji)              # 128512 (0x1F600)
len(emoji.encode('utf-8'))  # 4 bytes

# Complex emoji with ZWJ sequences
family = '👨‍👩‍👧‍👦'  # Family emoji
len(family)             # 7 (4 emoji + 3 ZWJ chars)

# For grapheme cluster counting, use the third-party 'grapheme' package:
import grapheme
grapheme.length(family)  # 1
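Supplementary characters like U+1F600 also behave differently across encodings: in UTF-16 they occupy a surrogate pair (two 16-bit code units), which is why lengths reported by other languages and APIs can disagree with Python's code point count.

```python
emoji = '\U0001F600'  # 😀

len(emoji)                      # 1 code point in Python's str
len(emoji.encode('utf-8'))      # 4 bytes in UTF-8
len(emoji.encode('utf-16-le'))  # 4 bytes = two 16-bit units (a surrogate pair)
len(emoji.encode('utf-32-le'))  # 4 bytes = one 32-bit unit
```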

Sorting and Comparison

# Default sort uses code point order — correct for ASCII
sorted(['b', 'a', 'c'])  # ['a', 'b', 'c']

# Code point order may be wrong for accented characters
sorted(['cote', 'côte', 'coté'])  # May not match French dictionary order

# Locale-aware sorting (the locale must be installed on the system)
import locale
locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
words = ['cote', 'côte', 'coté', 'côté']
sorted(words, key=locale.strxfrm)  # French dictionary order

# For production-quality collation, use PyICU (third-party)
import icu
collator = icu.Collator.createInstance(icu.Locale('fr'))
sorted(words, key=collator.getSortKey)
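Related to comparison: case-insensitive equality should use str.casefold(), which applies Unicode full case folding and handles characters that lower() does not.

```python
# German ß folds to 'ss' under casefold, but not under lower
'Straße'.lower()     # 'straße'
'Straße'.casefold()  # 'strasse'

'Straße'.casefold() == 'STRASSE'.casefold()  # True
'Straße'.lower() == 'STRASSE'.lower()        # False ('straße' != 'strasse')
```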
