Unicode in Python
Python 3 made Unicode strings the default string type, eliminating the Python 2 byte/unicode confusion. The str type represents a sequence of Unicode code points, and the language provides comprehensive support for encoding, decoding, and introspecting Unicode text through the standard library.
str Is Unicode
In Python 3, string literals are Unicode by default. The source file is interpreted as UTF-8 (PEP 3120):
# All valid Python 3 string literals
text = 'hello'
text = 'café'
text = '日本語'
text = '\u2603' # SNOWMAN by escape
text = '\U0001F600' # 😀 by full code point escape
text = '\N{SNOWMAN}' # By Unicode character name
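The escape forms above all denote code points, so they can be mixed and compared freely; a quick stdlib-only check:

```python
# The three escape styles all produce ordinary str values
snowman = '\u2603'                 # 4-hex-digit escape, BMP only
assert snowman == '\N{SNOWMAN}' == chr(0x2603)
assert len(snowman) == 1           # one code point

# \U takes 8 hex digits and reaches beyond U+FFFF
grinning = '\U0001F600'
assert grinning == '\N{GRINNING FACE}'
assert ord(grinning) == 0x1F600
```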
Encoding and Decoding
Conversion between str (Unicode) and bytes is explicit:
# str → bytes: encode
'café'.encode('utf-8') # b'caf\xc3\xa9'
'café'.encode('latin-1') # b'caf\xe9'
'日本語'.encode('utf-8') # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
# bytes → str: decode
b'caf\xc3\xa9'.decode('utf-8') # 'café'
b'caf\xe9'.decode('latin-1') # 'café'
# Handling errors
'café'.encode('ascii') # UnicodeEncodeError
'café'.encode('ascii', errors='ignore') # b'caf'
'café'.encode('ascii', errors='replace') # b'caf?'
'café'.encode('ascii', errors='xmlcharrefreplace') # b'caf&#233;'
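Decoding has the same error-handler mechanism; a short sketch of two more handlers, assuming nothing beyond the standard library:

```python
s = 'café'

# backslashreplace keeps an escaped form of the lost character
assert s.encode('ascii', errors='backslashreplace') == b'caf\\xe9'

# Decode-side handlers: b'\xe9' is valid latin-1 but invalid as UTF-8
data = b'caf\xe9'
assert data.decode('utf-8', errors='replace') == 'caf\ufffd'  # U+FFFD
assert data.decode('utf-8', errors='ignore') == 'caf'

# surrogateescape smuggles undecodable bytes through a str round trip
text = data.decode('utf-8', errors='surrogateescape')
assert text.encode('utf-8', errors='surrogateescape') == data
```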
The unicodedata Module
import unicodedata
# Character name
unicodedata.name('☃') # 'SNOWMAN'
unicodedata.name('é') # 'LATIN SMALL LETTER E WITH ACUTE'
# Category (two-letter code)
unicodedata.category('A') # 'Lu' (Uppercase letter)
unicodedata.category('a') # 'Ll' (Lowercase letter)
unicodedata.category('1') # 'Nd' (Decimal number)
unicodedata.category(' ') # 'Zs' (Space separator)
unicodedata.category('\u200B') # 'Cf' (Format character)
# Code point
ord('☃') # 9731 (decimal)
hex(ord('☃')) # '0x2603'
# Character from code point
chr(9731) # '☃'
chr(0x2603) # '☃'
# Numeric value
unicodedata.numeric('½') # 0.5
unicodedata.digit('7') # 7
# Normalization
unicodedata.normalize('NFC', 'e\u0301') # 'é' (precomposed)
unicodedata.normalize('NFD', 'é') # 'e\u0301' (decomposed)
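Normalization is what makes visually identical strings compare equal; a minimal stdlib-only sketch:

```python
import unicodedata

# Same rendered text, different code points
precomposed = 'caf\u00e9'     # é as one code point
decomposed = 'cafe\u0301'     # e + COMBINING ACUTE ACCENT
assert precomposed != decomposed

# Normalize both sides to the same form before comparing
assert (unicodedata.normalize('NFC', precomposed)
        == unicodedata.normalize('NFC', decomposed))

# NFKC additionally folds compatibility characters (lossy, use with care)
assert unicodedata.normalize('NFKC', '\u2460') == '1'  # CIRCLED DIGIT ONE
```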
File I/O with Encoding
# Always specify encoding for text files
with open('data.txt', 'r', encoding='utf-8') as f:
text = f.read()
# Write with explicit encoding
with open('output.txt', 'w', encoding='utf-8') as f:
f.write('café ☃ 日本語')
# Handle BOMs in Windows files
with open('windows_file.txt', 'r', encoding='utf-8-sig') as f:
text = f.read() # BOM stripped automatically
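Putting the pieces together, a self-contained round-trip sketch using a temporary file (the `path` name is illustrative):

```python
import os
import tempfile

text = 'café ☃ 日本語'
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

# Write and read back with matching encodings
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading with the wrong encoding fails loudly instead of corrupting data
raised = False
try:
    with open(path, encoding='ascii') as f:
        f.read()
except UnicodeDecodeError:
    raised = True
assert raised

with open(path, encoding='utf-8') as f:
    assert f.read() == text
```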
Working with Emoji and Supplementary Characters
emoji = '😀' # U+1F600
len(emoji) # 1 (one code point)
ord(emoji) # 128512 (0x1F600)
len(emoji.encode('utf-8')) # 4 bytes
# Complex emoji with ZWJ sequences
family = '👨\u200d👩\u200d👧\u200d👦' # Family emoji (a ZWJ sequence)
len(family) # 7 (4 emoji + 3 ZERO WIDTH JOINER characters)
# For grapheme cluster counting, use the third-party grapheme package:
import grapheme # pip install grapheme
grapheme.length(family) # 1
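Because `len` and slicing operate on code points, naive slicing can cut through a ZWJ sequence; a stdlib-only sketch:

```python
# The family emoji spelled out: 4 person emoji joined by ZERO WIDTH JOINERs
family = '\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466'

assert len(family) == 7                    # code points, not graphemes
assert len(family.encode('utf-8')) == 25   # 4*4 emoji bytes + 3*3 ZWJ bytes

# Slicing by code point can leave a dangling joiner that renders oddly
head = family[:2]
assert head == '\U0001F468\u200D'
```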
Sorting and Comparison
# Default sort uses code point order — correct for ASCII
sorted(['b', 'a', 'c']) # ['a', 'b', 'c']
# Code point order may be wrong for accented characters
sorted(['cote', 'côte', 'coté']) # May not match French dictionary order
# Locale-aware sorting (requires the fr_FR.UTF-8 locale to be installed)
import locale
locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
words = ['cote', 'côte', 'coté']
sorted(words, key=locale.strxfrm)
# For production collation: use the third-party PyICU package
import icu # pip install PyICU
collator = icu.Collator.createInstance(icu.Locale('fr'))
sorted(words, key=collator.getSortKey)
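When neither an installed locale nor PyICU is available, a rough accent-insensitive sort key can be built from NFD decomposition plus `casefold`; `accent_insensitive_key` below is a hypothetical helper and not full Unicode collation:

```python
import unicodedata

def accent_insensitive_key(s):
    """Strip combining marks after NFD, then casefold (approximate only)."""
    decomposed = unicodedata.normalize('NFD', s)
    stripped = ''.join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.casefold()

items = ['zebra', 'Éclair', 'apple']
assert sorted(items, key=accent_insensitive_key) == ['apple', 'Éclair', 'zebra']
```

This ignores language-specific tailoring (e.g. French sorts accents right-to-left), so it is a fallback, not a replacement for real collation.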