Unicode
Unicode is the universal character encoding standard that assigns a unique number — called a code point — to every character across all human writing systems, symbols, and emoji. Maintained by the Unicode Consortium, it is the foundation of modern text handling in software.
What Is a Code Point?
A code point is written as U+ followed by a hexadecimal number. For example:
- U+0041 → A (Latin capital letter A)
- U+4E2D → 中 (Chinese character for "middle")
- U+1F600 → 😀 (grinning face emoji)
The Unicode code space runs from U+0000 to U+10FFFF, giving 1,114,112 possible code points, of which about 149,000 are currently assigned to characters.
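In Python, for example, ord() and chr() convert between a character and its code point number, and chr() rejects anything outside the valid range:

```python
# A code point is just an integer; ord() and chr() convert between them
print(ord('A'))              # 65 (U+0041)
print(f'U+{ord("中"):04X}')  # U+4E2D
print(chr(0x1F600))          # 😀

# Values above U+10FFFF are not valid code points
try:
    chr(0x110000)
except ValueError as e:
    print(e)  # chr() arg not in range(0x110000)
```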
Unicode vs. Encodings
Unicode itself is an abstract standard — it defines what each code point represents, but not how those code points are stored as bytes. That is the job of encodings such as:
- UTF-8: Variable-width (1–4 bytes), ASCII-compatible, the dominant encoding on the web
- UTF-16: Variable-width (2 or 4 bytes), used internally by Windows and Java
- UTF-32: Fixed-width (4 bytes per code point), simple but memory-inefficient
# Python: encode a string to bytes using UTF-8
text = 'Hello, 世界'
bytes_utf8 = text.encode('utf-8')
print(bytes_utf8) # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
# Decode back to string
print(bytes_utf8.decode('utf-8')) # Hello, 世界
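The width differences between the three encodings are easy to observe by encoding the same string each way. A short sketch (using the little-endian UTF-16/32 variants so no byte-order mark is added):

```python
# Same string, three encodings: ASCII chars cost 1 byte in UTF-8,
# BMP characters like 中 cost 3 (UTF-8) or 2 (UTF-16), and the emoji
# outside the BMP costs 4 bytes in every variable-width form.
text = 'A中😀'
for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
    print(enc, len(text.encode(enc)))
# utf-8     1 + 3 + 4 = 8 bytes
# utf-16-le 2 + 2 + 4 = 8 bytes
# utf-32-le 4 + 4 + 4 = 12 bytes
```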
Unicode Planes
The Unicode code space is divided into 17 planes, each containing 65,536 code points:
- Plane 0 (Basic Multilingual Plane, BMP): U+0000–U+FFFF, covers most modern scripts
- Plane 1 (Supplementary Multilingual Plane): emoji, historic scripts, musical notation
- Planes 2–3: CJK unified ideograph extensions
- Planes 4–13: Unassigned
- Plane 14: Supplementary special-purpose characters
- Planes 15–16: Private use areas
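Since each plane holds 65,536 (2^16) code points, a character's plane is simply its code point divided by 0x10000. A small illustration (the helper name plane is ours, not part of any library):

```python
# Plane number = code point // 0x10000 (equivalently, a right shift by 16)
def plane(ch):
    return ord(ch) >> 16

print(plane('A'))   # 0 -- Basic Multilingual Plane
print(plane('😀'))  # 1 -- Supplementary Multilingual Plane
```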
Unicode Categories
Every character belongs to a general category such as Lu (uppercase letter), Nd (decimal digit), Po (other punctuation), or So (other symbol). These categories are used extensively in regular expressions and text processing.
import unicodedata
print(unicodedata.category('A')) # Lu (uppercase letter)
print(unicodedata.category('3')) # Nd (decimal digit)
print(unicodedata.category('!')) # Po (other punctuation)
print(unicodedata.name('😀')) # GRINNING FACE
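Categories make it straightforward to filter text without enumerating characters by hand. For instance, the first letter of the category code groups related categories (L for all letters, N for all numbers), so keeping only letters and digits is a one-liner:

```python
import unicodedata

# Keep only characters whose major category is Letter (L*) or Number (N*)
s = 'Hello, 世界! 123'
kept = ''.join(c for c in s if unicodedata.category(c)[0] in ('L', 'N'))
print(kept)  # Hello世界123
```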
Normalization
The same visual character can sometimes be represented by multiple code point sequences. Unicode defines normalization forms (NFC, NFD, NFKC, NFKD) to ensure consistent comparison:
import unicodedata
# 'é' can be one precomposed code point or two (e + combining accent)
a = '\u00e9' # precomposed
b = 'e\u0301' # decomposed
print(a == b) # False — different byte sequences
print(unicodedata.normalize('NFC', b) == a) # True
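The K forms additionally apply compatibility mappings, which replace presentation variants with their plain equivalents. For example, NFKC folds the 'ﬁ' ligature and superscript digits into ordinary characters:

```python
import unicodedata

# NFKC = compatibility decomposition followed by canonical composition
print(unicodedata.normalize('NFKC', 'ﬁle'))  # file
print(unicodedata.normalize('NFKC', 'x²'))   # x2
```

Because compatibility normalization is lossy (the distinction between '²' and '2' is gone), NFKC/NFKD are typically used for searching and identifier matching, while NFC is the usual choice for storage.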
Why Unicode Matters
Before Unicode, hundreds of incompatible encodings existed (Latin-1, Shift-JIS, Windows-1252, etc.), causing garbled text when data crossed system boundaries. Unicode eliminated this fragmentation and made it possible to represent any language in a single document — the prerequisite for the global, multilingual web.