Unicode
Unicode is the universal character encoding standard that assigns a unique number — called a code point — to every character across all human writing systems, symbols, and emoji. Maintained by the Unicode Consortium, it is the foundation of modern text handling in software.
What Is a Code Point?
A code point is written as U+ followed by a hexadecimal number. For example:
- U+0041 → A (Latin capital letter A)
- U+4E2D → 中 (Chinese character for "middle")
- U+1F600 → 😀 (grinning face emoji)
The Unicode code space runs from U+0000 to U+10FFFF, giving 1,114,112 possible code points, of which about 149,000 are currently assigned to characters.
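In Python, for example, ord() and chr() convert between a character and its code point number, and chr() rejects anything outside the valid range:

```python
# A code point is just an integer; ord() and chr() convert between them
print(ord('A'))              # 65 (U+0041)
print(f'U+{ord("中"):04X}')  # U+4E2D
print(chr(0x1F600))          # 😀

# Values above U+10FFFF are not valid code points
try:
    chr(0x110000)
except ValueError as e:
    print(e)  # chr() arg not in range(0x110000)
```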
Unicode vs. Encodings
Unicode itself is an abstract standard — it defines what each code point represents, but not how those code points are stored as bytes. That is the job of encodings such as:
- UTF-8: Variable-width (1–4 bytes), ASCII-compatible, the dominant encoding on the web
- UTF-16: Variable-width (2 or 4 bytes), used internally by Windows and Java
- UTF-32: Fixed-width (4 bytes per code point), simple but memory-inefficient
# Python: encode a string to bytes using UTF-8
text = 'Hello, 世界'
bytes_utf8 = text.encode('utf-8')
print(bytes_utf8) # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
# Decode back to string
print(bytes_utf8.decode('utf-8')) # Hello, 世界
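The width differences between the three encodings are easy to observe by encoding the same string each way. A short sketch (using the little-endian UTF-16/32 variants so no byte-order mark is added):

```python
# Same string, three encodings: ASCII chars cost 1 byte in UTF-8,
# BMP characters like 中 cost 3 (UTF-8) or 2 (UTF-16), and the emoji
# outside the BMP costs 4 bytes in every variable-width form.
text = 'A中😀'
for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
    print(enc, len(text.encode(enc)))
# utf-8     1 + 3 + 4 = 8 bytes
# utf-16-le 2 + 2 + 4 = 8 bytes
# utf-32-le 4 + 4 + 4 = 12 bytes
```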
Unicode Planes
The Unicode code space is divided into 17 planes, each containing 65,536 code points:
- Plane 0 (Basic Multilingual Plane, BMP): U+0000–U+FFFF, covers most modern scripts
- Plane 1 (Supplementary Multilingual Plane): emoji, historic scripts, musical notation
- Planes 2–3: CJK unified ideograph extensions
- Planes 4–13: Unassigned
- Plane 14: Supplementary special-purpose characters
- Planes 15–16: Private use areas
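Since each plane holds 65,536 (2^16) code points, a character's plane is simply its code point divided by 0x10000. A small illustration (the helper name plane is ours, not part of any library):

```python
# Plane number = code point // 0x10000 (equivalently, a right shift by 16)
def plane(ch):
    return ord(ch) >> 16

print(plane('A'))   # 0 -- Basic Multilingual Plane
print(plane('😀'))  # 1 -- Supplementary Multilingual Plane
```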
Unicode Categories
Every character belongs to a general category such as Lu (uppercase letter), Nd (decimal digit), Po (other punctuation), or So (other symbol). These categories are used extensively in regular expressions and text processing.
import unicodedata
print(unicodedata.category('A')) # Lu (uppercase letter)
print(unicodedata.category('3')) # Nd (decimal digit)
print(unicodedata.category('!')) # Po (other punctuation)
print(unicodedata.name('😀')) # GRINNING FACE
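Categories make it straightforward to filter text without enumerating characters by hand. For instance, the first letter of the category code groups related categories (L for all letters, N for all numbers), so keeping only letters and digits is a one-liner:

```python
import unicodedata

# Keep only characters whose major category is Letter (L*) or Number (N*)
s = 'Hello, 世界! 123'
kept = ''.join(c for c in s if unicodedata.category(c)[0] in ('L', 'N'))
print(kept)  # Hello世界123
```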
Normalization
The same visual character can sometimes be represented by multiple code point sequences. Unicode defines normalization forms (NFC, NFD, NFKC, NFKD) to ensure consistent comparison:
import unicodedata
# 'é' can be one precomposed code point or two (e + combining accent)
a = '\u00e9' # precomposed
b = 'e\u0301' # decomposed
print(a == b) # False — different byte sequences
print(unicodedata.normalize('NFC', b) == a) # True
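The K forms additionally apply compatibility mappings, which replace presentation variants with their plain equivalents. For example, NFKC folds the 'ﬁ' ligature and superscript digits into ordinary characters:

```python
import unicodedata

# NFKC = compatibility decomposition followed by canonical composition
print(unicodedata.normalize('NFKC', 'ﬁle'))  # file
print(unicodedata.normalize('NFKC', 'x²'))   # x2
```

Because compatibility normalization is lossy (the distinction between '²' and '2' is gone), NFKC/NFKD are typically used for searching and identifier matching, while NFC is the usual choice for storage.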
Why Unicode Matters
Before Unicode, hundreds of incompatible encodings existed (Latin-1, Shift-JIS, Windows-1252, etc.), causing garbled text when data crossed system boundaries. Unicode eliminated this fragmentation and made it possible to represent any language in a single document — the prerequisite for the global, multilingual web.