Unicode

Unicode Standard
Definition

A universal character encoding standard that assigns a unique number (code point) to every character across all writing systems.

Unicode is the universal character encoding standard that assigns a unique number — called a code point — to every character across all human writing systems, symbols, and emoji. Maintained by the Unicode Consortium, it is the foundation of modern text handling in software.

What Is a Code Point?

A code point is written as U+ followed by a hexadecimal number. For example:

  • U+0041: A (Latin capital letter A)
  • U+4E2D: 中 (Chinese character for "middle")
  • U+1F600: 😀 (grinning face emoji)
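The notation above maps directly onto Python's built-in ord() and chr(). The following is a minimal sketch; the helper names to_u_notation and from_u_notation are illustrative and not part of any library.

```python
def to_u_notation(ch: str) -> str:
    """Format one character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Parse a label such as 'U+4E2D' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))


for ch in "A中😀":
    print(to_u_notation(ch), ch)
# U+0041 A
# U+4E2D 中
# U+1F600 😀

print(from_u_notation("U+1F600"))  # 😀
```

By convention the hexadecimal part is padded to four digits, and code points above U+FFFF simply use five or six.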

The Unicode code space runs from U+0000 to U+10FFFF, which gives 1,114,112 possible code points. Only a fraction of these are assigned to characters: about 155,000 as of Unicode 16.0.
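The size of the code space follows from the upper bound by simple arithmetic. As a small sketch, Python (3.3 and later) exposes the same limit as sys.maxunicode and rejects anything beyond it; the constant name CODE_SPACE_SIZE is only illustrative.

```python
import sys

# U+0000 through U+10FFFF inclusive
CODE_SPACE_SIZE = 0x10FFFF + 1
print(CODE_SPACE_SIZE)                # 1114112
print(sys.maxunicode == 0x10FFFF)     # True

# chr() refuses values past the end of the code space
try:
    chr(0x10FFFF + 1)
except ValueError as exc:
    print("rejected:", exc)
```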

Unicode vs. Encodings

Unicode itself is an abstract standard — it defines what each code point represents, but not how those code points are stored as bytes. That is the job of encodings such as:

  • UTF-8: Variable-width (1–4 bytes), ASCII-compatible, the dominant encoding on the web
  • UTF-16: Variable-width (2 or 4 bytes), used internally by Windows and Java
  • UTF-32: Fixed-width (4 bytes per code point), simple but memory-inefficient

# Python: encode a string to bytes using UTF-8
text = 'Hello, 世界'
bytes_utf8 = text.encode('utf-8')
print(bytes_utf8)  # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'

# Decode back to string
print(bytes_utf8.decode('utf-8'))  # Hello, 世界
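To make the width differences between the three encodings concrete, the sketch below counts the bytes each one needs for a few sample characters. The helper name encoded_lengths is illustrative, and the little-endian codec names are chosen here only so that Python does not prepend a byte order mark to the count.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes needed for ch in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in ["A", "\u00e9", "中", "😀"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+4E2D {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The emoji lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (two 16-bit units, 4 bytes) to represent it.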

Unicode Planes

The Unicode code space is divided into 17 planes, each containing 65,536 code points:

  • Plane 0 (Basic Multilingual Plane, BMP): U+0000–U+FFFF, covers most modern scripts
  • Plane 1 (Supplementary Multilingual Plane): emoji, historic scripts, musical notation
  • Planes 2–3: CJK unified ideograph extensions
  • Planes 4–13: Unassigned
  • Plane 14: Supplementary special-purpose characters
  • Planes 15–16: Private use areas
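Because each plane holds exactly 65,536 (0x10000) code points, the plane number is just the code point shifted right by 16 bits. A minimal sketch, with plane_of and plane_range as illustrative helper names:

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) that contains the character."""
    return ord(ch) >> 16


def plane_range(plane: int) -> tuple:
    """Return the first and last code point of a plane."""
    if not 0 <= plane <= 16:
        raise ValueError("Unicode has planes 0 through 16")
    start = plane * PLANE_SIZE
    return start, start + PLANE_SIZE - 1


print(plane_of("A"))    # 0
print(plane_of("😀"))   # 1
print([f"U+{cp:04X}" for cp in plane_range(1)])   # ['U+10000', 'U+1FFFF']
print([f"U+{cp:04X}" for cp in plane_range(16)])  # ['U+100000', 'U+10FFFF']
```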

Unicode Categories

Every character belongs to a general category such as Lu (uppercase letter), Nd (decimal digit), Po (other punctuation), or So (other symbol). These categories are used extensively in regular expressions and text processing.

import unicodedata

print(unicodedata.category('A'))   # Lu (uppercase letter)
print(unicodedata.category('3'))   # Nd (decimal digit)
print(unicodedata.category('!'))   # Po (other punctuation)
print(unicodedata.name('😀'))      # GRINNING FACE
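As one example of category-driven text processing, the sketch below tallies the categories in a string and drops every character whose category starts with P (the punctuation classes). The function names are illustrative, and the results depend on the Unicode data bundled with the Python version in use.

```python
import unicodedata
from collections import Counter


def category_counts(text: str) -> Counter:
    """Count how many characters fall into each general category."""
    return Counter(unicodedata.category(ch) for ch in text)


def strip_punctuation(text: str) -> str:
    """Remove characters in any punctuation category (Pc, Pd, Po, ...)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


sample = "Hello, 世界! 123"
counts = category_counts(sample)
print(counts["Lu"], counts["Ll"], counts["Lo"], counts["Nd"], counts["Po"])
# 1 4 2 3 2
print(strip_punctuation(sample))  # Hello 世界 123
```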

Normalization

The same visual character can sometimes be represented by multiple code point sequences. Unicode defines normalization forms (NFC, NFD, NFKC, NFKD) to ensure consistent comparison:

import unicodedata

# 'é' can be one precomposed code point or two (e + combining accent)
a = '\u00e9'           # precomposed
b = 'e\u0301'          # decomposed
print(a == b)          # False: different code point sequences
print(unicodedata.normalize('NFC', b) == a)  # True
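The four forms differ in two ways: composed (NFC, NFKC) versus decomposed (NFD, NFKD), and canonical versus compatibility (the K forms), which also folds characters such as ligatures. A small sketch of both distinctions, using a hypothetical same_text helper for normalized comparison:

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonical equivalence: precomposed vs. decomposed 'é'
print(same_text("\u00e9", "e\u0301"))                # True

# NFD splits the precomposed character into base letter + combining accent
print(len(unicodedata.normalize("NFD", "\u00e9")))   # 2

# Compatibility equivalence: the 'fi' ligature (U+FB01) folds only under NFKC/NFKD
ligature = "\ufb01"
print(same_text(ligature, "fi", form="NFC"))         # False
print(same_text(ligature, "fi", form="NFKC"))        # True
```

NFC is a common choice for storage and comparison. The K forms discard formatting distinctions, so they tend to suit search and matching better than round-tripping text.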

Why Unicode Matters

Before Unicode, hundreds of incompatible encodings existed (Latin-1, Shift-JIS, Windows-1252, etc.), causing garbled text when data crossed system boundaries. Unicode eliminated this fragmentation and made it possible to represent any language in a single document — the prerequisite for the global, multilingual web.
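The garbling described above is easy to reproduce: bytes written in one encoding and read back in another yield the wrong characters. The sketch below is illustrative only, using UTF-8 bytes misread as Latin-1.

```python
original = "caf\u00e9"                # 'café'
raw = original.encode("utf-8")        # b'caf\xc3\xa9'

# A reader that wrongly assumes Latin-1 turns the two-byte 'é' into two characters
garbled = raw.decode("latin-1")
print(garbled)                        # cafÃ©

# Reversing the wrong step recovers the text, but only if no bytes were lost
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)           # True
```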
