Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized

Unicode Deep Dive Unicode Deep Dive Feb 28, 2023

○ 1. What Is Unicode? The Universal Character Standard Explained
● 2. Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized
○ 3. Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared
○ 4. Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained
○ 5. Unicode Properties and Categories: Classifying Every Character
○ 6. Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist
○ 7. How Emoji Work in Unicode: From Code Points to Skin Tones
○ 8. CJK Unification: How Unicode Handles Chinese, Japanese, and Korean
○ 9. Unicode Version History: From 1.0 to 16.0 and Beyond
○ 10. Unicode CLDR: The Database Behind Every Localized App

Table of Contents

Unicode defines a space of 1,114,112 possible code points — far more than anyone will ever need, but there is a method to the arrangement. Rather than a flat list, code points are organized into a two-level hierarchy: planes and blocks. Understanding this structure helps you predict where to find specific characters, understand why some characters behave differently in code, and reason about Unicode's coverage of human writing systems.

The Overall Structure

Unicode's code point space spans from U+0000 to U+10FFFF. This range is divided into 17 planes, each containing exactly 65,536 (0x10000) code points. The planes are numbered 0 through 16.

Plane	Range	Name	Common Contents
0	U+0000–U+FFFF	Basic Multilingual Plane (BMP)	Most modern scripts, symbols
1	U+10000–U+1FFFF	Supplementary Multilingual Plane (SMP)	Historic scripts, emoji, music notation
2	U+20000–U+2FFFF	Supplementary Ideographic Plane (SIP)	Extension CJK ideographs
3	U+30000–U+3FFFF	Tertiary Ideographic Plane (TIP)	Extremely rare CJK extensions
4–13	U+40000–U+DFFFF	Unassigned	Reserved for future use
14	U+E0000–U+EFFFF	Supplementary Special-purpose Plane (SSP)	Tags, variation selectors
15	U+F0000–U+FFFFF	Supplementary Private Use Area-A	Private use
16	U+100000–U+10FFFF	Supplementary Private Use Area-B	Private use

Not all planes are populated. Planes 4 through 13 are entirely unassigned — reserved for hypothetical future characters. As of Unicode 16.0, only planes 0, 1, 2, 3, and 14 contain assigned characters.

Plane 0: The Basic Multilingual Plane

The BMP is by far the most important plane. It contains the characters needed for virtually all modern living languages, plus a vast array of symbols.

Why the BMP Is Special

The BMP's special status is partly historical. When Unicode was first designed, its creators hoped 65,536 characters would be sufficient for all human writing. That turned out to be optimistic — more characters were needed, particularly for CJK (Chinese, Japanese, Korean) ideographs — but the BMP retained its privileged status.

From a technical standpoint, the BMP is special in UTF-16: every BMP code point fits in a single 16-bit code unit. Characters outside the BMP require surrogate pairs — two 16-bit code units — which complicates string handling in languages like JavaScript and Java that use UTF-16 internally. We cover this in detail in Unicode Encodings Explained.

Major BMP Blocks

The BMP contains over 150 named blocks. Here are the most significant:

Basic Latin (U+0000–U+007F) The first 128 code points are identical to ASCII. This compatibility was deliberate — UTF-8 encodes these characters identically to ASCII, ensuring backward compatibility with ASCII-encoded files.

Latin-1 Supplement (U+0080–U+00FF) Western European characters: accented vowels (é, ü, ñ), ligatures (æ, œ), and symbols (©, ®, °). These were the characters added in Latin-1 (ISO 8859-1) and are by far the most frequently used non-ASCII characters on the Western web.

General Punctuation (U+2000–U+206F) Typographic punctuation marks including the em dash (—, U+2014), en dash (–, U+2013), curly quotes (" " ' '), ellipsis (…, U+2026), and zero-width characters critical for text processing.

Currency Symbols (U+20A0–U+20CF) Currency symbols including the Euro sign (€, U+20AC), Bitcoin sign (₿, U+20BF), Indian Rupee (₹, U+20B9), and others. Note that $ (U+0024), £ (U+00A3), and ¥ (U+00A5) appear in earlier blocks.

Mathematical Operators (U+2200–U+22FF) The full suite of standard mathematical operators: ∀ ∃ ∈ ∉ ∩ ∪ ∑ ∏ ∫ ∞ ≤ ≥ ≠ ≈ and more.

CJK Unified Ideographs (U+4E00–U+9FFF) The main CJK block containing 20,902 ideographs shared across Chinese, Japanese, and Korean. Understanding the Han Unification process — and the controversy it generated — is explored in CJK Unification: How Unicode Handles Chinese, Japanese, and Korean.

Hangul Syllables (U+AC00–U+D7AF) 11,172 precomposed Korean syllable blocks, algorithmically generated from the 19 consonants and 21 vowels of the Korean alphabet.

Surrogates (U+D800–U+DFFF) These 2,048 code points are permanently reserved for UTF-16 surrogate pairs and are not valid Unicode scalar values. No character will ever be assigned to this range.

Private Use Area — BMP (U+E000–U+F8FF) 6,400 code points reserved for private agreements between parties. Common uses include corporate logo characters, game-specific symbols, and the icons in icon fonts like Font Awesome.

Specials (U+FFF0–U+FFFF) Includes U+FEFF (the Byte Order Mark), U+FFFD (the Replacement Character — the □ you see when a character cannot be displayed), and U+FFFF/U+FFFE (non-characters used as sentinels).

Plane 1: Supplementary Multilingual Plane

The SMP is the second-largest populated plane, containing a diverse mix of historic scripts, specialized notation systems, and — most visibly to everyday users — emoji.

Linear B (U+10000–U+1007F) The writing system of Mycenaean Greek, deciphered in 1952. Used primarily by classicists and archaeologists.

Gothic (U+10330–U+1034F) The alphabet used for the Gothic Bible translation by Bishop Wulfila (4th century CE), now extinct as a spoken language.

Old Turkic (U+10C00–U+10C4F) The runic script used for the Old Turkic (Orkhon) inscriptions of Central Asia.

Musical Symbols (U+1D100–U+1D1FF) A comprehensive set of Western musical notation symbols: clefs, note durations, rests, dynamics, ornaments, and more. Applications like music engraving software use these for text-based representation of scores.

Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) Styled variants of mathematical letters: bold, italic, script, fraktur, double-struck, and monospace. These are used in mathematical typesetting when semantic distinction between letter styles is important. For example, ℕ (U+2115, double-struck N) represents the natural numbers.

Emoji (scattered through U+1F000–U+1FAFF) Multiple emoji blocks including Mahjong Tiles, Domino Tiles, Playing Cards, Enclosed Alphanumeric Supplement, Miscellaneous Symbols and Pictographs, Emoticons, and more. The distribution across non-contiguous blocks reflects the historical process of emoji being added to Unicode incrementally. See How Emoji Work in Unicode for the full story.

Plane 2: Supplementary Ideographic Plane

The SIP exists almost entirely to accommodate the enormous number of CJK ideographs that could not fit in the BMP.

CJK Unified Ideographs Extension B (U+20000–U+2A6DF) 42,720 additional CJK characters, primarily rare characters used in classical literature, historical documents, and personal name registers in China, Japan, and Korea. Most everyday users will never need them.

CJK Unified Ideographs Extensions C, D, E, F (U+2A700–U+2CEAF) Further CJK extensions added in successive Unicode versions, totaling tens of thousands of additional ideographs.

CJK Compatibility Ideographs Supplement (U+2F800–U+2FA1F) Compatibility forms of ideographs that had already been encoded elsewhere, maintained for round-trip compatibility with older East Asian standards.

Plane 3: Tertiary Ideographic Plane

Added in Unicode 13.0, TIP currently contains only one block:

CJK Unified Ideographs Extension G (U+30000–U+3134F) 4,939 extremely rare ideographs, primarily from oracle bone inscriptions and other ancient Chinese writing. These characters are so obscure that even specialists rarely encounter them in practice.

Plane 14: Supplementary Special-purpose Plane

Tags (U+E0000–U+E007F) Originally intended for language tags embedded in plain text, most of these were deprecated in Unicode 5.0. However, U+E0001 through U+E007F were repurposed in Unicode 8.0 as the basis for emoji flag sequences. The flag emoji (🇺🇸, 🇯🇵, etc.) are encoded as pairs of Regional Indicator letters from this plane.

Variation Selectors Supplement (U+E0100–U+E01EF) 240 additional variation selectors, supplementing the 16 variation selectors in the BMP (U+FE00–U+FE0F). Variation selectors modify the appearance of the preceding character — for example, selecting between text and emoji presentation, or between different glyphic variants of a CJK character.

Planes 15–16: Private Use Areas

Both planes are entirely reserved as Private Use Areas (PUAs). Anyone may use these code points for any purpose, with the understanding that their meaning is application-specific and not interoperable without a shared agreement.

Common uses include: - Corporate fonts: Company logos embedded as characters - Proprietary symbol sets: Specialized technical or scientific notation - Game development: Game-specific iconography - Academic research: Specialized notation for linguistics, musicology, etc.

The BMP also contains a PUA (U+E000–U+F8FF), but it is much smaller (6,400 points vs. 65,534 points per supplementary PUA plane).

Understanding Blocks

Within each plane, code points are further organized into blocks — named, contiguous ranges that group related characters. Block boundaries are always multiples of 16, and blocks never overlap.

As of Unicode 16.0, there are 333 defined blocks. Some are densely populated (the CJK Unified Ideographs block is completely full with 20,902 characters), while others are sparse (the Gothic block has 27 assigned characters out of 32 possible).

How Blocks Are Named

Block names follow patterns that reflect: - Script name: "Arabic," "Thai," "Hangul" - Character type: "Mathematical Operators," "Currency Symbols," "Arrows" - Historical origin: "Latin Extended-A," "CJK Compatibility Forms" - Purpose: "Private Use Area," "Surrogates"

Browsing Blocks

The SymbolFYI Symbol Table lets you browse characters by Unicode block. The Unicode Lookup tool identifies which block any character belongs to.

In Python, you can determine a character's block programmatically:

import unicodedata

def get_block(char):
    cp = ord(char)
    # Unicode block data from unicodedata module (Python 3.13+)
    # For older Python, use the 'unicodeblock' package
    name = unicodedata.name(char, 'UNKNOWN')
    return name

# Check which block a character is in
print(unicodedata.name('A'))    # LATIN CAPITAL LETTER A
print(unicodedata.name('α'))    # GREEK SMALL LETTER ALPHA
print(unicodedata.name('中'))   # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.name('😀'))  # GRINNING FACE

In JavaScript, there is no built-in block lookup, but you can use the unicode-properties npm package or determine blocks by code point range:

function getUnicodePlane(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

const cp = '😀'.codePointAt(0);  // 128512
console.log(getUnicodePlane(cp)); // 1 (Supplementary Multilingual Plane)

// Check if code point is in BMP
console.log(cp <= 0xFFFF);  // false — 😀 is in Plane 1

The Private Use Area in Practice

The PUA deserves special attention because it is widely used in production systems but frequently misunderstood.

Icon fonts like Font Awesome and Material Icons map their icons to PUA code points in the BMP. When you use a class like fa-heart, the CSS inserts a character like U+F004 — a PUA code point that the font maps to a heart icon image. In a different font, U+F004 might display as nothing, a box, or something completely different.

This approach is technically valid — the PUA is designed for exactly this kind of private agreement. The implication for developers is that PUA characters are never safe to store without context, and they must never be processed as if they have semantic Unicode meaning.

Emoji recommendation: If you need to display icons in text, use actual Unicode emoji or SVG icons rather than PUA-based icon fonts. This improves accessibility (screen readers can handle emoji with proper alt text; PUA characters are meaningless without the font), and it works without font loading.

Code Point Density and Sparseness

Not every possible code point within a plane has an assigned character. Out of 1,114,112 total possible code points:

154,998 are assigned to characters (Unicode 16.0)
137,928 are reserved for future use
66,818 are designated for Private Use
2,048 are permanently reserved for surrogates
66 are designated as non-characters (U+FDD0–U+FDEF, and U+xxFFFE/U+xxFFFF for each plane)
~750,000 are simply unassigned

This sparseness is intentional. Unicode's designers reserved huge blocks of space to accommodate future discoveries — lost scripts, newly documented languages, specialized notation systems we cannot yet anticipate.

Finding Characters by Block

When hunting for a specific symbol, knowing the block structure helps:

Mathematical symbols: Check Mathematical Operators (U+2200), Supplemental Mathematical Operators (U+2A00), Mathematical Alphanumeric Symbols (U+1D400), or Letterlike Symbols (U+2100)
Arrows: Arrows (U+2190), Supplemental Arrows-A/B/C (U+27F0, U+2900, U+1F800), Miscellaneous Technical (U+2300)
Box drawing: Box Drawing (U+2500), Block Elements (U+2580)
Dingbats and decorative: Dingbats (U+2700), Ornamental Dingbats (U+1F650), Miscellaneous Symbols (U+2600)
Historic Latin variants: Latin Extended-A through G, scattered through U+0100–U+AB6F

Our Symbol Table organizes all of this by block so you can browse visually.

Summary

Unicode organizes its 1,114,112 code points into 17 planes of 65,536 each, further subdivided into named blocks. The key planes to know:

Plane 0 (BMP): All modern scripts, most symbols — the everyday Unicode
Plane 1 (SMP): Historic scripts, emoji, musical notation
Plane 2 (SIP): Rare CJK extensions
Planes 15–16: Private Use Areas

The BMP's special status — all code points fit in 16 bits — has lasting technical implications for UTF-16-based languages. Understanding which plane a character lives in helps explain surrogate pair behavior and string length anomalies you may encounter in JavaScript and Java.

Next in Series: Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared — Understand how code points are actually stored as bytes, and why UTF-8 won the web.