SymbolFYI

What Is a Code Point? Understanding Unicode's U+ Notation

Reference Apr 4, 2023

Every character you type, paste, or render on screen has a number behind it. In Unicode, that number is called a code point — the fundamental unit of the Unicode standard. Understanding code points unlocks how text truly works at the lowest level, and explains why character encoding, emoji lengths, and font behavior work the way they do.

What Is a Code Point?

A Unicode code point is an integer that serves as a unique identifier for an abstract character. The Unicode standard assigns a code point to every character it defines — letters, digits, punctuation, symbols, emoji, control characters, and much more.

Code points are written using U+ notation: the prefix U+ followed by a hexadecimal number with at least four digits. For example:

Character  Code Point  Name
A          U+0041      LATIN CAPITAL LETTER A
©          U+00A9      COPYRIGHT SIGN
€          U+20AC      EURO SIGN
☃          U+2603      SNOWMAN
😀         U+1F600     GRINNING FACE
𝄞          U+1D11E     MUSICAL SYMBOL G CLEF

The U+ prefix is not a programming construct — it is a notation convention defined by the Unicode Consortium to unambiguously identify code points in documentation, specifications, and conversation.

The Range: U+0000 to U+10FFFF

Unicode's code space spans from U+0000 to U+10FFFF, giving a total of 1,114,112 possible code points. This range is divided into 17 planes, each containing 65,536 code points:

Plane  Range             Name                                 Content
0      U+0000–U+FFFF     Basic Multilingual Plane (BMP)       Most living scripts, common symbols
1      U+10000–U+1FFFF   Supplementary Multilingual Plane     Historic scripts, emoji, musical notation
2      U+20000–U+2FFFF   Supplementary Ideographic Plane      CJK unified ideographs extension
3–13   U+30000–U+DFFFF                                        Mostly unassigned
14     U+E0000–U+EFFFF   Supplementary Special-purpose Plane  Tags, variation selectors
15–16  U+F0000–U+10FFFF  Private Use Planes                   User-defined characters

As of Unicode 15.1, 149,813 characters have been assigned code points. The remaining code points are either unassigned (reserved for future characters) or designated for special purposes.
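Because every plane holds exactly 65,536 (0x10000) code points, the plane number can be read straight off the code point. The following is a minimal sketch; the helper name plane_of is hypothetical and not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane index is the
    # code point divided by 0x10000, i.e. a 16-bit right shift.
    return code_point >> 16


print(plane_of(0x0041))    # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))   # 1  (Supplementary Multilingual Plane)
print(plane_of(0x10FFFF))  # 16 (second private use plane)
print(17 * 0x10000)        # 1114112, the size of the whole code space
```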

How to Read Hex Values

Code points are expressed in hexadecimal (base 16). If you are not accustomed to hex, here is what you need to know:

  • Hex digits run 0–9 then A–F, where A=10, B=11, C=12, D=13, E=14, F=15
  • Each hex digit represents 4 bits
  • U+ values are zero-padded to at least 4 digits for BMP characters, and 5–6 digits for supplementary planes

Converting hex to decimal:

  • U+0041: 4×16 + 1 = 65 (decimal)
  • U+20AC: 2×4096 + 0×256 + A(10)×16 + C(12) = 8192 + 0 + 160 + 12 = 8,364 (decimal)
  • U+1F600: 1×65536 + F(15)×4096 + 6×256 + 0×16 + 0 = 65536 + 61440 + 1536 = 128,512 (decimal)

In practice, you rarely need to do this math manually. The U+ notation is the standard way to refer to code points.
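For readers who want to check the arithmetic, here is a small sketch of the positional expansion in Python. The function hex_to_decimal is a hypothetical helper written for illustration; in real code the built-in int(text, 16) does the same job.

```python
def hex_to_decimal(hex_digits: str) -> int:
    """Convert a hex string to an integer by positional expansion."""
    value = 0
    for digit in hex_digits:
        # int(digit, 16) maps '0'-'9' to 0-9 and 'A'-'F' to 10-15.
        # Multiplying the running total by 16 shifts it one hex place left.
        value = value * 16 + int(digit, 16)
    return value


for notation in ("U+0041", "U+20AC", "U+1F600"):
    digits = notation[2:]  # drop the "U+" prefix
    # The manual expansion agrees with Python's built-in conversion.
    assert hex_to_decimal(digits) == int(digits, 16)
    print(notation, "=", hex_to_decimal(digits))
```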

Assigned vs Unassigned Code Points

Not every integer in the 0–10FFFF range corresponds to a character. Code points fall into several categories:

Assigned characters: have a defined meaning, name, and category. These are the vast majority of code points you will encounter.

Unassigned: reserved for future use. Text containing an unassigned code point typically renders as a missing-glyph box or as nothing at all, depending on the renderer and font.

Surrogates (U+D800–U+DFFF): 2,048 code points that are permanently reserved for use in UTF-16 encoding mechanics. They are not valid Unicode characters, and well-formed UTF-8 text never contains them.

Private Use Area (PUA): three ranges (U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD) where applications can assign their own meanings. Icon fonts commonly use the BMP PUA range.

Noncharacters: 66 code points (U+FDD0–U+FDEF plus the last two code points of every plane, such as U+FFFE and U+FFFF) permanently reserved for internal use. They are valid in Unicode strings but are not intended for open interchange between systems.
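These categories can be approximated in Python with the standard unicodedata module, whose general categories Cs, Co, and Cn correspond to surrogates, private use, and unassigned code points. This is a rough sketch: classify is a hypothetical helper, and which code points count as "unassigned" depends on the Unicode version bundled with your Python interpreter.

```python
import unicodedata


def classify(code_point: int) -> str:
    """Roughly classify a code point into the categories described above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(code_point))
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    if category == "Cn":
        return "unassigned"
    return "assigned"


for cp in (0x0041, 0xD800, 0xE000, 0xFFFE, 0x1F600):
    print(f"U+{cp:04X}: {classify(cp)}")
```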

Code Point vs Character vs Glyph

These three terms are often used interchangeably, but they describe different things:

  • A code point is a number in the Unicode code space. It is just an integer.
  • A character is the abstract identity that a code point represents — the concept of "the letter A" or "the euro sign."
  • A glyph is the actual visual shape that a font draws for a character. The same character can look different across fonts.

One code point usually corresponds to one character, but not always. Some characters require multiple code points in sequence (called combining character sequences or emoji sequences). A single glyph can also be produced from multiple code points — for example, an emoji with a skin tone modifier.
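A short Python sketch makes the distinction concrete. Python strings are sequences of code points, so len() counts code points rather than what a reader perceives as characters. The sample strings and the samples dictionary are chosen here purely for illustration.

```python
import unicodedata

samples = {
    "precomposed e-acute": "\u00E9",                      # one code point
    "decomposed e-acute": "e\u0301",                      # letter + COMBINING ACUTE ACCENT
    "thumbs up, skin tone": "\U0001F44D\U0001F3FD",       # base emoji + modifier
}

for label, text in samples.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {len(text)} code point(s): {code_points}")

# The two spellings of e-acute are canonically equivalent: NFC normalization
# composes the two-code-point form into the single code point U+00E9.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00E9"
```

All three samples normally display as a single glyph, even though two of them contain two code points.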

Finding Code Points in Code

Python

Python's ord() function returns the code point integer for a single character, and chr() does the reverse:

>>> ord('A')
65
>>> hex(ord('A'))
'0x41'
>>> ord('€')
8364
>>> hex(ord('€'))
'0x20ac'
>>> chr(0x1F600)
'😀'

>>> import unicodedata
>>> unicodedata.name('€')        # official Unicode name
'EURO SIGN'
>>> unicodedata.category('A')    # general category; Lu = Uppercase Letter
'Lu'

To get the U+ notation string:

def to_u_plus(char: str) -> str:
    return f"U+{ord(char):04X}"

to_u_plus('A')     # 'U+0041'
to_u_plus('😀')   # 'U+1F600'

JavaScript

JavaScript uses UTF-16 internally, which means supplementary plane characters (above U+FFFF) are represented as surrogate pairs. Use codePointAt() rather than charCodeAt() to get the correct code point:

// BMP character — both methods agree
'A'.charCodeAt(0)       // 65
'A'.codePointAt(0)      // 65

// Supplementary plane emoji
'😀'.charCodeAt(0)      // 55357  (high surrogate, wrong)
'😀'.codePointAt(0)     // 128512 (correct)

// U+ notation
function toUPlus(char) {
  return 'U+' + char.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}
toUPlus('A')    // 'U+0041'
toUPlus('😀')  // 'U+1F600'

// Iterate by code point (not code unit)
for (const char of '😀 café') {
  console.log(toUPlus(char), char);
}
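The value 55357 reported by charCodeAt(0) is the high half of a surrogate pair. The arithmetic that produces the pair can be sketched in Python as follows; to_surrogate_pair is a hypothetical helper written for this illustration, not a standard library function.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
print(high)                 # 55357

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```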

Browser DevTools

In the browser console, '😀'.codePointAt(0).toString(16) gives you the hex value. For quick lookups without code, the SymbolFYI Unicode Lookup tool accepts any character or U+ value and returns the full Unicode property record.

What the Unicode Name Tells You

Almost every assigned character has an official Unicode name, written in all caps (control characters are the main exception). The name is a stable, unambiguous identifier that never changes once assigned, a rule known as the name stability guarantee:

U+0041  LATIN CAPITAL LETTER A
U+00E9  LATIN SMALL LETTER E WITH ACUTE
U+20AC  EURO SIGN
U+1F600 GRINNING FACE
U+FFFD  REPLACEMENT CHARACTER

Names follow conventions:

  • Letters are described by script, case, and any modifiers
  • Symbols describe their visual appearance or semantic use
  • Emoji names describe what they depict (as of the time of encoding)
  • Control characters have no formal name, but carry functional aliases such as LINE FEED

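When a character's name was assigned incorrectly (the Unicode Consortium has made mistakes), it cannot be corrected in the Name field. Instead, a formal name alias is added. For example, U+FE18 is permanently named "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET" (misspelling included), and its corrected alias is "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET."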
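Python's unicodedata module reflects this split between names and aliases. As a sketch, using U+FE18 (a character whose formal name contains a well-known misspelling): name() returns the frozen original name, while lookup() also accepts the corrected alias in Python 3.3 and later.

```python
import unicodedata

char = "\uFE18"

# name() returns the original, permanently misspelled name ("BRAKCET").
print(unicodedata.name(char))

# lookup() accepts either the formal name or its corrected alias.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
assert unicodedata.lookup(corrected) == char
assert unicodedata.lookup(unicodedata.name(char)) == char
```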

Code Point Notation in Different Contexts

The U+ notation is used in Unicode documentation and general discussion. In code and markup, you reference code points differently:

Context                     Syntax                Example for U+20AC
Unicode notation            U+XXXX                U+20AC
HTML decimal                &#N;                  &#8364;
HTML hex                    &#xXXXX;              &#x20AC;
HTML named                  &name;                &euro;
CSS                         \XXXX                 \20AC
JavaScript string           \uXXXX                \u20AC
JavaScript (supplementary)  \u{XXXXX}             \u{1F600} (for U+1F600)
Python string               \uXXXX or \UXXXXXXXX  \u20AC
XML/SVG                     &#xXXXX;              &#x20AC;

Note that CSS escapes require a trailing space or delimiter when the next character is a valid hex digit, to avoid ambiguity.
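As an illustration, the sketch below generates several of these spellings for one code point and checks that the HTML and Python forms decode back to the same character. The escape_forms helper and its dictionary keys are hypothetical names chosen for this example.

```python
import html


def escape_forms(code_point: int) -> dict:
    """Spell one code point in several of the notations listed above."""
    hex_digits = f"{code_point:X}"
    return {
        "unicode": f"U+{code_point:04X}",
        "html_decimal": f"&#{code_point};",
        "html_hex": f"&#x{hex_digits};",
        "css": f"\\{hex_digits} ",             # trailing space ends the escape
        "javascript": f"\\u{{{hex_digits}}}",  # \u{...} form works for any plane
        "python": f"\\U{code_point:08X}",      # 8-digit form works for any plane
    }


forms = escape_forms(0x20AC)
for context, text in forms.items():
    print(f"{context:13} {text}")

# Round trips: the HTML references and the Python escape all decode to U+20AC.
assert html.unescape(forms["html_decimal"]) == "\u20AC"
assert html.unescape(forms["html_hex"]) == "\u20AC"
assert html.unescape("&euro;") == "\u20AC"
assert forms["python"].encode("ascii").decode("unicode_escape") == "\u20AC"
```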

Practical Implications

Understanding code points matters in these common scenarios:

String operations: When you slice a string by index, you may cut in the middle of a multi-code-point sequence (like an emoji with a skin tone modifier). Knowing about code points helps you understand why this produces garbled output.
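A brief Python sketch of the slicing problem follows. Python indexes strings by code point, so a slice can separate a base character from the combining mark or modifier that follows it. The variable names are illustrative only.

```python
word = "cafe\u0301"                # "café" spelled with a combining acute accent
thumbs = "\U0001F44D\U0001F3FD"    # thumbs up followed by a skin tone modifier

print(len(word))    # 5 code points, although it reads as 4 letters
print(word[:4])     # "cafe": the accent has been sliced off
print(thumbs[:1])   # a plain thumbs up: the modifier is lost

assert word[:4] == "cafe"
assert thumbs[:1] == "\U0001F44D"
```

Splitting safely requires segmenting by grapheme cluster rather than by code point; the third-party regex package, for instance, offers a \X pattern for that purpose.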

Regular expressions: Regex character classes like \w or . may not match supplementary plane characters correctly in every language without special flags. In JavaScript, for example, . matches a single UTF-16 code unit unless the u flag is set.

Sorting and comparison: Unicode provides the Unicode Collation Algorithm (UCA) for language-aware sorting. Naively comparing code point integers gives ASCII-like ordering that does not match natural language alphabetization.
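The effect is easy to reproduce in Python, where the default string comparison is by code point. The word list below is made up for illustration.

```python
words = ["zebra", "Apple", "éclair", "apple", "Zulu"]

# Default sort compares code points: all uppercase ASCII letters come
# first, and "é" (U+00E9) sorts after "z" (U+007A).
by_code_point = sorted(words)
print(by_code_point)   # ['Apple', 'Zulu', 'apple', 'zebra', 'éclair']

# Case folding fixes the upper/lower split but still leaves "éclair" last.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)     # ['Apple', 'apple', 'zebra', 'Zulu', 'éclair']
```

For genuinely language-aware ordering, a collation implementation is needed, such as the standard library's locale.strxfrm (which depends on the locales installed on the system) or an ICU binding.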

Font rendering: A font may only cover a subset of Unicode code points. When a code point has no glyph in the active font, the system falls back to another font or renders a replacement box (□) or question mark.

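Database storage: MySQL's utf8 charset (an alias for utf8mb3) stores at most three bytes per character, so it only supports BMP characters (up to U+FFFF). To store emoji and other supplementary plane characters, you must use utf8mb4. PostgreSQL's text type, in a database that uses the UTF8 encoding, handles every Unicode character except U+0000 (NUL).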
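The three-byte limit follows from how UTF-8 encodes code points: BMP characters need at most three bytes, while supplementary plane characters need four. A pre-flight check could be sketched like this; utf8_length and fits_three_byte_utf8 are hypothetical helper names, not part of any database driver.

```python
def utf8_length(char: str) -> int:
    """Number of bytes the given text occupies when encoded as UTF-8."""
    return len(char.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    """True if every code point is in the BMP (at most 3 bytes in UTF-8)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


for ch in ("A", "\u20AC", "\U0001F600"):
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

print(fits_three_byte_utf8("caf\u00E9 \u20AC5"))      # True
print(fits_three_byte_utf8("great \U0001F600"))       # False: needs utf8mb4
```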

Use the SymbolFYI Unicode Lookup tool to explore any character's code point, Unicode name, block, category, and encoding representations in one place.
