What Is a Code Point?
A code point is the fundamental unit of the Unicode standard — a unique numerical value assigned to every character, symbol, control code, and abstract entity in the Unicode repertoire. Every code point is written in the format U+ followed by four to six hexadecimal digits, such as U+0041 for the Latin capital letter A, U+1F600 for the grinning face emoji, or U+200D for the Zero Width Joiner.
The Unicode standard defines a total space of 1,114,112 code points, ranging from U+0000 to U+10FFFF. Not all of these are assigned to characters — some are reserved for future use, some are designated as private-use, and some are surrogates used by the UTF-16 encoding.
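The figure of 1,114,112 follows from the layout of the code space as 17 planes of 65,536 code points each. The following is a minimal sketch of that arithmetic; the constant and helper names (`plane_of`, `is_surrogate`) are made up for illustration and do not come from any library.

```python
# Illustrative constants; the names are assumptions for this sketch.
PLANE_SIZE = 0x10000        # 65,536 code points per plane
PLANE_COUNT = 17            # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF

total = PLANE_SIZE * PLANE_COUNT
print(total)                        # 1114112
print(total == MAX_CODE_POINT + 1)  # True


def plane_of(cp: int) -> int:
    """Return the plane number (0-16) that contains the code point."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    return cp >> 16


def is_surrogate(cp: int) -> bool:
    """True for the range U+D800..U+DFFF reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


print(plane_of(0x0041))      # 0 (Basic Multilingual Plane)
print(plane_of(0x1F600))     # 1 (a supplementary plane)
print(is_surrogate(0xD83D))  # True
```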
Code Point Notation
By convention, code points up to and including U+FFFF (the Basic Multilingual Plane) are written with exactly four hex digits, padded with leading zeros where needed: U+0041. Code points in supplementary planes use five or six digits: U+1F600, U+10FFFF. The U+ prefix is always uppercase and there are no spaces within the notation.
In programming, you will often encounter code points expressed as decimal or hexadecimal integers, depending on the language and API:
# Python: get the code point of a character
code_point = ord('A') # 65 (decimal)
print(hex(code_point)) # '0x41'
print(f'U+{code_point:04X}') # 'U+0041'
# Convert a code point back to a character
char = chr(0x1F600) # returns the grinning face emoji
print(char) # prints the emoji
// JavaScript: get the code point of a character
const cp = '😀'.codePointAt(0); // 128512
console.log(cp.toString(16)); // '1f600'
console.log(`U+${cp.toString(16).toUpperCase().padStart(4, '0')}`); // 'U+1F600'
// Convert a code point back to a string
const char = String.fromCodePoint(0x1F600); // grinning face emoji
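Going the other way, from U+ notation back to an integer, might look like the sketch below. Both helpers (`parse_code_point`, `format_code_point`) are hypothetical names introduced here, and the parser is deliberately lenient: it accepts any four to six uppercase hex digits after the prefix, including a leading zero that the usual convention would omit.

```python
import re

# Hypothetical helpers for this sketch; not part of any standard library.
_NOTATION = re.compile(r"U\+([0-9A-F]{4,6})")


def parse_code_point(text: str) -> int:
    """Parse 'U+XXXX' notation (4 to 6 uppercase hex digits) into an int."""
    match = _NOTATION.fullmatch(text)
    if match is None:
        raise ValueError(f"not in U+ notation: {text!r}")
    cp = int(match.group(1), 16)
    if cp > 0x10FFFF:
        raise ValueError(f"beyond the Unicode code space: {text!r}")
    return cp


def format_code_point(cp: int) -> str:
    """Format an int as U+ notation, zero-padded to at least four digits."""
    return f"U+{cp:04X}"


print(parse_code_point("U+1F600"))   # 128512
print(format_code_point(0x41))       # U+0041
print(format_code_point(0x10FFFF))   # U+10FFFF
```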
Code Points vs. Characters vs. Glyphs
These three terms are often confused but represent distinct concepts:
- A code point is the abstract number assigned by Unicode.
- A character is the semantic entity the code point represents (a letter, digit, symbol).
- A glyph is the visual representation rendered on screen by a specific font.
One code point may map to multiple glyphs depending on context (e.g., Arabic letters change shape based on their position in a word), and one visible character as perceived by a user (a grapheme cluster) may require multiple code points — for instance, a base letter combined with a diacritical mark.
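The base-letter-plus-diacritic case can be observed directly in Python. This is a small sketch using the standard `unicodedata` module; note that normalization only composes sequences that have a precomposed form, and it is not a substitute for grapheme cluster segmentation, which the Python standard library does not provide.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one perceived character,
# two code points.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                          # 2
print(len(composed))                            # 1
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']
print(unicodedata.name(composed))               # LATIN SMALL LETTER E WITH ACUTE
print(decomposed == composed)                   # False
```

The last line shows why comparing user-visible text usually calls for normalizing both sides first: the two strings render identically but are different code point sequences.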
Assigned vs. Unassigned Code Points
As of Unicode 16.0, 154,998 characters are encoded. The remaining code points fall into categories such as unassigned (reserved for future use), private-use, noncharacters, and surrogates. Noncharacters (like U+FFFE and U+FFFF) are permanently reserved and will never be assigned to a character; they are intended for internal use within applications.
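These categories can be inspected from Python. The sketch below assumes the standard definition of the 66 noncharacters (U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes), which goes beyond the two examples named above; `is_noncharacter` is a hypothetical helper, not a library function.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count noncharacters across the whole code space (32 + 2 per plane).
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)  # 66

# General categories reported by the standard library:
print(unicodedata.category(chr(0xFFFE)))  # Cn (not assigned)
print(unicodedata.category(chr(0xD800)))  # Cs (surrogate)
print(unicodedata.category(chr(0xE000)))  # Co (private use)
```

The Unicode version behind `unicodedata` depends on the Python release, so the category of a recently assigned code point may differ between interpreters; the three shown here are stable.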
Practical Importance for Developers
Understanding code points is essential when working with string length calculations, text processing, and encoding. For example, in JavaScript, '😀'.length returns 2 because the emoji sits outside the Basic Multilingual Plane and is represented as a surrogate pair in UTF-16. Using [...'😀'].length (spread syntax, which iterates by code point) correctly returns 1. Similarly, in Python 3, strings are sequences of code points, so len('😀') returns 1 as expected.
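The difference between code points and UTF-16 code units can also be reproduced from Python by encoding the string explicitly. This is a rough sketch; `to_surrogate_pair` is a hypothetical helper that applies the standard UTF-16 arithmetic for supplementary code points.

```python
emoji = "\U0001F600"  # grinning face, U+1F600

# Python counts code points.
print(len(emoji))                           # 1

# UTF-16 uses two 16-bit code units here, matching JavaScript's .length.
print(len(emoji.encode("utf-16-le")) // 2)  # 2

# UTF-8 uses four bytes for the same code point.
print(len(emoji.encode("utf-8")))           # 4


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")              # D83D DE00
```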