Code Points vs. Characters in Programming
The distinction between a Unicode code point and the "char" type in programming languages is one of the most common sources of Unicode-related bugs. A code point is an abstract integer assigned to each Unicode character; how that integer is stored in memory depends on the encoding used, and most languages' char types do not correspond to a single code point for all Unicode characters.
Unicode Code Points
A Unicode code point is an integer in the range U+0000 to U+10FFFF. The Unicode standard assigns a code point to every character in its repertoire: U+0041 is LATIN CAPITAL LETTER A, U+2603 is SNOWMAN, U+1F600 is GRINNING FACE. Code points are abstract—they have no inherent byte representation.
U+0041 → A (1 byte in UTF-8: 0x41)
U+00E9 → é (2 bytes in UTF-8: 0xC3 0xA9)
U+2603 → ☃ (3 bytes in UTF-8: 0xE2 0x98 0x83)
U+1F600 → 😀 (4 bytes in UTF-8: 0xF0 0x9F 0x98 0x80)
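The byte counts above can be checked directly in Python, whose `str.encode("utf-8")` returns the raw byte sequence for each code point (a minimal sketch; `bytes.hex(sep)` requires Python 3.8+):

```python
# Encode each code point as UTF-8 and print its byte sequence.
for ch in "Aé☃😀":
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(utf8)} byte(s): {utf8.hex(' ').upper()}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): C3 A9
# U+2603 -> 3 byte(s): E2 98 83
# U+1F600 -> 4 byte(s): F0 9F 98 80
```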
Language-Specific char Types
C and C++
char is 1 byte. It cannot represent most Unicode characters directly. wchar_t is platform-dependent (2 bytes on Windows, where it holds UTF-16 code units and cannot store a non-BMP character in one unit; 4 bytes on Linux). Use char32_t (C++11) for guaranteed 1:1 code point storage, or char8_t (C++20) for UTF-8:
char a = 'A'; // OK: ASCII
char e = 'é'; // NOT OK: 'é' is 2 UTF-8 bytes; a multi-byte literal has an implementation-defined value
char32_t s = U'☃'; // OK: U+2603 stored as 32-bit integer
char32_t emoji = U'😀'; // OK: U+1F600
Java
Java's char is a 16-bit UTF-16 code unit. Characters outside the BMP (above U+FFFF) require two char values (a surrogate pair). int is used for full code point values:
char a = 'A'; // OK: U+0041
char snowman = '\u2603'; // OK: U+2603
// char emoji = '\uD83D'; // Only the high surrogate - WRONG
// Use int for code points
int emoji = "😀".codePointAt(0); // 128512 (0x1F600)
String back = String.valueOf(Character.toChars(emoji)); // "😀"
// Iterate code points, not chars
"😀A".codePoints().forEach(cp -> {
System.out.println(Integer.toHexString(cp)); // 1f600, then 41
});
Python 3
Python 3's str type is a sequence of code points (CPython picks a compact internal encoding per string under PEP 393, but this is invisible to Python code). len() counts code points, and indexing returns a one-code-point string:
text = '😀'
len(text) # 1 (one code point)
text[0] # '😀' (one-code-point string)
ord(text[0]) # 128512 (0x1F600)
chr(128512) # '😀'
Rust
Rust's char type represents a single Unicode scalar value (any code point except surrogates). String is a UTF-8 encoded byte buffer, so it cannot be indexed by character position; reaching the nth char requires an O(n) scan:
let c: char = '\u{1F600}'; // 😀 as a single char
let code: u32 = c as u32; // 128512
let back: char = char::from_u32(128512).unwrap();
let s = String::from("😀");
assert_eq!(s.chars().count(), 1); // 1 code point
assert_eq!(s.len(), 4); // 4 bytes
Grapheme Clusters: Above Code Points
Even code points are not the final level of abstraction. User-perceived characters (grapheme clusters) can consist of multiple code points:
é = U+00E9 (one code point, NFC form)
é = U+0065 + U+0301 (two code points: e + combining acute, NFD form)
👨👩👧👦 = U+1F468 + U+200D + U+1F469 + U+200D + U+1F467 + U+200D + U+1F466
(7 code points: 4 emoji + 3 ZWJ characters)
For user-facing string operations (length display, truncation, cursor movement), grapheme cluster segmentation is the correct abstraction.
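As a small sketch of the gap between code points and graphemes, the standard unicodedata module can show that the two forms of é above differ in code point count even though they render identically. (Full grapheme cluster segmentation is not in the Python standard library; a third-party package such as regex would be needed for that.)

```python
import unicodedata

# The same user-perceived character in both normalization forms:
nfc = unicodedata.normalize("NFC", "e\u0301")  # composed: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")   # decomposed: U+0065 U+0301
print(len(nfc))    # 1
print(len(nfd))    # 2
print(nfc == nfd)  # False: identical on screen, different code point sequences

# The family emoji: 7 code points, one grapheme cluster.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))  # 7
```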