Code Points vs. Characters in Programming
The distinction between a Unicode code point and the "char" type in programming languages is one of the most common sources of Unicode-related bugs. A code point is an abstract integer assigned to each Unicode character; how that integer is stored in memory depends on the encoding used, and most languages' char types do not correspond to a single code point for all Unicode characters.
Unicode Code Points
A Unicode code point is an integer in the range U+0000 to U+10FFFF. The Unicode standard assigns a code point to every character in its repertoire: U+0041 is LATIN CAPITAL LETTER A, U+2603 is SNOWMAN, U+1F600 is GRINNING FACE. Code points are abstract—they have no inherent byte representation.
U+0041 → A (1 byte in UTF-8: 0x41)
U+00E9 → é (2 bytes in UTF-8: 0xC3 0xA9)
U+2603 → ☃ (3 bytes in UTF-8: 0xE2 0x98 0x83)
U+1F600 → 😀 (4 bytes in UTF-8: 0xF0 0x9F 0x98 0x80)
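The byte counts above can be checked directly in Python, whose `str.encode("utf-8")` returns the raw byte sequence for each code point (a minimal sketch; `bytes.hex(sep)` requires Python 3.8+):

```python
# Encode each code point as UTF-8 and print its byte sequence.
for ch in "Aé☃😀":
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(utf8)} byte(s): {utf8.hex(' ').upper()}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): C3 A9
# U+2603 -> 3 byte(s): E2 98 83
# U+1F600 -> 4 byte(s): F0 9F 98 80
```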
Language-Specific char Types
C and C++
char is 1 byte. It cannot represent most Unicode characters directly. wchar_t is platform-dependent (2 bytes on Windows, where it holds UTF-16 code units and cannot store a non-BMP character in one unit; 4 bytes on Linux). Use char32_t (C++11) for guaranteed 1:1 code point storage, or char8_t (C++20) for UTF-8:
char a = 'A'; // OK: ASCII
char e = 'é'; // NOT OK: 'é' is 2 UTF-8 bytes; a multi-byte literal has an implementation-defined value
char32_t s = U'☃'; // OK: U+2603 stored as 32-bit integer
char32_t emoji = U'😀'; // OK: U+1F600
Java
Java's char is a 16-bit UTF-16 code unit. Characters outside the BMP (above U+FFFF) require two char values (a surrogate pair). int is used for full code point values:
char a = 'A'; // OK: U+0041
char snowman = '\u2603'; // OK: U+2603
// char emoji = '\uD83D'; // Only the high surrogate - WRONG
// Use int for code points
int emoji = "😀".codePointAt(0); // 128512 (0x1F600)
String back = String.valueOf(Character.toChars(emoji)); // "😀"
// Iterate code points, not chars
"😀A".codePoints().forEach(cp -> {
System.out.println(Integer.toHexString(cp)); // 1f600, then 41
});
Python 3
Python 3's str type is a sequence of code points (CPython picks a compact internal encoding per string under PEP 393, but this is invisible to Python code). len() counts code points, and indexing returns a one-code-point string:
text = '😀'
len(text) # 1 (one code point)
text[0] # '😀' (one-code-point string)
ord(text[0]) # 128512 (0x1F600)
chr(128512) # '😀'
Rust
Rust's char type represents a single Unicode scalar value (any code point except surrogates). String is a UTF-8 encoded byte buffer, so it cannot be indexed by character position; reaching the nth char requires an O(n) scan:
let c: char = '\u{1F600}'; // 😀 as a single char
let code: u32 = c as u32; // 128512
let back: char = char::from_u32(128512).unwrap();
let s = String::from("😀");
assert_eq!(s.chars().count(), 1); // 1 code point
assert_eq!(s.len(), 4); // 4 bytes
Grapheme Clusters: Above Code Points
Even code points are not the final level of abstraction. User-perceived characters (grapheme clusters) can consist of multiple code points:
é = U+00E9 (one code point, NFC form)
é = U+0065 + U+0301 (two code points: e + combining acute, NFD form)
👨👩👧👦 = U+1F468 + U+200D + U+1F469 + U+200D + U+1F467 + U+200D + U+1F466
(7 code points: 4 emoji + 3 ZWJ characters)
For user-facing string operations (length display, truncation, cursor movement), grapheme cluster segmentation is the correct abstraction.
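As a small sketch of the gap between code points and graphemes, the standard unicodedata module can show that the two forms of é above differ in code point count even though they render identically. (Full grapheme cluster segmentation is not in the Python standard library; a third-party package such as regex would be needed for that.)

```python
import unicodedata

# The same user-perceived character in both normalization forms:
nfc = unicodedata.normalize("NFC", "e\u0301")  # composed: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")   # decomposed: U+0065 U+0301
print(len(nfc))    # 1
print(len(nfd))    # 2
print(nfc == nfd)  # False: identical on screen, different code point sequences

# The family emoji: 7 code points, one grapheme cluster.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))  # 7
```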