String Length and Unicode
Calculating the "length" of a string is deceptively complex in Unicode-aware programming. What users perceive as a single character may be represented by multiple code points, and what languages report as string length often reflects internal encoding units rather than user-visible characters. This distinction causes real bugs in user-facing features like character counters, string truncation, and text validation.
The Four Levels of String Length
1. Byte Length
The number of bytes required to store the string in a given encoding:
# Python: byte length depends on encoding
text = '☃' # SNOWMAN (U+2603)
len(text.encode('utf-8')) # 3 bytes (0xE2 0x98 0x83)
len(text.encode('utf-16')) # 4 bytes (BOM + 2 code unit bytes)
len(text.encode('utf-32')) # 8 bytes (4-byte BOM + 4 bytes)
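The BOM inflates the counts returned by the generic 'utf-16' and 'utf-32' codecs. A minimal sketch of measuring the payload alone, assuming the explicit little-endian codecs are acceptable for measurement (byte_lengths is a hypothetical helper name, not part of any library):

```python
def byte_lengths(text: str) -> dict:
    """Return the encoded size of `text` in bytes, without a byte order mark."""
    # The explicit-endianness codecs ('utf-16-le', 'utf-32-le') never emit a BOM,
    # unlike the generic 'utf-16' and 'utf-32' codecs.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


print(byte_lengths("\u2603"))      # SNOWMAN: {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
print(byte_lengths("\U0001F600"))  # GRINNING FACE: {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

Which figure matters depends on where the bytes end up: a file written with a BOM needs the larger number, while a length-prefixed binary field usually needs the BOM-free one.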
2. Code Unit Length
The number of encoding units in the string's internal representation:
// JavaScript: UTF-16 code units
'A'.length // 1 (U+0041, 1 code unit)
'é'.length // 1 (U+00E9, 1 code unit)
'☃'.length // 1 (U+2603, 1 code unit)
'😀'.length // 2 (U+1F600, 2 code units: surrogate pair)
'𝄞'.length // 2 (U+1D11E MUSICAL SYMBOL G CLEF, 2 code units)
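When a Python backend has to agree with a JavaScript front end on length, the UTF-16 code unit count can be reproduced on the Python side. A small sketch, where utf16_code_units is a hypothetical helper name:

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's .length reports."""
    # Every UTF-16 code unit is 2 bytes, and 'utf-16-le' adds no BOM.
    # 'surrogatepass' lets lone surrogates through instead of raising.
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


print(utf16_code_units("A"))           # 1
print(utf16_code_units("\u2603"))      # 1 (SNOWMAN, inside the BMP)
print(utf16_code_units("\U0001F600"))  # 2 (surrogate pair)
```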
3. Code Point Length
The number of Unicode code points in the string:
# Python 3: len() counts code points
len('A') # 1
len('é') # 1
len('☃') # 1
len('😀') # 1 ← Python gets this right
len('👨\u200d👩\u200d👧\u200d👦') # 7 (4 emoji + 3 ZERO WIDTH JOINER characters, U+200D)
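Code point counts also depend on normalization: the same visible text can be stored precomposed or as a base letter plus combining marks. The sketch below uses the standard library's unicodedata module to show the difference; the sample strings are illustrative only.

```python
import unicodedata

composed = "caf\u00e9"     # é as one code point (U+00E9)
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT (U+0301)

print(len(composed))           # 4
print(len(decomposed))         # 5
print(composed == decomposed)  # False, although both render as "café"

# Normalizing to NFC composes the pair back into a single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfc), nfc == composed)  # 4 True
```

Normalizing input to one form (NFC is a common choice) before counting or comparing keeps code point lengths stable for text like this, though it does not collapse emoji ZWJ sequences.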
4. Grapheme Cluster Length
The number of user-perceived characters:
import grapheme # third-party package (pip install grapheme)
grapheme.length('👨\u200d👩\u200d👧\u200d👦') # 1 (one family emoji)
grapheme.length('café') # 4 (c, a, f, é as composed character)
grapheme.length('e\u0301') # 1 (e + combining acute = one grapheme)
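To make the idea of clustering concrete, here is a deliberately rough approximation using only the standard library. It is not the Unicode segmentation algorithm (UAX #29): it only attaches combining marks and ZWJ-joined code points to the preceding character, and it ignores cases such as flag sequences, skin-tone modifiers, and Hangul jamo. The name approx_grapheme_count is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def approx_grapheme_count(text: str) -> int:
    """Roughly count user-perceived characters (approximation, not UAX #29)."""
    count = 0
    after_zwj = False  # True when the previous code point was a ZWJ
    for ch in text:
        # Combining marks (Mn, Me, Mc) and the ZWJ extend the current cluster.
        extends = ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc")
        # A code point right after a ZWJ is treated as part of the same cluster.
        if not extends and not after_zwj:
            count += 1
        after_zwj = ch == ZWJ
    return count


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(approx_grapheme_count(family))      # 1
print(approx_grapheme_count("e\u0301"))   # 1
print(approx_grapheme_count("caf\u00e9")) # 4
```

For production use, a library that implements the full segmentation rules and tracks current Unicode data is the safer choice.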
Why This Matters: Real Bugs
Twitter/X Character Counter
A 280-character limit should count grapheme clusters, since those are what users perceive as characters. A naive implementation based on UTF-16 code units would:
- Over-count tweets with many emoji (each emoji is 1 perceived character, but at least 2 JS code units)
- Mis-truncate text by cutting in the middle of a surrogate pair
String Truncation
// DANGEROUS: may cut in middle of surrogate pair
function truncateDangerous(str, maxLen) {
return str.slice(0, maxLen); // Slices code units
}
truncateDangerous('AB😀CD', 3) // 'AB\uD83D' — broken emoji!
// SAFER: Array.from iterates code points, so surrogate pairs stay intact
// (grapheme clusters such as ZWJ sequences can still be split)
function truncateSafe(str, maxLen) {
return Array.from(str).slice(0, maxLen).join('');
}
truncateSafe('AB😀CD', 3) // 'AB😀' — correct
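The same hazard appears when truncating to a byte budget, for example to fit a fixed-size field. A minimal Python sketch, assuming the target encoding is UTF-8 (truncate_utf8 is a hypothetical helper name):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate `text` to at most `max_bytes` of UTF-8 without splitting a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing the bytes may leave an incomplete multi-byte sequence at the end.
    # errors="ignore" drops that partial tail instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


sample = "AB\U0001F600CD"        # 'AB😀CD': 2 + 4 + 2 = 8 bytes in UTF-8
print(truncate_utf8(sample, 3))  # 'AB'  (the emoji does not fit, so it is dropped whole)
print(truncate_utf8(sample, 6))  # 'AB😀'
```

Like the Array.from approach, this keeps code points whole but can still split a grapheme cluster, so a combining mark or ZWJ sequence at the boundary may be cut.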
Database Storage
# PostgreSQL VARCHAR(10) counts characters (code points)
# MySQL VARCHAR(10) with utf8mb4 also counts characters
# Byte limits still matter for some columns:
text = '😀' * 10 # 10 emoji
len(text) # 10 code points
len(text.encode('utf-8')) # 40 bytes (4 bytes each)
# Would overflow an Oracle-style CHAR(10 BYTE) column but not CHAR(10 CHAR)
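An application can validate input against both kinds of limit before it reaches the database. The sketch below assumes the column stores UTF-8 and that character semantics mean code points; fits_column and its unit parameter are hypothetical names, not a database driver API.

```python
def fits_column(text: str, limit: int, unit: str = "char") -> bool:
    """Check whether `text` fits a column limited to `limit` chars or bytes."""
    if unit == "char":
        # Character semantics: count code points.
        return len(text) <= limit
    if unit == "byte":
        # Byte semantics: count encoded bytes (UTF-8 storage is an assumption).
        return len(text.encode("utf-8")) <= limit
    raise ValueError(f"unknown unit: {unit!r}")


emoji_text = "\U0001F600" * 10  # 10 code points, 40 bytes in UTF-8
print(fits_column(emoji_text, 10, "char"))  # True
print(fits_column(emoji_text, 10, "byte"))  # False
```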
Correct Length in Each Language
# Python 3: len() = code points (usually what you want)
text = '😀café' # é is the precomposed U+00E9
len(text) # 5 code points
len(text.encode('utf-8')) # 9 bytes (4 + 1 + 1 + 1 + 2)
import grapheme; grapheme.length(text) # 5 grapheme clusters
// JavaScript: use spread (or Array.from) for code points, Intl.Segmenter for graphemes
const text = '😀café';
text.length; // 6 (UTF-16 code units)
[...text].length; // 5 (code points)
[...new Intl.Segmenter().segment(text)].length; // 5 (grapheme clusters)
// Go: len() = bytes, utf8.RuneCountInString() = code points (requires import "unicode/utf8")
text := "😀café"
len(text) // 9 (bytes)
utf8.RuneCountInString(text) // 5 (code points)
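When debugging a mismatch between systems, it can help to print several measures side by side. A small sketch using only the standard library, with é taken to be the precomposed U+00E9 (length_report is a hypothetical helper name):

```python
def length_report(text: str) -> dict:
    """Report the length of `text` at the byte, code unit, and code point levels."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),                 # what Go's len() counts
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,  # what JS .length counts
        "code_points": len(text),                                # what Python's len() counts
    }


print(length_report("\U0001F600caf\u00e9"))
# {'utf8_bytes': 9, 'utf16_code_units': 6, 'code_points': 5}
```

Grapheme clusters are left out because counting them correctly needs a segmentation library rather than the standard library alone.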