SymbolFYI

String Length vs Character Count


Definition

Why str.length in JavaScript returns UTF-16 code units, not visual characters — and how to count graphemes correctly.

String Length and Unicode

Calculating the "length" of a string is deceptively complex in Unicode-aware programming. What users perceive as a single character may be represented by multiple code points, and what languages report as string length often reflects internal encoding units rather than user-visible characters. This distinction causes real bugs in user-facing features like character counters, string truncation, and text validation.

The Four Levels of String Length

1. Byte Length

The number of bytes required to store the string in a given encoding:

# Python: byte length depends on encoding
text = '☃'  # SNOWMAN (U+2603)
len(text.encode('utf-8'))   # 3 bytes (0xE2 0x98 0x83)
len(text.encode('utf-16'))  # 4 bytes (BOM + 2 code unit bytes)
len(text.encode('utf-32'))  # 8 bytes (4-byte BOM + 4 code unit bytes)
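The same measurement can be sketched in JavaScript, assuming a runtime that provides the standard TextEncoder global (modern browsers and Node.js). The helper names utf8ByteLength and utf16ByteLength are illustrative, not part of any library. Unlike the Python figures above, these counts exclude the byte order mark.

```javascript
// Bytes needed to store a string as UTF-8 (no BOM).
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// Bytes needed to store a string as UTF-16 (no BOM).
// str.length counts 16-bit code units, so each one is 2 bytes.
function utf16ByteLength(str) {
  return str.length * 2;
}

const snowman = '\u2603';   // SNOWMAN
const grinning = '\u{1F600}'; // GRINNING FACE

console.log(utf8ByteLength(snowman));   // 3
console.log(utf16ByteLength(snowman));  // 2
console.log(utf8ByteLength(grinning));  // 4
console.log(utf16ByteLength(grinning)); // 4
```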

2. Code Unit Length

The number of encoding units in the string's internal representation:

// JavaScript: UTF-16 code units
'A'.length      // 1 (U+0041, 1 code unit)
'é'.length      // 1 (U+00E9, 1 code unit)
'☃'.length      // 1 (U+2603, 1 code unit)
'😀'.length     // 2 (U+1F600, 2 code units: surrogate pair)
'𝄞'.length     // 2 (U+1D11E MUSICAL SYMBOL G CLEF, 2 code units)

3. Code Point Length

The number of Unicode code points in the string (for well-formed text, the same as the number of scalar values):

# Python 3: len() counts code points
len('A')     # 1
len('é')     # 1
len('☃')     # 1
len('😀')    # 1  ← Python gets this right
len('👨‍👩‍👧‍👦')  # 7 (4 emoji + 3 ZWJ characters)

4. Grapheme Cluster Length

The number of user-perceived characters:

import grapheme
grapheme.length('👨‍👩‍👧‍👦')  # 1 (one family emoji)
grapheme.length('café')       # 4 (c, a, f, é as composed character)
grapheme.length('e\u0301')    # 1 (e + combining acute = one grapheme)
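A rough JavaScript counterpart, assuming a runtime with Intl.Segmenter (recent browsers and Node.js 16 or later). The function name countGraphemes is made up for this sketch. It also shows that composed and decomposed spellings of "café" differ in code points but agree in grapheme clusters.

```javascript
// Count user-perceived characters (extended grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

const composed = 'caf\u00E9';    // é as a single code point
const decomposed = 'cafe\u0301'; // e followed by COMBINING ACUTE ACCENT
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log([...composed].length);       // 4 code points
console.log([...decomposed].length);     // 5 code points
console.log(countGraphemes(composed));   // 4
console.log(countGraphemes(decomposed)); // 4
console.log(countGraphemes(family));     // 1

// NFC normalization folds the decomposed form into the composed one.
console.log(decomposed.normalize('NFC') === composed); // true
```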

Why This Matters: Real Bugs

Twitter/X Character Counter

A 280-character limit should count grapheme clusters. A naive implementation based on str.length would:
- Over-count tweets with many emoji (each emoji is 1 perceived character but 2 JS code units), rejecting text that is still under the limit
- Mis-truncate text by cutting in the middle of a surrogate pair
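A minimal sketch of a grapheme-based counter, assuming Intl.Segmenter is available. The names LIMIT and remainingCharacters are hypothetical, and this is not a description of how any particular platform actually counts.

```javascript
const LIMIT = 280; // illustrative limit taken from the example above

// Characters left before the limit, measured in grapheme clusters.
function remainingCharacters(text, limit = LIMIT) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const used = [...segmenter.segment(text)].length;
  return limit - used;
}

const draft = 'Hi ' + '\u{1F600}'.repeat(10); // 3 ASCII characters + 10 emoji

console.log(draft.length);               // 23 UTF-16 code units
console.log(LIMIT - draft.length);       // 257 (naive count, too pessimistic)
console.log(remainingCharacters(draft)); // 267 (13 perceived characters used)
```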

String Truncation

// DANGEROUS: may cut in middle of surrogate pair
function truncateDangerous(str, maxLen) {
  return str.slice(0, maxLen);  // Slices code units
}

truncateDangerous('AB😀CD', 3)  // 'AB\uD83D' — broken emoji!

// SAFER: Array.from iterates code points, so surrogate pairs stay intact
// (multi-code-point graphemes such as ZWJ emoji can still be split)
function truncateSafe(str, maxLen) {
  return Array.from(str).slice(0, maxLen).join('');
}

truncateSafe('AB😀CD', 3)  // 'AB😀' — correct
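Code point slicing keeps surrogate pairs intact but can still break a multi-code-point grapheme such as a ZWJ family emoji. One possible way to truncate on grapheme boundaries, assuming Intl.Segmenter is available (truncateGraphemes is an illustrative name):

```javascript
// Keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count >= maxGraphemes) break;
    result += segment;
    count++;
  }
  return result;
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const input = 'A' + family + 'B';

// Code point slicing keeps only the first member of the family.
console.log(Array.from(input).slice(0, 2).join('') === 'A\u{1F468}'); // true

// Grapheme slicing keeps the whole family emoji.
console.log(truncateGraphemes(input, 2) === 'A' + family); // true
```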

Database Storage

# PostgreSQL VARCHAR(10) counts characters (code points)
# MySQL VARCHAR(10) with utf8mb4 also counts characters
# Byte limits still matter for byte-sized columns and row-size caps:

text = '😀' * 10  # 10 emoji
len(text)              # 10 code points
len(text.encode('utf-8'))  # 40 bytes (4 bytes each)
# Would overflow a CHAR(10 BYTE) column (Oracle syntax) but fit in CHAR(10 CHAR)
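When a column or protocol imposes a byte limit, application code can trim to that budget before writing. The sketch below assumes UTF-8 storage and the standard TextEncoder global; truncateToUtf8Bytes is a hypothetical helper. It never splits a code point, though it may still split a grapheme cluster.

```javascript
// Keep as many whole code points as fit in maxBytes of UTF-8.
function truncateToUtf8Bytes(str, maxBytes) {
  const encoder = new TextEncoder();
  let result = '';
  let used = 0;
  for (const ch of str) { // string iteration yields whole code points
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    result += ch;
    used += size;
  }
  return result;
}

const emoji = '\u{1F600}'.repeat(10);

console.log(new TextEncoder().encode(emoji).length);      // 40 bytes
console.log([...truncateToUtf8Bytes(emoji, 10)].length);  // 2 emoji (8 bytes)
console.log(truncateToUtf8Bytes('caf\u00E9', 4));         // 'caf' (é needs 2 bytes)
```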

Correct Length in Each Language

# Python 3: len() = code points (close to what you want, but not graphemes)
text = '😀café'  # é is the precomposed U+00E9
len(text)                          # 5 code points
len(text.encode('utf-8'))          # 9 bytes
import grapheme; grapheme.length(text)  # 5 grapheme clusters

// JavaScript: spread or Array.from for code points, Intl.Segmenter for graphemes
const text = '😀café';
text.length;                         // 6 (UTF-16 code units)
[...text].length;                    // 5 (code points)
[...new Intl.Segmenter().segment(text)].length;  // 5 (grapheme clusters)

// Go: len() = bytes, utf8.RuneCountInString() = code points (import "unicode/utf8")
text := "😀café"
len(text)                             // 9 (bytes)
utf8.RuneCountInString(text)          // 5 (code points)

Related Terms