
How to Use the SymbolFYI String Length Calculator

Tools Guides September 2, 2025

"How long is this string?" is a question with at least five different correct answers depending on who you ask. A user, a Python script, a JavaScript function, a MySQL database, and a Swift application will each give you a different number for the same string — and they're all right within their own definition of length. The SymbolFYI String Length Calculator shows you all five measurements at once, explains what each one means, and helps you decide which measurement is appropriate for your context.

What the String Length Calculator Does

Paste any text into the calculator and it immediately computes and displays:

  1. Grapheme clusters — the user-perceived character count
  2. Unicode code points — the Unicode-standard unit count
  3. UTF-8 bytes — the byte count in the dominant web encoding
  4. UTF-16 code units — the unit count in JavaScript and Java strings
  5. ASCII-safe length — the length after percent-encoding non-ASCII characters (relevant for URLs and some APIs)

These five numbers are the same for simple ASCII text. They diverge as soon as the string contains accented characters, emoji, CJK characters, combining marks, or other non-ASCII content. The calculator shows the divergence clearly and includes a color-coded indicator when measurements disagree significantly.

For any string where the numbers disagree, the calculator provides an expandable breakdown showing exactly which characters are contributing extra bytes, extra code points, or extra code units beyond what the grapheme count would suggest.
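As a rough illustration of how the five numbers relate, here is a minimal sketch that computes them in JavaScript. The function name measureString is hypothetical and this is not the calculator's actual implementation; it assumes a runtime with Intl.Segmenter and TextEncoder (modern browsers, recent Node.js).

```javascript
// Sketch only: measureString is an illustrative name, not the calculator's code.
function measureString(text) {
  // Grapheme clusters (UAX #29) via the built-in Intl.Segmenter.
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const graphemes = [...segmenter.segment(text)].length;

  // The string iterator yields one item per code point.
  const codePoints = [...text].length;

  // TextEncoder always produces UTF-8.
  const bytes = new TextEncoder().encode(text);
  const utf8Bytes = bytes.length;

  // String.prototype.length counts UTF-16 code units.
  const utf16Units = text.length;

  // ASCII bytes stay as 1 character; every other byte becomes %XX (3 characters).
  let percentEncoded = 0;
  for (const byte of bytes) {
    percentEncoded += byte < 0x80 ? 1 : 3;
  }

  return { graphemes, codePoints, utf8Bytes, utf16Units, percentEncoded };
}

console.log(measureString("hello"));
console.log(measureString("\u{1F600}")); // 😀
```

For plain ASCII all five fields agree; for a single emoji such as 😀 they already differ.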

The Five Length Measurements Explained

1. Grapheme Clusters

A grapheme cluster is what a human perceives as a single character — the unit that corresponds to one cursor movement in a text editor, one backspace keypress, and one visual unit on screen. The grapheme cluster count is the closest thing to "characters as users understand them."

The Unicode standard defines grapheme cluster boundaries in Unicode Annex #29. The key rules:

  • A base character followed by one or more combining marks counts as one grapheme cluster (even though it's multiple code points)
  • An emoji followed by a variation selector counts as one grapheme cluster
  • A pair of regional indicator symbols forming a flag emoji counts as one grapheme cluster (two code points)
  • An emoji ZWJ sequence (joined with U+200D) counts as one grapheme cluster (though it may be many code points)

Use grapheme clusters when: displaying a character count to users, checking if a name fits in a UI field, measuring what a person types, truncating text at a user-visible character boundary.

Languages and libraries that count this way natively: Swift (str.count), Raku (formerly Perl 6), and the ICU library.

Example divergence: The string 👨‍👩‍👧‍👦 (family emoji) is 1 grapheme, 7 code points (4 emoji + 3 ZWJs), and 25 UTF-8 bytes.
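The boundary rules above can be checked with JavaScript's built-in Intl.Segmenter. This is a small sketch, assuming a runtime that supports Intl.Segmenter; countGraphemes and the samples object are illustrative names.

```javascript
// Sketch: count grapheme clusters with the built-in segmenter.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// One sample per rule, written with escapes so the code points are explicit.
const samples = {
  combiningMark: "e\u0301",                // e + combining acute accent
  variationSelector: "\u2764\uFE0F",       // heart + variation selector-16
  flag: "\u{1F1EF}\u{1F1F5}",              // regional indicators J + P
  zwjFamily: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}",
};

for (const [name, text] of Object.entries(samples)) {
  // Prints the grapheme count next to the code point count.
  console.log(name, countGraphemes(text), [...text].length);
}
```

Each sample should report one grapheme cluster even though the code point counts are 2, 2, 2, and 7.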

2. Unicode Code Points

A code point is an abstract number in the Unicode standard — values between U+0000 and U+10FFFF assigned to characters. The code point count is the number of Unicode scalar values in the string (excluding unpaired surrogates).

Code point count equals grapheme cluster count for simple text (ASCII, most Latin languages). It diverges when:

  • Combining marks are present: é as U+0065 + U+0301 is 2 code points, 1 grapheme
  • Emoji ZWJ sequences are present: multiple emoji joined by zero-width joiners count as multiple code points but one grapheme
  • Flag emoji are present: two regional indicator code points form one grapheme

Use code points when: reasoning about Unicode structure, working with low-level Unicode APIs, comparing against Unicode standard limits, determining whether normalization is needed.

Languages that count this way: Python 3 (len(str)), Ruby (str.length), Swift (str.unicodeScalars.count), Rust (str.chars().count()).

Example divergence: café with the é decomposed into base + combining accent is 5 code points but 4 graphemes.
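The café example can be reproduced directly. The sketch below uses only standard JavaScript (the string iterator and String.prototype.normalize); the variable names are illustrative.

```javascript
// Two spellings of "café" that render identically.
const composed = "caf\u00E9";     // é as the single code point U+00E9
const decomposed = "cafe\u0301";  // e followed by U+0301 combining acute accent

// Spreading a string iterates by code point.
const codePointCount = (text) => [...text].length;

console.log(codePointCount(composed));    // 4
console.log(codePointCount(decomposed));  // 5
console.log(composed === decomposed);     // false

// Normalizing to the same form makes the two spellings compare equal.
console.log(composed === decomposed.normalize("NFC")); // true
console.log(composed.normalize("NFD") === decomposed); // true
```

A code point count that changes under normalize("NFC") is a quick signal that the input contains decomposed sequences.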

3. UTF-8 Bytes

This is the byte count when the string is encoded in UTF-8, the encoding used for virtually all web content, most databases (when configured correctly), and modern file systems. UTF-8 is a variable-width encoding:

| Code point range | UTF-8 bytes per character |
| --- | --- |
| U+0000–U+007F | 1 byte (ASCII range) |
| U+0080–U+07FF | 2 bytes |
| U+0800–U+FFFF | 3 bytes |
| U+10000–U+10FFFF | 4 bytes |

This means the byte count grows significantly when strings contain non-ASCII content:

  • hello = 5 bytes (all ASCII)
  • héllo = 6 bytes (é is 2 bytes)
  • こんにちは = 15 bytes (each hiragana is 3 bytes, 5 characters × 3)
  • 😀😀😀 = 12 bytes (each emoji is 4 bytes, 3 emoji × 4)
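The range table translates directly into code. The following is a minimal sketch with illustrative function names; it assumes well-formed input (no unpaired surrogates) and can be cross-checked against TextEncoder.

```javascript
// Bytes needed for one code point, following the UTF-8 range table.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Sum the per-code-point sizes; for...of iterates by code point.
function utf8ByteLength(text) {
  let total = 0;
  for (const char of text) {
    total += utf8BytesForCodePoint(char.codePointAt(0));
  }
  return total;
}

console.log(utf8ByteLength("hello"));                       // 5
console.log(utf8ByteLength("h\u00E9llo"));                  // 6
console.log(utf8ByteLength("こんにちは"));                    // 15
console.log(utf8ByteLength("\u{1F600}\u{1F600}\u{1F600}")); // 12
```

In practice new TextEncoder().encode(text).length gives the same number; the explicit version only makes the table's arithmetic visible.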

Use UTF-8 bytes when: sizing database fields, checking API payload size limits, computing file size for text, validating input against byte-based length constraints, working with systems that impose byte limits (MySQL VARCHAR, many key-value stores, HTTP header limits).

Practical danger: MySQL's utf8 charset (not utf8mb4) is limited to 3-byte sequences, silently truncating or rejecting emoji. A column defined as VARCHAR(100) with utf8 charset cannot store even a single 4-byte emoji character without error or silent data loss.

4. UTF-16 Code Units

UTF-16 represents each Unicode code point as one or two 16-bit code units. Code points in the Basic Multilingual Plane (U+0000–U+FFFF) use one code unit. Code points in supplementary planes (U+10000 and above) use two code units — a surrogate pair.

This encoding is used internally by JavaScript (ECMAScript), Java, C#/.NET, and Windows APIs. The consequence is that the string length these languages report (str.length in JavaScript, str.length() in Java, str.Length in C#) is the UTF-16 code unit count, not the grapheme count or even the code point count.

The JavaScript string length problem:

"hello".length       // 5 — correct, all BMP
"café".length        // 4 — correct, é is BMP (U+00E9)
"😀".length          // 2 — WRONG for user-perceived length; emoji is a surrogate pair
"😀😀😀".length      // 6 — three emoji, each 2 code units
[..."😀😀😀"].length  // 3 — spread into array to count code points correctly

Use UTF-16 code units when: working with JavaScript string APIs, calculating character positions in JavaScript strings (for substring, indexOf, etc.), interfacing with Java or .NET string lengths, working with Windows APIs.

Practical danger: JavaScript's String.prototype.charAt(i) and str[i] return UTF-16 code units, not characters. Accessing the second code unit of a surrogate pair gives a dangling surrogate — an invalid character. Always use [...str] spread or str.codePointAt() when you need to iterate Unicode characters in JavaScript.
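To make the difference concrete, this sketch walks the same string once by code unit and once by code point. It uses only standard JavaScript; the variable names are illustrative.

```javascript
const text = "a\u{1F600}b"; // a😀b

// Walking by index visits UTF-16 code units, so the emoji shows up
// as two separate surrogates.
const units = [];
for (let i = 0; i < text.length; i += 1) {
  units.push(text.charCodeAt(i).toString(16));
}
console.log(units); // [ '61', 'd83d', 'de00', '62' ]

// for...of uses the string iterator, which keeps surrogate pairs together.
const codePoints = [];
for (const char of text) {
  codePoints.push(char.codePointAt(0).toString(16));
}
console.log(codePoints); // [ '61', '1f600', '62' ]
```

Note that codePointAt(i) still takes a code unit index, so it is safest when combined with for...of or the spread syntax rather than a manual counter.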

5. ASCII-Safe (Percent-Encoded) Length

For URL-related contexts, this measurement shows the length of the string after percent-encoding all non-ASCII bytes. ASCII characters remain as-is; non-ASCII characters are encoded as %XX sequences (one %XX per byte, three characters per byte).

For example:

  • hello → hello (5 characters, ASCII-safe)
  • café → caf%C3%A9 (9 characters, because é encodes to two bytes, %C3%A9)
  • 日本 → %E6%97%A5%E6%9C%AC (18 characters, each kanji is 3 bytes = 9 characters each)

Use percent-encoded length when: checking if a URL will exceed browser or server limits, validating input that will appear in a URL path or query string, building URLs with non-ASCII path segments, checking compliance with RFC 3986.
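The measurement as described (ASCII left alone, every non-ASCII byte written as %XX) can be sketched as follows. percentEncodeNonAscii is a hypothetical helper, not the calculator's code. The built-in encodeURIComponent is stricter: it also escapes spaces and several reserved ASCII characters, so its output can be longer.

```javascript
// Sketch: percent-encode only the non-ASCII bytes of the UTF-8 encoding.
function percentEncodeNonAscii(text) {
  let out = "";
  for (const byte of new TextEncoder().encode(text)) {
    out += byte < 0x80
      ? String.fromCharCode(byte) // ASCII byte: keep as-is
      : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(percentEncodeNonAscii("hello"));          // hello
console.log(percentEncodeNonAscii("caf\u00E9"));      // caf%C3%A9
console.log(percentEncodeNonAscii("日本").length);     // 18
```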

Why Measurements Disagree: Practical Implications

The calculator makes the disagreements concrete with a visual breakdown. For a string like café☕, the measurements would be:

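| Measurement | Count | Why |
| --- | --- | --- |
| Grapheme clusters | 5 | c, a, f, é, ☕ |
| Code points | 5 or 6 | depends on é normalization form (precomposed U+00E9, or e + U+0301) |
| UTF-8 bytes | 8 | c(1)+a(1)+f(1)+é(2)+☕(3), with é precomposed |
| UTF-16 code units | 5 or 6 | same split as code points; ☕ (U+2615) is BMP, so 1 unit |
| Percent-encoded | 18 | 3 ASCII chars as-is (3), é as %C3%A9 (6), ☕ as %E2%98%95 (9), with é precomposed |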

The calculator shows this breakdown with each character's contribution to each measurement highlighted, so you can immediately see which characters are driving the divergence.
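A per-character breakdown of this kind can be approximated in a few lines. The sketch below is an assumption about how such a breakdown could be built, not the calculator's implementation; breakdown is a hypothetical name, and the runtime is assumed to support Intl.Segmenter.

```javascript
// Sketch: one row per grapheme cluster with its contribution to each measure.
function breakdown(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const encoder = new TextEncoder();
  return [...segmenter.segment(text)].map(({ segment }) => {
    const bytes = encoder.encode(segment);
    let percentEncoded = 0;
    for (const byte of bytes) percentEncoded += byte < 0x80 ? 1 : 3;
    return {
      grapheme: segment,
      codePoints: [...segment].length,
      utf8Bytes: bytes.length,
      utf16Units: segment.length,
      percentEncoded,
    };
  });
}

// café☕ with a precomposed é (U+00E9)
const rows = breakdown("caf\u00E9\u2615");
console.table(rows);

// Column totals reproduce the summary counts.
const total = (key) => rows.reduce((sum, row) => sum + row[key], 0);
console.log(total("utf8Bytes"), total("percentEncoded")); // 8 18
```

Passing the decomposed spelling ("cafe\u0301\u2615") keeps five rows but shows the é row carrying two code points and three bytes.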

Database Field Sizing

One of the most practically important uses of the String Length Calculator is determining the right database field size for text that may contain non-ASCII characters.

The MySQL VARCHAR Trap

MySQL's VARCHAR(n) counts n in characters, not bytes, whichever charset the column uses. The trap is the legacy utf8 charset (an alias for utf8mb3), which can only store characters that need at most 3 bytes in UTF-8. If your application stores user-generated content with emoji, you must use utf8mb4:

-- WRONG: utf8 (utf8mb3) rejects or truncates 4-byte emoji
username VARCHAR(50) CHARACTER SET utf8

-- CORRECT: supports full Unicode including emoji
username VARCHAR(50) CHARACTER SET utf8mb4

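With utf8mb4, a VARCHAR(50) can store up to 50 characters but may use up to 200 bytes of storage (50 × 4 bytes maximum per character), plus a length prefix. Index key limits are measured in bytes, so the number of characters that can be indexed shrinks accordingly: InnoDB's 767-byte key limit for the older COMPACT and REDUNDANT row formats, for example, covers only 191 utf8mb4 characters.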
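Before writing to a column, an application can check whether a value needs utf8mb4 and what the worst-case storage would be. The helpers below are a hypothetical sketch in JavaScript (the names are not part of any MySQL driver) and assume the column length is counted in code points.

```javascript
// True if the text contains a code point above U+FFFF,
// i.e. a character that needs 4 bytes in UTF-8.
function needsUtf8mb4(text) {
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

// Character-count check for VARCHAR(n), counting code points.
function fitsVarchar(text, maxChars) {
  return [...text].length <= maxChars;
}

// Worst-case payload bytes for n characters (excluding the length prefix).
function worstCaseBytes(maxChars, maxBytesPerChar) {
  return maxChars * maxBytesPerChar;
}

console.log(needsUtf8mb4("caf\u00E9"));   // false
console.log(needsUtf8mb4("\u{1F600}"));   // true
console.log(worstCaseBytes(50, 4));       // 200
```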

PostgreSQL

PostgreSQL's VARCHAR(n) and CHAR(n) count in characters (code points), not bytes. Emoji and multi-byte characters each count as one character. PostgreSQL uses the database encoding (set at database creation) and requires UTF-8 for full Unicode support. The byte count is still relevant for row size estimation and index size.

Redis and Key-Value Stores

Redis strings are binary-safe and the size limit (512MB per key) is in bytes. For Redis key names, the constraint is practical performance rather than a hard limit — very long keys waste memory and slow key comparison. The UTF-8 byte count from the calculator gives you the actual Redis key length for non-ASCII keys.

API Payload Limits

Limits on messages and payloads are expressed in different units, and many are measured in bytes rather than characters:

  • Twitter/X: 280 characters, counted with a weighted scheme rather than as plain grapheme clusters (URLs and certain Unicode ranges are weighted differently)
  • SMS: 160 characters in a single message when every character is in the GSM 7-bit alphabet, 70 UTF-16 code units (UCS-2) in a single message as soon as any character falls outside it
  • HTTP headers: typically 8KB per header, measured in bytes
  • Many REST APIs: payload limits in bytes (commonly 1MB, 10MB, or 100MB)

The String Length Calculator's byte count tells you the actual payload size for API calls that contain non-ASCII text.
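For byte-based limits, the same check can be done in code before sending a request. This is a minimal sketch; utf8Size, fitsByteLimit, and the 1 MB limit are illustrative assumptions, not values from any particular API.

```javascript
// UTF-8 size of a string in bytes.
function utf8Size(text) {
  return new TextEncoder().encode(text).length;
}

// Compare the encoded size against a byte limit.
function fitsByteLimit(text, limitBytes) {
  return utf8Size(text) <= limitBytes;
}

// The serialized body is what counts, not the raw field value.
const body = JSON.stringify({ message: "こんにちは" });
console.log(body.length);    // 19 UTF-16 code units
console.log(utf8Size(body)); // 29 bytes

const ONE_MB = 1024 * 1024;  // hypothetical limit
console.log(fitsByteLimit(body, ONE_MB)); // true
```

Checking body.length instead of the byte size would undercount here by 10 bytes, and the gap grows with the share of non-ASCII text.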

UI Truncation

User interface truncation — cutting text to fit in a fixed-width container — should operate on grapheme clusters, not bytes or code units. If you truncate at a byte or code unit boundary, you may cut through a multi-byte character, producing a corrupted character at the end of the truncated string.

The calculator helps you plan truncation logic:

  1. Decide the target length in grapheme clusters (what the user perceives)
  2. Use the calculator to find the corresponding UTF-8 byte count for strings at that length
  3. Implement truncation on the grapheme cluster boundary

In most languages, grapheme-cluster-aware truncation requires a library or a dedicated API:

  • Python: grapheme package, or unicodedata-based custom logic
  • JavaScript: Intl.Segmenter (modern browsers and Node.js 16+)
  • Swift: built-in String iteration (already grapheme-cluster-based)
  • Java/Android: BreakIterator.getCharacterInstance()
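In JavaScript, truncation on a grapheme cluster boundary might look like the sketch below. truncateGraphemes is a hypothetical helper name, and the code assumes Intl.Segmenter is available.

```javascript
// Sketch: keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxGraphemes) break;
    result += segment; // append the whole cluster, never part of one
    count += 1;
  }
  return result;
}

const decomposedCafe = "cafe\u0301!"; // café! with a combining accent

// slice() works on code units and drops the combining accent.
console.log(decomposedCafe.slice(0, 4) === "cafe");                    // true
// Grapheme-aware truncation keeps the accent attached to its base letter.
console.log(truncateGraphemes(decomposedCafe, 4) === "cafe\u0301");    // true
```

The same function leaves emoji ZWJ sequences and flags intact, because each of them is a single segment.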

Keyboard Shortcut and URL Sharing

After entering text, the URL updates with the text encoded as a query parameter — making the current calculation shareable. Copy the URL from the browser address bar to share a calculation with a teammate. The URL format is /tools/string-length/?text=encoded-text, with the text percent-encoded for URL safety.
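A link in this format can also be built programmatically. The sketch below assumes the text parameter is encoded the way encodeURIComponent does it and uses https://example.com as a placeholder origin; buildShareUrl is a hypothetical helper, and the site's own encoding may differ in detail.

```javascript
// Sketch: build a shareable link for the path format described above.
// The origin is a placeholder, not the site's real address.
function buildShareUrl(text, origin) {
  return `${origin}/tools/string-length/?text=${encodeURIComponent(text)}`;
}

const shareUrl = buildShareUrl("caf\u00E9 \u2615", "https://example.com");
console.log(shareUrl);
// https://example.com/tools/string-length/?text=caf%C3%A9%20%E2%98%95

// Reading the parameter back returns the original text.
console.log(new URL(shareUrl).searchParams.get("text")); // café ☕
```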

The input field clears with Escape and accepts a paste from clipboard with the standard system shortcut (Ctrl+V or Cmd+V). The calculation resets immediately when the input field is cleared.

For per-character detail beyond the summary counts shown here, the Character Analyzer at /tools/character-counter/ provides the full Unicode property breakdown. For converting a specific character to its UTF-8 or UTF-16 representation, use the Encoding Converter at /tools/encoding-converter/.
