SymbolFYI

How to Use the SymbolFYI Character Analyzer

June 3, 2025

The Character Analyzer at /tools/character-counter/ does something that sounds simple — count characters — but quickly reveals how surprisingly complex that question is in Unicode. Beyond counts, the tool breaks any text into its individual Unicode code points, showing you the name, category, script, and encoding of every character in your string. This guide explains every feature and when to reach for each one.

What the Character Analyzer Does

You paste text. The tool immediately shows you:

  • How many grapheme clusters it contains (what you'd typically call "characters")
  • How many Unicode code points it contains (which may differ)
  • How many bytes it takes in UTF-8 (which may differ again)
  • A character-by-character breakdown with full Unicode properties for each element

This breakdown is invaluable for debugging text that behaves unexpectedly: strings that claim to be equal but aren't, character limits that behave inconsistently across platforms, or text copied from external sources that contains invisible characters.

Text Input and Real-Time Analysis

The input area accepts text from any source: typed directly, pasted from the clipboard, or dragged from another application. Analysis updates in real time as you type or edit, with no submit button required.

The input area handles:

  • Plain ASCII text
  • Accented and diacritical letters (é, ñ, ü)
  • Right-to-left text (Arabic, Hebrew)
  • CJK characters (Chinese, Japanese, Korean)
  • Emoji, including multi-component emoji sequences
  • Invisible characters (zero-width spaces, directional marks, variation selectors)
  • Any valid Unicode text, up to the character limit shown in the interface

Pasted text retains its original character composition — the tool does not normalize or transform your input. What you paste is what gets analyzed.

Understanding the Four Counts

The summary row at the top of the analysis panel shows up to four counts for the same text. They will often be different numbers, and understanding why is the main reason to use this tool.

Grapheme Clusters

A grapheme cluster is what a typical user thinks of as one character — a single visible unit of text. The count of grapheme clusters matches what you'd get if you selected your text and counted by cursor positions or by visual characters on screen.

For simple ASCII text, the grapheme count, the code point count, and the UTF-8 byte count are all the same number, because every character is one code point stored in one byte. As text becomes more complex, these diverge:

  • An accented letter like é can be either one code point (U+00E9, precomposed form) or two code points (U+0065 + U+0301, base letter plus combining accent) — but it is always one grapheme cluster
  • An emoji like 👨‍👩‍👧‍👦 (family: man, woman, girl, boy) is one grapheme but seven code points joined by zero-width joiners
  • A flag emoji like 🇫🇷 is one grapheme but two code points (regional indicator letters F and R)

The grapheme count is the right number to show users for display purposes — "your username is 15 characters" — because it matches their visual experience.

Unicode Code Points

A code point is a single entry in the Unicode standard, identified by a number between U+0000 and U+10FFFF. The code point count is what Python 3 and Ruby report as the string length, what Swift reports as str.unicodeScalars.count (Swift's own count property returns grapheme clusters), and what JavaScript reports as [...str].length (spreading iterates by code point, so each surrogate pair is counted once).

Code point count diverges from grapheme count when:

  • Combining marks are used (base character + separate combining mark = 2 code points, 1 grapheme)
  • Emoji sequences are present (joined emoji = multiple code points, 1 grapheme)

The code point count is the relevant number when reasoning about Unicode structure, comparing against Unicode limits, or working with lower-level string APIs.
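
As a rough illustration of the divergence described above, Python 3's built-in len() counts code points, so it can be used to reproduce the examples from this guide. The variable names are made up for this sketch, and the strings are written with escapes so that each code point is visible.

```python
# len() on a Python 3 str counts Unicode code points, not graphemes.
precomposed = "\u00e9"   # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
# Family emoji: man, woman, girl, boy joined by three zero-width joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
flag_fr = "\U0001F1EB\U0001F1F7"  # regional indicators F and R

code_point_counts = {
    "precomposed": len(precomposed),
    "decomposed": len(decomposed),
    "family": len(family),
    "flag_fr": len(flag_fr),
}
print(code_point_counts)
# {'precomposed': 1, 'decomposed': 2, 'family': 7, 'flag_fr': 2}
```

Each of these four strings is a single grapheme cluster on screen, yet only the precomposed é is a single code point.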

UTF-8 Bytes

The UTF-8 byte count is the number of bytes the text would occupy when stored or transmitted as UTF-8. This is the number that matters for:

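  • Database field sizes when the limit is measured in bytes (some databases size VARCHAR columns in characters, others in bytes, so check which applies to yours)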
  • HTTP payload limits
  • File size calculations
  • Any system that stores or transmits the raw byte representation

UTF-8 encodes different characters with different byte counts:

  • U+0000–U+007F: 1 byte (ASCII)
  • U+0080–U+07FF: 2 bytes (accented Latin, Arabic, Hebrew, and more)
  • U+0800–U+FFFF: 3 bytes (most CJK, many symbols)
  • U+10000–U+10FFFF: 4 bytes (most emoji, supplementary characters)

A 20-grapheme string that includes emoji may require 40–60 bytes in UTF-8. If your database column is defined as VARCHAR(50) and sized for bytes, emoji-heavy input will truncate far short of 50 visible characters.
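
The byte widths above can be checked with a short sketch. The helper names utf8_len and fits_byte_limit are hypothetical, chosen only for this example, and the 50-byte limit mirrors the VARCHAR(50) scenario under the assumption that the column is sized in bytes.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes one code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# One sample from each UTF-8 width class: A, é, 中, 😀
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
widths = [utf8_len(ch) for ch in samples]
print(widths)  # [1, 2, 3, 4]

def fits_byte_limit(text: str, limit: int) -> bool:
    """True if the UTF-8 encoding of text is at most `limit` bytes."""
    return len(text.encode("utf-8")) <= limit

username = "\U0001F600" * 20  # 20 emoji, 20 code points
print(len(username), len(username.encode("utf-8")))  # 20 80
print(fits_byte_limit(username, 50))  # False
```

A string made entirely of 4-byte emoji is the worst case: 20 visible characters already exceed a 50-byte column.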

UTF-16 Code Units

JavaScript uses UTF-16 internally, and str.length in JavaScript counts UTF-16 code units — not graphemes, not Unicode code points. Characters below U+10000 are one code unit; characters at U+10000 and above require two code units (a surrogate pair).

This means "😀".length returns 2 in JavaScript, even though it's one grapheme and one code point. The UTF-16 code unit count is relevant when working with JavaScript string APIs, calculating positions in JavaScript strings, or interfacing with systems that use UTF-16 (Windows APIs, Java strings, .NET strings).
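
Outside JavaScript, the same number can be derived by encoding to UTF-16 and dividing the byte length by two. The following is a minimal sketch; utf16_units is a hypothetical helper name, and it assumes the input contains no lone surrogates (Python refuses to encode those).

```python
def utf16_units(text: str) -> int:
    """Count UTF-16 code units, matching JavaScript's str.length."""
    # "utf-16-le" writes no byte-order mark; every code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2

print(utf16_units("A"))           # 1
print(utf16_units("\u4e2d"))      # 1  (中 is below U+10000)
print(utf16_units("\U0001F600"))  # 2  (😀 needs a surrogate pair)
```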

The Character Analyzer shows all four counts simultaneously so you can see at a glance where they agree and where they diverge for your specific input.

The Per-Character Breakdown Table

Below the summary counts, a scrollable table shows every character in your text as a separate row. For each character, the table shows:

Glyph

The character rendered visually. For invisible characters, a placeholder label such as "[ZWSP]" (zero-width space) or "[LRM]" (left-to-right mark) replaces the blank glyph so you can see that something is there.

Code Point

The Unicode code point in both hex (U+XXXX) and decimal. The hex form links to the character's detail page on SymbolFYI for more context.

Name

The official Unicode character name, as defined in the Unicode Character Database. Names are in uppercase by convention: LATIN SMALL LETTER E WITH ACUTE, COPYRIGHT SIGN, ZERO WIDTH SPACE.

For characters that have no official name (such as private-use characters), the table shows a description based on the character's properties.

General Category

The Unicode General Category — a two-letter code plus a descriptive label:

Code  Label              Examples
Ll    Lowercase Letter   a, é, ñ
Lu    Uppercase Letter   A, É, Ñ
Nd    Decimal Number     0–9, ١–٩
Po    Other Punctuation  . , ! ?
Sm    Math Symbol        + = < > ∑
Zs    Space Separator    (various space types)
Cf    Format Character   zero-width joiners, directional marks
So    Other Symbol       ©, ®, ™
Mn    Non-Spacing Mark   combining accents

The category column is particularly useful for spotting invisible format characters (category Cf) that might be causing unexpected behavior in string comparisons or text processing.

Script

The writing system the character belongs to: Latin, Cyrillic, Arabic, Han, Common, Inherited. The Common script includes characters like digits, punctuation, and symbols that are shared across scripts. Inherited includes combining marks that take the script of the preceding base character.

Script information is important for security analysis — lookalike characters from different scripts (homoglyphs) will show different script values even if they look visually identical.
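
For a quick check outside the tool, a crude mixed-script detector can be sketched in Python. This is a heuristic and not how the Character Analyzer works: the standard library does not expose the Unicode Script property, so the hypothetical rough_scripts helper below uses the first word of each letter's Unicode name as a stand-in.

```python
import unicodedata

def rough_scripts(text: str) -> set:
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of a letter's Unicode name (LATIN,
    CYRILLIC, GREEK, ...) is used as a proxy for its script. This is
    not the real Script property and will misjudge some characters.
    """
    scripts = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):  # letters only
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

print(rough_scripts("paypal"))        # {'LATIN'}
# Second letter replaced by CYRILLIC SMALL LETTER A (U+0430).
print(sorted(rough_scripts("p\u0430ypal")))  # ['CYRILLIC', 'LATIN']
```

More than one entry in the result for a single word is a hint that a lookalike character may be present.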

Block

The Unicode block the character belongs to. Blocks are contiguous ranges of code points that typically group related characters. Examples: Basic Latin, General Punctuation, Mathematical Operators, Enclosed Alphanumerics.

UTF-8 Bytes

The byte-level encoding of this specific character in UTF-8, shown as hex octets. This makes it easy to see exactly how much space each character contributes to the total byte count.
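
Several of these columns can be approximated with Python's unicodedata module. The sketch below is not the tool's implementation; breakdown is a hypothetical helper, and it leaves out Script and Block because the standard library does not provide them.

```python
import unicodedata

def breakdown(text: str) -> list:
    """Build one row per code point: code point, name, category, UTF-8 bytes."""
    rows = []
    for ch in text:
        rows.append({
            "code_point": f"U+{ord(ch):04X}",
            # Fall back to a placeholder for characters without a name.
            "name": unicodedata.name(ch, "<unnamed>"),
            "category": unicodedata.category(ch),
            # Hex octets separated by spaces (needs Python 3.8+).
            "utf8": ch.encode("utf-8").hex(" ").upper(),
        })
    return rows

# e + combining acute accent, zero-width space, copyright sign
for row in breakdown("e\u0301\u200b\u00a9"):
    print(row)
```

For this input the rows report categories Ll, Mn, Cf, and So, and the zero-width space shows up as the three bytes E2 80 8B.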

Detecting Invisible Characters

One of the most practical uses of the Character Analyzer is finding characters that are in a string but invisible to the eye. Common invisible characters include:

Zero-Width Space (U+200B) — looks like nothing, but it's there. Often introduced by copy-pasting from web pages or word processors. Can cause string comparison failures and search mismatches.

Zero-Width Non-Joiner (U+200C) and Zero-Width Joiner (U+200D) — control how adjacent characters connect. The ZWJ is used legitimately in emoji sequences (👨‍💼 is man + ZWJ + briefcase), but stray ZWJs in plain text can cause issues.

Left-to-Right Mark (U+200E) and Right-to-Left Mark (U+200F) — directional markers that affect text flow. Invisible but can cause text to render unexpectedly when mixed-direction content is involved.

Non-Breaking Space (U+00A0) — looks like a regular space but is not. A string containing word1[NBSP]word2 will not match word1 word2 even though both look identical on screen.

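Variation Selectors (U+FE00–U+FE0F) choose among alternate renderings of the preceding character, most commonly text versus emoji presentation. A character that supports both presentations renders as a text symbol when followed by VS-15 (U+FE0E) and as an emoji when followed by VS-16 (U+FE0F). The variation selector itself is invisible but changes rendering.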

In the Character Analyzer, all of these appear in the breakdown table with their correct names and category labels. Invisible characters that are format characters (Cf category) are highlighted in the table with a yellow background to make them immediately visible.
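
The same kind of scan can be scripted for batch checks. The sketch below is an assumption about what counts as "invisible" (format characters, any space separator other than the ordinary space, and the variation selectors listed above); find_invisible is a hypothetical name and the tool's own rules may differ.

```python
import unicodedata

def find_invisible(text: str) -> list:
    """Return (index, code point, name) for characters that are easy to miss."""
    hits = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        is_format = category == "Cf"                    # ZWSP, ZWJ, LRM, RLM, ...
        is_odd_space = category == "Zs" and ch != " "   # NBSP and other spaces
        is_variation_selector = 0xFE00 <= ord(ch) <= 0xFE0F
        if is_format or is_odd_space or is_variation_selector:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

print(find_invisible("word1\u00a0word2\u200b"))
# [(5, 'U+00A0', 'NO-BREAK SPACE'), (11, 'U+200B', 'ZERO WIDTH SPACE')]
```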

Use Cases

Checking String Byte Length for a Database

You're implementing a username field that your database stores as VARCHAR(50) in a UTF-8 database. The UI should reject usernames over 50 bytes, but users expect a "50 character" limit. Paste a username with emoji into the Character Analyzer to see both the grapheme count (what users perceive) and the UTF-8 byte count (what the database sees). This tells you how to communicate the limit clearly and whether your validation logic needs to count bytes, not characters.

Finding Hidden Characters in Pasted Text

A content editor reports that search isn't finding articles even when the title exactly matches the search query. Paste the article title into the Character Analyzer — if the breakdown shows zero-width spaces or non-breaking spaces that shouldn't be there, you've found the bug. The characters are invisible in the title display but present in the string, causing the exact-match search to fail.

Validating Emoji Display Across Platforms

You're using a multi-component emoji in your app and seeing different rendering on different platforms. Paste the emoji into the Character Analyzer. The code point breakdown shows exactly what Unicode sequence makes up the emoji — if it's a ZWJ sequence, the table shows each component and each ZWJ. This lets you check whether the sequence is a well-formed recommended emoji or a custom combination with inconsistent platform support.

Debugging API String Truncation

An API response is being truncated at what appears to be a random character position — not at your expected 100-character limit. Paste the response body into the Character Analyzer and check the UTF-8 byte count. If the API imposes a byte limit rather than a character limit, the truncation point will make sense. The per-character table shows exactly where byte 100 falls in the character sequence.
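
If you control the truncating side, a byte limit can be applied without cutting a character in half. This is a minimal sketch with a hypothetical truncate_utf8 helper. It protects code points only, so a multi-code-point grapheme such as a ZWJ emoji sequence can still be split.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end in the middle of a multi-byte sequence;
    # errors="ignore" drops that incomplete tail when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

message = "ab\U0001F600"  # 2 ASCII bytes + one 4-byte emoji = 6 bytes
print(truncate_utf8(message, 4))  # ab
print(truncate_utf8(message, 6) == message)  # True
```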

Confirming Unicode Normalization

Two strings look identical and contain the same characters but compare as not equal. Paste both into the Character Analyzer in separate sessions and compare the code point sequences. If one uses precomposed characters (e.g., U+00E9 for é) and the other uses combining sequences (U+0065 + U+0301), they're different code point sequences that render identically. You'll need to normalize both strings to the same normalization form before comparing.
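
In Python, the normalization step might look like the sketch below. NFC is used here as one reasonable choice of form, and equal_after_nfc is a hypothetical helper name. The important part is that both sides go through the same form.

```python
import unicodedata

def equal_after_nfc(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

a = "caf\u00e9"    # precomposed é (U+00E9)
b = "cafe\u0301"   # e + combining acute accent (U+0065 U+0301)

print(a == b)                 # False: different code point sequences
print(equal_after_nfc(a, b))  # True: identical once normalized
```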

Integration with Other Tools

The Character Analyzer connects naturally with the rest of SymbolFYI:

  • For any character in the breakdown table, click its code point link to open the full symbol detail page with complete encoding information.
  • To convert a character you've found into a different encoding format (CSS escape, HTML entity, etc.), use the Encoding Converter at /tools/encoding-converter/.
  • To browse for characters by Unicode block or category, use the Symbol Table at /tools/symbol-table/.

The Character Analyzer is often the first tool to reach for when something unexpected is happening with text — it strips away visual presentation and shows you exactly what's in the string at the Unicode level.
