How to Use the SymbolFYI Text Diff Tool

Aug 5, 2025

The SymbolFYI Text Diff Tool is built for a specific problem that general-purpose diff tools cannot solve: finding differences between two strings that look identical to the eye but aren't equal as data. Invisible characters, different Unicode space types, lookalike characters from different scripts, encoding artifacts — the Text Diff Tool exposes them all with character-level precision.

What the Text Diff Tool Does

Standard text diff tools compare lines or words and highlight additions, deletions, and changes. They work well for code and documents, but they fail at the Unicode level because they display text the same way a browser does — rendering look-alike characters the same way regardless of what Unicode code point underlies them.

The Text Diff Tool approaches comparison differently. It operates on Unicode code points, not on rendered glyphs. When two characters look the same but have different code points, the tool flags it as a difference. When two strings appear identical but contain different invisible characters, the difference shows up highlighted in the comparison.
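
To see why code points matter more than glyphs, here is a minimal Python sketch of the idea. It is not the tool's own code; the `code_points` helper and the sample strings are illustrative assumptions.

```python
def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

latin = "paypal"            # all Latin letters
mixed = "p\u0430yp\u0430l"  # Cyrillic U+0430 in place of each Latin 'a'

print(latin == mixed)       # False: the strings render alike but differ as data
print(code_points(latin))
print(code_points(mixed))
```

Printing the two lists side by side shows the mismatch at the second and fifth positions, which is the kind of difference a glyph-based view hides.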

This makes the tool uniquely suited for debugging:

Strings that fail equality checks despite looking the same
Text that behaves differently across platforms or encodings
Security-relevant lookalike substitutions in domain names or identifiers
Copy-paste corruption that introduces invisible characters

Using the Tool: Side-by-Side Input

The Text Diff Tool presents two text areas side by side, labeled Text A (left) and Text B (right). Paste or type your strings into each panel. The comparison runs in real time: as you type, the diff updates immediately, with no compare button to press.

The panels accept any Unicode text: plain ASCII, accented Latin, CJK characters, emoji, right-to-left text, control characters, and everything in between. The input areas preserve whatever you paste exactly as it arrives — no normalization, no trimming, no invisible character removal. Preserving the raw input is essential for finding invisible character bugs.

Character-Level Diff Highlighting

The primary diff view shows both strings rendered as sequences of character tiles, similar to the Character Analyzer's breakdown table. Each tile represents one Unicode code point and shows the glyph plus the code point below it.

Differences are highlighted by type using color coding:

Green tiles — characters present in Text A but not in Text B (deletions from A's perspective)
Blue tiles — characters present in Text B but not in Text A (additions in B)
Yellow tiles — characters that occupy the same position but have different code points (substitutions)
Gray tiles — matching characters present in both strings at the same position

The yellow substitution tiles are the most interesting category. A yellow tile marks a position where the two strings hold different Unicode code points, even when the glyphs look similar (or identical) in the rendered text. This is where lookalike character attacks, encoding issues, and normalization differences show up.
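
The four tile categories can be approximated with Python's standard `difflib`. This is a rough sketch under assumptions: the tool's actual alignment algorithm is not documented here, and the mapping from `difflib` tags to tile colors is only a guess based on the legend above.

```python
from difflib import SequenceMatcher

# Colors follow the legend above; mapping difflib tags to them is an assumption.
TILE_COLOR = {
    "equal": "gray",      # same code points in both strings
    "delete": "green",    # present in Text A only
    "insert": "blue",     # present in Text B only
    "replace": "yellow",  # different code points at the same position
}

def classify(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (color, run from A, run from B) for each aligned run."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (TILE_COLOR[tag], a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]

# U+2010 HYPHEN in A versus U+002D HYPHEN-MINUS in B
for color, run_a, run_b in classify("well\u2010known", "well-known"):
    print(color, repr(run_a), repr(run_b))
```

Python strings are sequences of code points, so comparing them element by element already works at the level the tool describes.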

The Substitution Detail Panel

Clicking a yellow substitution tile opens a side-by-side detail panel showing both characters:

Property	Text A Character	Text B Character
Glyph	‐	-
Code Point	U+2010	U+002D
Name	HYPHEN	HYPHEN-MINUS
Category	Pd (Dash)	Pd (Dash)
UTF-8	E2 80 90	2D
HTML Entity	&#x2010;	&#x2D;

In this example, both characters render as a dash, but they're different code points with different byte representations. A system doing a string comparison would report them as unequal. A system searching for a hyphen-minus (the standard keyboard character) wouldn't match the typographic hyphen.
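
If you want to reproduce the rows of this panel outside the tool, Python's standard `unicodedata` module exposes the same properties. The `describe` helper below is a hypothetical name used for illustration.

```python
import unicodedata

def describe(ch: str) -> dict[str, str]:
    """Collect the properties shown in the detail panel for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "utf8": ch.encode("utf-8").hex(" ").upper(),
    }

print(describe("\u2010"))  # typographic hyphen
print(describe("-"))       # keyboard hyphen-minus

# A search for the keyboard character misses the typographic one.
print("well\u2010known".find("-"))  # -1
```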

Detecting Invisible Differences

Invisible characters are the most common cause of "these strings look identical but aren't" bugs. The Text Diff Tool handles them in a specific way:

Invisible characters — zero-width spaces, directional marks, variation selectors, non-breaking spaces, format characters — are rendered as labeled placeholder tiles rather than blank tiles. A zero-width space appears as a tile showing [ZWSP] with code point U+200B. A non-breaking space appears as [NBSP] with code point U+00A0.

This makes invisible character differences immediately obvious. If Text A has word1 word2 with an ordinary space and Text B has word1 word2 with a non-breaking space, the space tile in B shows [NBSP] and is highlighted yellow (substitution). Without this treatment, both strings would look identical and the difference would be invisible.
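
The placeholder idea is easy to imitate when you need readable debug output in your own code. The label table below is an illustrative assumption, not the tool's own list, and the fallback simply prints the code point.

```python
import unicodedata

# Short labels for a few common invisible characters (illustrative, not exhaustive).
LABELS = {
    "\u00A0": "NBSP",
    "\u200B": "ZWSP",
    "\u200C": "ZWNJ",
    "\u200D": "ZWJ",
    "\u200E": "LRM",
    "\u200F": "RLM",
    "\uFEFF": "BOM",
}

def tile(ch: str) -> str:
    """Return a visible stand-in for invisible characters, else the character itself."""
    if ch in LABELS:
        return f"[{LABELS[ch]}]"
    # Format characters, control characters, and unusual spaces get a code point label.
    if ch != " " and unicodedata.category(ch) in ("Cf", "Cc", "Zs"):
        return f"[U+{ord(ch):04X}]"
    return ch

def tiles(text: str) -> list[str]:
    return [tile(ch) for ch in text]

print(tiles("word1 word2"))       # ordinary space stays as-is
print(tiles("word1\u00A0word2"))  # the space position shows [NBSP]
```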

Common Invisible Difference Scenarios

Non-breaking vs. regular space: Web copy tools and word processors often substitute non-breaking spaces (U+00A0) for regular spaces (U+0020), especially before or after certain characters. The text looks the same but will word-wrap differently and fail exact-match comparisons.

Zero-width space artifacts: Copy-pasting text from websites, particularly content management systems with rich text editors, often introduces zero-width spaces between words. These are entirely invisible and cause puzzling string equality failures.

Directional marks: Text copied from a document that mixes left-to-right and right-to-left content often carries invisible directional marks (U+200E, U+200F, U+202A–U+202E). These are invisible in most contexts but affect rendering and appear in string data.

Variation selectors: Some characters have both a text form and an emoji form selected by variation selectors (U+FE0E for text, U+FE0F for emoji). The base character looks the same in either case, but the presence or absence of the variation selector makes the strings unequal. Text Diff flags the variation selector tile explicitly.
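
All four scenarios can be caught by scanning on Unicode general category rather than on a fixed character list. The sketch below is one possible approach, with `find_invisibles` as a hypothetical helper; it treats format characters (category Cf), spaces other than U+0020, and the variation selectors U+FE00 to U+FE0F as suspicious.

```python
import unicodedata

def find_invisibles(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for characters that are easy to miss."""
    hits = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        is_odd_space = category == "Zs" and ch != " "
        is_variation_selector = 0xFE00 <= ord(ch) <= 0xFE0F
        if category == "Cf" or is_odd_space or is_variation_selector:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

sample = "a\u200Bb\u00A0c\u2764\uFE0F"
for hit in find_invisibles(sample):
    print(hit)
```

Whether a given hit is a bug depends on context: a zero-width joiner inside an emoji sequence is legitimate, while one between two Latin words usually is not.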

Comparing Different Unicode Normalization Forms

Unicode defines multiple normalization forms for representing the same conceptual character:

NFC (Normalization Form C, canonical composition): uses precomposed characters where available. é is U+00E9.
NFD (Normalization Form D, canonical decomposition): uses base characters plus combining marks. é is U+0065 + U+0301.
NFKC and NFKD: compatibility forms that additionally replace compatibility variants, such as the ligature ﬁ (U+FB01) or full-width letters, with their plain equivalents.

Text from different sources may arrive in different normalization forms. Both forms of é look identical when rendered, but they're different byte sequences — é in NFC is two bytes in UTF-8; in NFD, it's three bytes (one for the base e, two for the combining accent).

The Text Diff Tool exposes normalization differences clearly. If Text A has café in NFC and Text B has café in NFD, the diff shows:

The c, a, f tiles as matching (gray)
The é position as different: Text A has one tile (U+00E9), Text B has two tiles (U+0065 + U+0301)

The tool also offers a Normalize and Compare mode. Toggle this on to apply NFC normalization to both strings before comparing. If two strings are normalized-equal but not literally equal, the comparison switches from showing differences to showing a match. This helps you determine whether a difference is a normalization issue (fixable by normalizing both sides) or a genuine data difference.
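
The same check is straightforward to run in code. This sketch uses Python's standard `unicodedata.normalize`; the `normalized_equal` helper is a hypothetical name that mirrors what the Normalize and Compare mode is described as doing.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "caf\u00E9")  # é as one code point
nfd = unicodedata.normalize("NFD", "caf\u00E9")  # e followed by U+0301

print(nfc == nfd)                                        # False
print(len(nfc), len(nfd))                                # 4 5 (code points)
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 5 6 (bytes)

def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """True when the strings match after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(normalized_equal(nfc, nfd))  # True: only the normalization form differs
```

If `normalized_equal` returns True while `==` returns False, normalizing both sides at the point of comparison or storage is usually the fix.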

Detecting Lookalike Characters

Confusable characters — visually similar characters from different scripts — are a common source of both accidental errors and deliberate security attacks. The Text Diff Tool identifies them by code point, which reveals the underlying identity even when the rendering is identical.

Classic examples of confusable pairs:

Pair	Code Points	Scripts	How They Look
A / А	U+0041 / U+0410	Latin / Cyrillic	A А
o / о	U+006F / U+043E	Latin / Cyrillic	o о
l / І	U+006C / U+0406	Latin / Cyrillic (Byelorussian-Ukrainian I)	l І
1 / ١	U+0031 / U+0661	ASCII digit / Arabic-Indic digit	1 ١
. / ⸼	U+002E / U+2E3C	ASCII punctuation / Supplemental Punctuation	. ⸼

When two such characters appear in the same position across Text A and Text B, the substitution detail panel shows their different scripts clearly — Latin versus Cyrillic, for example — making the confusable nature explicit. The panel also flags when two characters are on the Unicode Confusables list.

This is useful for security review of identifiers, usernames, domain names, or any context where lookalike substitution is a concern.
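
For a quick programmatic check, a rough heuristic is to look for letters whose script differs from the rest of the identifier. Python's standard library does not expose the Unicode Script property, so the sketch below uses the first word of each character's Unicode name as a stand-in. That is an assumption that works for common cases such as Latin versus Cyrillic, and it is not a substitute for the Unicode Confusables data.

```python
import unicodedata

def script_hint(ch: str) -> str:
    """Rough script guess: the first word of the character's Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def mixed_script_positions(text: str) -> list[tuple[int, str, str]]:
    """Return (index, char, hint) for letters whose hint differs from the first letter's."""
    letters = [(i, ch, script_hint(ch)) for i, ch in enumerate(text) if ch.isalpha()]
    if not letters:
        return []
    baseline = letters[0][2]
    return [item for item in letters if item[2] != baseline]

print(mixed_script_positions("paypal"))       # []
print(mixed_script_positions("p\u0430ypal"))  # flags the Cyrillic letter at index 1
```

An identifier written entirely in one non-Latin script passes this check, so it only catches mixing, not every lookalike.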

Comparing Encodings

The Text Diff Tool includes an Encoding Preview mode that shows both strings side by side as byte sequences rather than as character sequences. Select the encoding from a dropdown (UTF-8, UTF-16 LE, UTF-16 BE, Latin-1) and the view switches from the character tile grid to a hex byte display.

Differences between the byte representations are highlighted with the same color scheme. This mode is useful when:

You have a string from a system you suspect is using Latin-1 instead of UTF-8 and you want to see where the byte sequences diverge
You need to verify that two encoding pathways produce identical bytes
You're debugging a mojibake issue and want to see which specific bytes were misinterpreted

Combining the character tile view (to see what the characters are) with the encoding byte view (to see how they're stored) provides a complete picture of both the logical and physical representations of the strings.

Practical Use Cases

Debugging "Identical" Strings That Don't Match in Code

Your code has a string comparison that should succeed — both strings display the same text — but it returns false. Paste both strings into the Text Diff Tool. Look for yellow tiles (substitutions) and labeled invisible character tiles. Almost every case of this bug is caused by one of these: a different space type, an invisible character in one string, or a normalization difference. The tool identifies the exact character position and code point that differs.
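
When you need the same answer inside a test or a log line, a small helper can report the first position where two strings diverge. This is an illustrative sketch; `first_difference` is a hypothetical name and it compares position by position without any alignment.

```python
from itertools import zip_longest

def _label(ch):
    """Format a character as U+XXXX, or mark the end of a shorter string."""
    return "<end>" if ch is None else f"U+{ord(ch):04X}"

def first_difference(a: str, b: str):
    """Return (index, code point in a, code point in b) or None if equal."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, _label(x), _label(y)
    return None

print(first_difference("id 42", "id\u00A042"))  # (2, 'U+0020', 'U+00A0')
print(first_difference("abc", "abc"))           # None
```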

Finding Copy-Paste Corruption

A writer has pasted content from a Word document into a web CMS and the resulting HTML contains odd characters. Paste the original Word text into Text A and the content extracted from the CMS into Text B. The diff will show exactly what characters were added, removed, or substituted during the paste. Common culprits: smart quotes replaced with their code equivalents, em dashes split into two hyphens, non-breaking spaces introduced, or zero-width joiners added by the word processor.

Detecting Homoglyph Substitution

You're reviewing a username or domain name that someone has flagged as suspicious. It looks like a legitimate name but doesn't match your records. Paste the legitimate name into Text A and the suspect name into Text B. If homoglyphs are present, the substitution tiles will appear at those positions, and the detail panel will show the different scripts — Latin versus Cyrillic is the classic attack vector.

Validating Localization String Updates

A translator has returned an updated string file. You want to verify that only the content changed — not the surrounding whitespace, punctuation marks, or invisible formatting characters. Paste the original string into Text A and the translation into Text B, then filter the diff view to show only non-letter differences. Any unexpected invisible characters, different quotation mark types, or whitespace changes will appear highlighted.

Comparing Encoding Pipeline Output

You're testing two encoding pathways — perhaps a new serialization library versus your existing one — and want to confirm they produce identical output. Paste the output of each into Text A and Text B. If the diff is clean (all gray, no differences), the pipelines are equivalent for that test case.

Summary Metrics

Above the character tile comparison, the tool displays summary metrics for quick assessment:

Total differences: The number of positions where the strings differ
Substitutions: Positions where different code points appear
Additions: Code points in B not in A
Deletions: Code points in A not in B
Invisible differences: Differences involving invisible or format characters specifically
Normalized-equal: Whether the strings match after NFC normalization

These metrics let you categorize a difference at a glance before diving into the tile-level detail. A high invisible-difference count points to a copy-paste corruption issue. Substitutions with no additions or deletions suggest a lookalike character or space-type issue. A normalization difference usually changes the number of code points as well, so check the Normalized-equal metric to confirm one. Additions and deletions without substitutions are straightforward character presence differences.
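
A few of these metrics can be approximated with standard-library Python. The sketch below is an assumption about how such counts might be derived, not the tool's documented method: it aligns the strings with `difflib`, pairs up code points inside each replaced run as substitutions, and counts any leftover as additions or deletions.

```python
import unicodedata
from difflib import SequenceMatcher

def summary(a: str, b: str) -> dict:
    """Count substitutions, additions, and deletions between two strings."""
    subs = adds = dels = 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        len_a, len_b = i2 - i1, j2 - j1
        if tag == "replace":
            paired = min(len_a, len_b)  # positions occupied in both strings
            subs += paired
            dels += len_a - paired
            adds += len_b - paired
        elif tag == "delete":
            dels += len_a
        elif tag == "insert":
            adds += len_b
    return {
        "total": subs + adds + dels,
        "substitutions": subs,
        "additions": adds,
        "deletions": dels,
        "normalized_equal": unicodedata.normalize("NFC", a)
        == unicodedata.normalize("NFC", b),
    }

# café in NFC versus café in NFD
print(summary("caf\u00E9", "cafe\u0301"))
```

For the NFC versus NFD pair, this reports one substitution and one addition with `normalized_equal` set to True, which matches the one-tile-versus-two-tiles picture described earlier.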

For deeper investigation of any individual character found by the diff, use the Character Analyzer at /tools/character-counter/ to see its full Unicode properties, or the Encoding Converter at /tools/encoding-converter/ to see its complete encoding representation in all formats.