SymbolFYI

Mojibake: Why Text Turns to Garbage and How to Fix It

You open a file and see Ã© where é should be. Your database shows å¥³ instead of 女. An API response renders â€œ in place of ". This is mojibake — the Japanese word for "character transformation" (文字化け) — and it's one of the most common and confusing problems in software that handles text.

Mojibake always has the same root cause: text was encoded with one encoding and decoded with a different one. The garbled result is deterministic and reversible. Once you understand the mechanism, you can diagnose any mojibake pattern and recover the original text.

The Mechanics of Mojibake

Every encoding maps byte values to characters. When bytes encoded as UTF-8 are interpreted as Latin-1 (ISO-8859-1), each byte is mapped to the Latin-1 character at that position. Because UTF-8 multi-byte sequences use bytes in the 0x80–0xFF range, and Latin-1 maps those same bytes to specific characters, the result is recognizable garbage rather than random noise.
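The mechanism takes two lines of Python to see: the bytes never change, only the decoding step does.

```python
# é (U+00E9) becomes two bytes in UTF-8; decoding those same bytes
# as Latin-1 maps each byte to its own character.
utf8_bytes = 'é'.encode('utf-8')
print(utf8_bytes)                    # b'\xc3\xa9'
print(utf8_bytes.decode('latin-1'))  # Ã©
```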

Pattern 1: UTF-8 Read as Latin-1

This is the most common mojibake pattern in web applications. A character like é (U+00E9) encodes in UTF-8 as bytes 0xC3 0xA9. Latin-1 maps:

  • 0xC3 → Ã (U+00C3, Latin capital A with tilde)
  • 0xA9 → © (U+00A9, copyright sign)

So é becomes Ã©. Here is the full pattern for common accented characters:

Original       UTF-8 bytes   Read as Latin-1
é              C3 A9         Ã©
è              C3 A8         Ã¨
ü              C3 BC         Ã¼
ñ              C3 B1         Ã±
€              E2 82 AC      â‚¬
" (U+201C)     E2 80 9C      â€œ
" (U+201D)     E2 80 9D      â€
— (em dash)    E2 80 94      â€"

(Bytes in the 0x80–0x9F range are undefined in strict Latin-1; the last four rows show the Windows-1252 glyphs that browsers substitute in practice. 0x9D has no glyph even in Windows-1252, which is why the U+201D row shows only two visible characters.)

If you see Ã followed by a character in the ©®¼½¾ range, you're almost certainly looking at UTF-8 decoded as Latin-1.
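That heuristic is easy to automate. Here is a rough sketch of a detector (illustrative, not exhaustive): it flags Ã, Ä, or Å (lead bytes 0xC3–0xC5 read as Latin-1) followed by a printable character in the continuation-byte range.

```python
import re

# Continuation bytes 0x80–0xBF land on U+0080–U+00BF in Latin-1;
# the printable ones (0xA0–0xBF) include ©, ®, ¼, ½, ¾ and friends.
SUSPECT = re.compile('[ÃÄÅ][\u00a0-\u00bf]')

print(bool(SUSPECT.search('CafÃ©')))  # True: likely UTF-8 read as Latin-1
print(bool(SUSPECT.search('Café')))   # False
```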

Pattern 2: Latin-1 Read as UTF-8

The reverse is less common but more destructive. A lone Latin-1 byte in the 0x80–0xFF range is never a complete UTF-8 sequence. A Latin-1 é is a single byte 0xE9, which in UTF-8 would be the lead byte of a 3-byte sequence. Without the expected continuation bytes following it, the UTF-8 decoder either raises an error or substitutes the replacement character U+FFFD (displayed as ? or �).

# Latin-1 byte read as UTF-8
>>> b'\xe9'.decode('latin-1')
'é'
>>> b'\xe9'.decode('utf-8', errors='replace')
'�'
>>> b'\xe9'.decode('utf-8', errors='strict')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 ...

This pattern appears when a database column stores Latin-1 data and an application tries to display it as UTF-8. The original data becomes unrecoverable if the replacement character � is stored rather than the original bytes.

Pattern 3: Double Encoding

Double encoding occurs when already-encoded text is encoded again. UTF-8 text stored in a database with latin1 character set is particularly prone to this. The application encodes é to UTF-8 bytes 0xC3 0xA9, the database interprets those as Latin-1 characters é, stores them as two code points, and then those two code points are encoded again as UTF-8 on retrieval.

The visible result: é becomes the two code points Ã© (U+00C3 U+00A9) in memory, or something like Ã‚© on screen after yet another misread: two characters where one should be, and they look like Latin-1 mojibake because that's exactly what they are.
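The whole pipeline can be reproduced in a few lines of Python; each round trip through the wrong encoding adds a layer.

```python
# One misread: the UTF-8 bytes of 'é' decoded as Latin-1.
once = 'é'.encode('utf-8').decode('latin-1')
print(once)  # Ã© (two code points)

# Double encoding: the mojibake is itself encoded as UTF-8
# and misread as Latin-1 again, doubling the damage.
twice = once.encode('utf-8').decode('latin-1')
print(len('é'), len(once), len(twice))  # 1 2 4
```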

Diagnosing Mojibake

The fastest diagnostic approach is to look at the raw bytes and match them against known patterns.

def diagnose_mojibake(text: str) -> dict[str, str]:
    """Try to recover original text from common mojibake patterns."""
    results = {}

    # Pattern: UTF-8 was read as Latin-1
    # Re-encode as Latin-1 to get the original bytes, then decode as UTF-8
    try:
        original_bytes = text.encode('latin-1')
        recovered = original_bytes.decode('utf-8')
        results['utf8_as_latin1'] = recovered
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass

    # Pattern: Latin-1 was read as UTF-8 (data is lost if replacement chars used)
    try:
        original_bytes = text.encode('utf-8')
        recovered = original_bytes.decode('latin-1')
        results['latin1_as_utf8'] = recovered
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass

    # Pattern: Windows-1252 read as Latin-1
    try:
        original_bytes = text.encode('latin-1')
        recovered = original_bytes.decode('cp1252')
        results['cp1252_as_latin1'] = recovered
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass

    return results

# Example usage
garbled = "CafÃ©"
print(diagnose_mojibake(garbled))
# {'utf8_as_latin1': 'Café', ...}

The Python library ftfy (fixes text for you) automates this diagnosis for many common patterns:

import ftfy

print(ftfy.fix_text("CafÃ©"))            # → Café
print(ftfy.fix_text("â€œHelloâ€\x9d"))   # → "Hello"
print(ftfy.fix_text("Ã¼ber"))            # → über

ftfy uses a scoring heuristic to determine whether a fix makes the text "more like natural language" — it won't blindly apply transformations that make things worse.

For shell diagnostics, file -i reports detected encoding:

file -i document.txt
# document.txt: text/plain; charset=utf-8
# document.txt: text/plain; charset=iso-8859-1

# Check raw bytes around a suspicious character
hexdump -C document.txt | grep -A2 -B2 "c3"

Fixing Mojibake in Files

For individual files, the iconv command-line tool converts between encodings:

# Convert Latin-1 file to UTF-8
iconv -f latin1 -t utf-8 input.txt > output.txt

# Convert Windows-1252 to UTF-8
iconv -f cp1252 -t utf-8 input.txt > output.txt

# If you're unsure of the source encoding, try chardet first
python3 -c "import chardet; print(chardet.detect(open('input.txt','rb').read()))"
# {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

In Python, when you know the mismatch:

# File was read as wrong encoding; re-encode to get original bytes
def fix_utf8_read_as_latin1(garbled: str) -> str:
    """Fix text that was UTF-8 but decoded as Latin-1."""
    return garbled.encode('latin-1').decode('utf-8')

# Fix a file in-place
def fix_encoding_mismatch(filepath: str, wrong_encoding: str, correct_encoding: str) -> None:
    with open(filepath, encoding=wrong_encoding) as f:
        content = f.read()
    fixed = content.encode(wrong_encoding).decode(correct_encoding)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(fixed)

fix_encoding_mismatch('data.txt', 'latin-1', 'utf-8')

Fixing Mojibake in MySQL: The latin1 → utf8mb4 Migration

The most painful mojibake scenario in web development is a MySQL database where the connection charset was latin1 while the application was sending UTF-8. The database stored the raw UTF-8 bytes as if they were Latin-1 characters. The data looks correct in the application (because it sends and receives the same bytes) but is broken at the database level.
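The round trip is easy to simulate in Python; here 'latin-1' stands in for the misconfigured connection charset (a sketch of the mechanism, not of MySQL's internals).

```python
app_text = 'café'

# The application sends UTF-8 bytes over a latin1 connection...
wire = app_text.encode('utf-8')
stored = wire.decode('latin-1')   # what the column actually holds

# ...and gets the same bytes back, so everything looks fine to the app.
round_trip = stored.encode('latin-1').decode('utf-8')
print(round_trip)  # café
print(stored)      # cafÃ©, which is what SELECT shows at the database level
```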

Step 1: Verify the problem

-- Check what the database thinks it's storing
SELECT HEX(column_name), column_name FROM your_table LIMIT 5;
-- If you see E28099 for an apostrophe, it's UTF-8 bytes stored in latin1 columns

-- Check connection and table charsets
SHOW VARIABLES LIKE 'character_set%';
SHOW CREATE TABLE your_table;

Step 2: Convert without double-encoding

The naive approach (ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4) re-interprets the bytes, which corrupts already-correct data. The correct approach:

-- 1. Change column type to BLOB first (preserves raw bytes)
ALTER TABLE posts MODIFY body BLOB;

-- 2. Now convert BLOB to utf8mb4 (MySQL reads bytes as utf8mb4)
ALTER TABLE posts MODIFY body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

This two-step process works because the BLOB conversion preserves the raw byte sequence, and then the TEXT conversion interprets those bytes as UTF-8.
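Why the naive conversion corrupts data can be sketched in Python, with 'latin-1' and 'utf-8' standing in for MySQL's latin1 and utf8mb4: converting "by characters" re-encodes the mojibake characters, while the BLOB route keeps the bytes and only relabels them.

```python
stored = 'cafÃ©'   # a latin1 column holding raw UTF-8 bytes

# ALTER ... CONVERT TO: keeps the characters, changes the bytes (double encoding)
naive = stored.encode('utf-8')
print(naive)       # b'caf\xc3\x83\xc2\xa9'

# BLOB two-step: keeps the bytes, changes the label (correct text)
via_blob = stored.encode('latin-1').decode('utf-8')
print(via_blob)    # café
```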

Step 3: Fix the connection charset

# settings.py (Django)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'OPTIONS': {
            'charset': 'utf8mb4',
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        },
    }
}
# my.cnf
[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

HTTP Headers and HTML Meta Charset

A significant source of mojibake in web applications is the mismatch between the declared encoding and the actual encoding.

The encoding priority order in browsers:

  1. BOM at the start of the document (highest priority)
  2. HTTP Content-Type header
  3. <meta charset> or <meta http-equiv="Content-Type"> tag
  4. Browser's encoding detection heuristics

Always declare the encoding in the HTTP header; apart from a BOM, it overrides everything else:

Content-Type: text/html; charset=utf-8

And in the HTML, within the first 1024 bytes (before the parser might give up):

<meta charset="UTF-8">

The <meta charset> must appear before any characters outside ASCII to take effect. Browsers read the document in chunks, and if they encounter non-ASCII bytes before finding the charset declaration, they may have already committed to a wrong encoding.

For Django:

# Django sets Content-Type: text/html; charset=utf-8 by default
# Verify it's not being overridden anywhere
# In a view:
from django.http import HttpResponse
response = HttpResponse(content, content_type='text/html; charset=utf-8')

Preventing Mojibake at System Boundaries

Most mojibake is preventable by being explicit at every system boundary where text crosses:

File I/O: Always specify encoding:

open('file.txt', encoding='utf-8')          # reading
open('file.txt', 'w', encoding='utf-8')     # writing

Database: Set charset in connection string, not just table definition:

DATABASE_URL=mysql://user:pass@host/db?charset=utf8mb4

HTTP clients: Specify encoding when decoding responses:

import requests
response = requests.get(url)
response.encoding = 'utf-8'   # Override detected encoding
text = response.text

# Or use content for raw bytes and decode manually
text = response.content.decode('utf-8')

CSV/Excel: Python's csv module relies on the file's encoding; Excel often produces UTF-8-BOM or Windows-1252:

import csv
# Use utf-8-sig to handle optional BOM
with open('data.csv', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)

APIs: When receiving JSON, the Content-Type should include the charset. RFC 8259 requires UTF-8 for JSON exchanged between systems, but if you're parsing raw bytes, always decode first:

import json
data = json.loads(response.content.decode('utf-8'))
# Not: json.loads(response.text)  # relies on correct encoding detection

The Diagnostic Toolkit

When you encounter mojibake and aren't sure of the encoding, use our Encoding Converter to paste the garbled text and inspect what bytes are present. The tool shows the UTF-8, Latin-1, and Windows-1252 interpretations of the underlying bytes, making it easy to identify which mismatch occurred.

The general diagnostic algorithm:

  1. Get the raw bytes (use hexdump, .encode('latin-1'), or inspect the source)
  2. Look for patterns in the 0xC3–0xC5 range (UTF-8 lead bytes for Latin Extended)
  3. Try ftfy.fix_text() for automated repair
  4. If the data is in a database, check HEX(column) to see the actual stored bytes
  5. Check every boundary: file read, DB connection, HTTP header, HTML meta tag

The key insight: mojibake is always reversible if you still have the garbled text in the wrong encoding. The bytes haven't changed — only the interpretation has. As long as no data has been discarded (replacement characters substituted, truncation at invalid sequences), you can recover the original.
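Both halves of that claim can be checked directly with the patterns from earlier.

```python
# Reversible: the bytes survived, only the label was wrong.
garbled = 'CafÃ©'
print(garbled.encode('latin-1').decode('utf-8'))  # Café

# Not reversible: a replacement character destroyed the original byte.
lossy = b'caf\xe9'.decode('utf-8', errors='replace')  # 'caf�'
try:
    lossy.encode('latin-1')
except UnicodeEncodeError:
    print('0xE9 is gone: U+FFFD has no Latin-1 encoding')
```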


Next in Series: Character Encoding Detection: How Browsers and Tools Guess Your Encoding — understanding the algorithms that determine encoding when it isn't declared.
