Mojibake (Japanese 文字化け, roughly 'character transformation') is the garbled, unreadable text that results from decoding a sequence of bytes using the wrong character encoding. Instead of the intended characters, the reader sees nonsensical symbols, boxes, or a mix of characters from an unrelated script. Mojibake is one of the most common and frustrating text-handling bugs in software.
How Mojibake Occurs
Every text file is ultimately a sequence of bytes. When you open a file or receive text over a network, software must choose an encoding to interpret those bytes as characters. If the chosen encoding does not match the encoding used when the bytes were written, the byte-to-character mapping is incorrect, producing mojibake.
Example: A Japanese word encoded as Shift-JIS bytes but decoded as Latin-1 produces a jumble of Latin symbols and control characters instead of the intended Japanese text.
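A minimal sketch of that mismatch, using the Python standard library's shift_jis and latin-1 codecs:

```python
# Encode a Japanese word as Shift-JIS, then mis-decode the bytes as Latin-1
word = '文字化け'  # "mojibake" itself
sjis_bytes = word.encode('shift_jis')
garbled = sjis_bytes.decode('latin-1')
print(garbled)  # a jumble of Latin-1 symbols and control characters

# Decoding with the codec that was actually used restores the original
assert sjis_bytes.decode('shift_jis') == word
```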
Common Mojibake Scenarios
| Intended Encoding | Decoded As | Typical Symptom |
|---|---|---|
| UTF-8 | Latin-1 | Ã© instead of é |
| Latin-1 | UTF-8 | Replacement characters or decode errors |
| Shift-JIS | UTF-8 | Random Latin symbols and question-mark boxes |
| UTF-8 | Windows-1252 | â€œ and â€ instead of curly quotes, â€™ instead of ’ |
| UTF-8 with BOM | ASCII or Latin-1 | Stray ï»¿ at the start of the file |
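The BOM row is easy to reproduce: the UTF-8 byte order mark is the three bytes 0xEF 0xBB 0xBF, which Latin-1 renders as three stray characters.

```python
# The UTF-8 byte order mark decoded with the wrong codec
bom = b'\xef\xbb\xbf'
print(bom.decode('latin-1'))  # 'ï»¿', the stray characters seen at file start
print((bom + 'hello'.encode('utf-8')).decode('latin-1'))  # 'ï»¿hello'
```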
The Ã© pattern for é is particularly common on legacy websites. The character é (U+00E9) encodes in UTF-8 as the bytes 0xC3 0xA9. When those bytes are decoded as Latin-1, 0xC3 maps to Ã (capital A with tilde) and 0xA9 maps to © (the copyright sign), yielding Ã©.
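That byte-by-byte account can be checked directly:

```python
# Inspect the UTF-8 bytes of é, then decode each one as Latin-1
b = 'é'.encode('utf-8')
print(b.hex())                          # 'c3a9'
print(bytes([0xC3]).decode('latin-1'))  # 'Ã'
print(bytes([0xA9]).decode('latin-1'))  # '©'
print(b.decode('latin-1'))              # 'Ã©'
```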
Diagnosing Mojibake
# Simulate mojibake: encode as UTF-8, decode as Latin-1
original = 'caf\u00e9'  # 'café'
utf8_bytes = original.encode('utf-8')
mojibake = utf8_bytes.decode('latin-1')
print(mojibake)  # 'cafÃ©'

# Reverse: if you have mojibake, re-encode with the wrong codec
# and decode with the right one
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)  # 'café'
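The reverse trick only works when the text really is mojibake for that codec pair; on ordinary text the inner decode can raise. A hedged helper (try_fix is a hypothetical name, and the Latin-1/UTF-8 pair is an assumption) can fall back to returning the input unchanged:

```python
def try_fix(text, wrong='latin-1', right='utf-8'):
    """Attempt to reverse a single wrong-codec decode; leave text alone on failure."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(try_fix('cafÃ©'))  # 'café'
print(try_fix('naïve'))  # unchanged: b'na\xefve' is not valid UTF-8
```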
// Node.js: demonstrating the mismatch
const { Buffer } = require('buffer');
const original = 'caf\u00e9';
const utf8Buf = Buffer.from(original, 'utf8');
// Read as latin1 (wrong encoding)
const mojibake = utf8Buf.toString('latin1');
console.log(mojibake); // garbled: 'cafÃ©'
// Re-encode as latin1 then decode as utf8 to fix
const fixed = Buffer.from(mojibake, 'latin1').toString('utf8');
console.log(fixed); // 'café'
Fixing Double-Encoding Mojibake
A particularly tricky form of mojibake occurs when text is encoded twice. If UTF-8 bytes are mis-decoded as Latin-1 and the result is re-encoded as UTF-8, as commonly happens when UTF-8 text is stored in a database column configured as Latin-1 and then read back, the text is double-encoded. The ftfy ("fixes text for you") Python library is specifically designed to detect and reverse many common mojibake transformations:
import ftfy
print(ftfy.fix_text('CafÃ©'))  # 'Café'
print(ftfy.fix_text('\u00e2\u20ac\u2122'))  # 'â€™' becomes '’' (right single quote)
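What ftfy undoes can be simulated in plain Python: each pass of wrong decoding plus re-encoding adds a layer, and each layer has to be peeled off in order. A sketch, assuming the same Latin-1/UTF-8 pair as above:

```python
# One layer of mojibake, then a second layer on top of it
once = 'é'.encode('utf-8').decode('latin-1')    # 'Ã©'
twice = once.encode('utf-8').decode('latin-1')  # doubly garbled
# Undo the layers in order, newest first
fixed = twice.encode('latin-1').decode('utf-8').encode('latin-1').decode('utf-8')
print(fixed)  # 'é'
```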
Common Mojibake Fingerprints
| Pattern | Cause |
|---|---|
| Ã© for é | UTF-8 read as Latin-1 |
| â€™ for ’ | UTF-8 read as Windows-1252 |
| â€œ / â€ for curly quotes | UTF-8 read as Windows-1252 |
| Ã¼ for ü | UTF-8 read as Latin-1 |
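A fingerprint table like this lends itself to a simple scanner. The snippet below is an illustrative heuristic, not ftfy's actual algorithm; the pattern list and the suspect_mojibake name are made up for the example:

```python
# Known mojibake fingerprints mapped to their likely cause
FINGERPRINTS = {
    'Ã©': 'UTF-8 read as Latin-1',
    'Ã¼': 'UTF-8 read as Latin-1',
    'â€™': 'UTF-8 read as Windows-1252',
    'â€œ': 'UTF-8 read as Windows-1252',
    'ï»¿': 'UTF-8 BOM read as Latin-1',
}

def suspect_mojibake(text):
    """Return the likely causes of any known fingerprints found in text."""
    return sorted({cause for pattern, cause in FINGERPRINTS.items() if pattern in text})

print(suspect_mojibake('CafÃ©'))  # ['UTF-8 read as Latin-1']
```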
Prevention
The only reliable prevention is consistent, explicit encoding declarations at every boundary: file creation, database storage (column collation and connection charset), HTTP headers, and HTML meta tags. Using UTF-8 throughout the entire stack eliminates most encoding mismatches, since UTF-8 is universally supported and is the de facto standard for modern web applications.
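In Python, "explicit at every boundary" mostly means passing encoding= instead of relying on the platform default. A minimal sketch using a temporary file:

```python
import os
import tempfile

# Always name the encoding when writing and reading text files
path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('café')
with open(path, encoding='utf-8') as f:
    print(f.read())  # 'café', regardless of the platform's default encoding
```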