Mojibake (Japanese 文字化け, roughly 'character transformation') is the garbled, unreadable text that results from decoding a sequence of bytes using the wrong character encoding. Instead of the intended characters, the reader sees nonsensical symbols, boxes, or a mix of characters from an unrelated script. Mojibake is one of the most common and frustrating text-handling bugs in software.
How Mojibake Occurs
Every text file is ultimately a sequence of bytes. When you open a file or receive text over a network, software must choose an encoding to interpret those bytes as characters. If the chosen encoding does not match the encoding used when the bytes were written, the byte-to-character mapping is incorrect, producing mojibake.
Example: A Japanese word encoded as Shift-JIS bytes but decoded as Latin-1 produces a jumble of Latin symbols and control characters instead of the intended Japanese text.
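A minimal sketch of that mismatch, using the Python standard library's shift_jis and latin-1 codecs:

```python
# Encode a Japanese word as Shift-JIS, then mis-decode the bytes as Latin-1
word = '文字化け'  # "mojibake" itself
sjis_bytes = word.encode('shift_jis')
garbled = sjis_bytes.decode('latin-1')
print(garbled)  # a jumble of Latin-1 symbols and control characters

# Decoding with the codec that was actually used restores the original
assert sjis_bytes.decode('shift_jis') == word
```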
Common Mojibake Scenarios
| Intended Encoding | Decoded As | Typical Symptom |
|---|---|---|
| UTF-8 | Latin-1 | Ã© instead of é |
| Latin-1 | UTF-8 | Replacement characters or decode errors |
| Shift-JIS | UTF-8 | Random Latin symbols and question-mark boxes |
| UTF-8 | Windows-1252 | â€œ and â€ instead of curly quotes, â€™ instead of ’ |
| UTF-8 with BOM | ASCII or Latin-1 | Stray ï»¿ at the start of the file |
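The BOM row is easy to reproduce: the UTF-8 byte order mark is the three bytes 0xEF 0xBB 0xBF, which Latin-1 renders as three stray characters.

```python
# The UTF-8 byte order mark decoded with the wrong codec
bom = b'\xef\xbb\xbf'
print(bom.decode('latin-1'))  # 'ï»¿', the stray characters seen at file start
print((bom + 'hello'.encode('utf-8')).decode('latin-1'))  # 'ï»¿hello'
```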
The Ã© pattern for é is particularly common on legacy websites. The character é (U+00E9) encodes in UTF-8 as the bytes 0xC3 0xA9. When those bytes are decoded as Latin-1, 0xC3 maps to Ã (capital A with tilde) and 0xA9 maps to © (the copyright sign), yielding Ã©.
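That byte-by-byte account can be checked directly:

```python
# Inspect the UTF-8 bytes of é, then decode each one as Latin-1
b = 'é'.encode('utf-8')
print(b.hex())                          # 'c3a9'
print(bytes([0xC3]).decode('latin-1'))  # 'Ã'
print(bytes([0xA9]).decode('latin-1'))  # '©'
print(b.decode('latin-1'))              # 'Ã©'
```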
Diagnosing Mojibake
# Simulate mojibake: encode as UTF-8, decode as Latin-1
original = 'caf\u00e9'  # 'café'
utf8_bytes = original.encode('utf-8')
mojibake = utf8_bytes.decode('latin-1')
print(mojibake)  # 'cafÃ©'

# Reverse: if you have mojibake, re-encode with the wrong codec
# and decode with the right one
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)  # 'café'
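The reverse trick only works when the text really is mojibake for that codec pair; on ordinary text the inner decode can raise. A hedged helper (try_fix is a hypothetical name, and the Latin-1/UTF-8 pair is an assumption) can fall back to returning the input unchanged:

```python
def try_fix(text, wrong='latin-1', right='utf-8'):
    """Attempt to reverse a single wrong-codec decode; leave text alone on failure."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(try_fix('cafÃ©'))  # 'café'
print(try_fix('naïve'))  # unchanged: b'na\xefve' is not valid UTF-8
```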
// Node.js: demonstrating the mismatch
const { Buffer } = require('buffer');
const original = 'caf\u00e9';
const utf8Buf = Buffer.from(original, 'utf8');
// Read as latin1 (wrong encoding)
const mojibake = utf8Buf.toString('latin1');
console.log(mojibake); // garbled: 'cafÃ©'
// Re-encode as latin1 then decode as utf8 to fix
const fixed = Buffer.from(mojibake, 'latin1').toString('utf8');
console.log(fixed); // 'café'
Fixing Double-Encoding Mojibake
A particularly tricky form of mojibake occurs when text is encoded twice. If UTF-8 bytes are mis-decoded as Latin-1 and the result is re-encoded as UTF-8, as commonly happens when UTF-8 text is stored in a database column configured as Latin-1 and then read back, the text is double-encoded. The ftfy ("fixes text for you") Python library is specifically designed to detect and reverse many common mojibake transformations:
import ftfy
print(ftfy.fix_text('CafÃ©'))  # 'Café'
print(ftfy.fix_text('\u00e2\u20ac\u2122'))  # 'â€™' becomes '’' (right single quote)
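What ftfy undoes can be simulated in plain Python: each pass of wrong decoding plus re-encoding adds a layer, and each layer has to be peeled off in order. A sketch, assuming the same Latin-1/UTF-8 pair as above:

```python
# One layer of mojibake, then a second layer on top of it
once = 'é'.encode('utf-8').decode('latin-1')    # 'Ã©'
twice = once.encode('utf-8').decode('latin-1')  # doubly garbled
# Undo the layers in order, newest first
fixed = twice.encode('latin-1').decode('utf-8').encode('latin-1').decode('utf-8')
print(fixed)  # 'é'
```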
Common Mojibake Fingerprints
| Pattern | Cause |
|---|---|
| Ã© for é | UTF-8 read as Latin-1 |
| â€™ for ’ | UTF-8 read as Windows-1252 |
| â€œ / â€ for curly quotes | UTF-8 read as Windows-1252 |
| Ã¼ for ü | UTF-8 read as Latin-1 |
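A fingerprint table like this lends itself to a simple scanner. The snippet below is an illustrative heuristic, not ftfy's actual algorithm; the pattern list and the suspect_mojibake name are made up for the example:

```python
# Known mojibake fingerprints mapped to their likely cause
FINGERPRINTS = {
    'Ã©': 'UTF-8 read as Latin-1',
    'Ã¼': 'UTF-8 read as Latin-1',
    'â€™': 'UTF-8 read as Windows-1252',
    'â€œ': 'UTF-8 read as Windows-1252',
    'ï»¿': 'UTF-8 BOM read as Latin-1',
}

def suspect_mojibake(text):
    """Return the likely causes of any known fingerprints found in text."""
    return sorted({cause for pattern, cause in FINGERPRINTS.items() if pattern in text})

print(suspect_mojibake('CafÃ©'))  # ['UTF-8 read as Latin-1']
```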
Prevention
The only reliable prevention is consistent, explicit encoding declarations at every boundary: file creation, database storage (column collation and connection charset), HTTP headers, and HTML meta tags. Using UTF-8 throughout the entire stack eliminates most encoding mismatches, since UTF-8 is universally supported and is the de facto standard for modern web applications.
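In Python, "explicit at every boundary" mostly means passing encoding= instead of relying on the platform default. A minimal sketch using a temporary file:

```python
import os
import tempfile

# Always name the encoding when writing and reading text files
path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('café')
with open(path, encoding='utf-8') as f:
    print(f.read())  # 'café', regardless of the platform's default encoding
```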