Encoding Survival Guide
A 6-part practical series on character encoding — UTF-8 byte structure, mojibake diagnosis, encoding detection, and the Unicode sandwich.
1. UTF-8: The Complete Guide to the Web's Dominant Encoding
Everything about UTF-8 — how it works, why it won, byte patterns, BOM handling, validation, and common pitfalls for developers.
2. Mojibake: Why Text Turns to Garbage and How to Fix It
Understand mojibake — garbled text from encoding mismatches. Learn to diagnose, fix, and prevent encoding errors in files, databases, and web applications.
3. Character Encoding Detection: How Browsers and Tools Guess Your Encoding
How encoding detection works — the algorithm browsers use, statistical detectors like chardet, BOM sniffing, and why detection is never 100% reliable.
4. UTF-16 and Surrogate Pairs: Why JavaScript Strings Are Complicated
Understand UTF-16 encoding and surrogate pairs: why many emoji have a .length of 2 or more in JavaScript, how to handle supplementary characters, and when UTF-16 matters.
5. Legacy Encodings: Latin-1, Windows-1252, Shift-JIS, and When You Still Need Them
A practical guide to legacy character encodings — when you'll encounter Latin-1, Windows-1252, Shift-JIS, EUC-KR, and how to convert them to UTF-8.
6. Punycode and IDN: How Unicode Domain Names Work
How Internationalized Domain Names work — Punycode encoding, IDNA 2003 vs 2008, homograph attacks, and implementing IDN support in your applications.