Legacy Encodings: Latin-1, Windows-1252, Shift-JIS, and When You Still Need Them

Legacy encodings never fully go away. Your application might be pure UTF-8 from end to end, but eventually you'll receive a CSV from a government agency encoded in Windows-1252, an email attachment in ISO-2022-JP, a database backup from a 2003 system in Latin-1, or an API response from a Korean payment processor in EUC-KR. Understanding these encodings — what they are, where they came from, and how to convert them — is a practical skill that saves debugging time when these files arrive.

Why Legacy Encodings Still Exist

The transition to UTF-8 is not complete and may never fully complete for certain domains:

Installed base: There are millions of files, databases, and systems encoded in pre-Unicode formats. Converting them all is expensive and risky. A financial institution's mainframe system might store customer records in EBCDIC — an IBM encoding that predates ASCII — because the cost of migration exceeds the cost of maintaining conversion shims.

Regulatory requirements: Some government systems require specific encodings for compliance. Japanese tax filings traditionally used Shift-JIS. Korean government web portals used EUC-KR. These requirements change slowly.

Email protocols: The MIME standard allows declaring any charset for email. Old email clients and servers sent messages in ISO-8859-1, Windows-1252, or regional encodings. You still encounter these when parsing email archives.

Hardware devices: POS terminals, industrial controllers, and embedded systems often have fixed character sets with no upgrade path.

Lazy defaults: Windows still defaults to system codepage (often CP1252 or CP932) for file operations in many contexts. A developer on Windows who doesn't specify encoding gets a legacy encoding, and that file gets sent to a Linux server expecting UTF-8.
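A quick way to see which default your own platform applies when no encoding is passed to `open()` (names and output vary by system — this is just a probe, not a fix):

```python
import locale

# The encoding Python uses for text files when none is specified —
# often 'cp1252' or 'cp932' on Windows, 'utf-8' on most Linux systems.
default = locale.getpreferredencoding(False)
print(default)
```

Passing `encoding='utf-8'` explicitly to every `open()` call sidesteps the problem entirely.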

Latin-1 (ISO-8859-1): The Foundation

ISO-8859-1, commonly called Latin-1, was the dominant encoding for Western European languages in the 1990s. It's a single-byte encoding where:

  • Bytes 0x00–0x7F: Identical to ASCII
  • Bytes 0x80–0x9F: C1 control characters (mostly unused in practice)
  • Bytes 0xA0–0xFF: Latin characters for Western European languages (À, Â, Ä, æ, ç, é, ñ, ü, etc.)

Latin-1 covers English, French, Spanish, German, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, and Finnish. It does not cover Polish, Czech, Hungarian, Romanian, or other Central/Eastern European languages.
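The coverage gap is easy to demonstrate: the Polish letter ł lives in ISO-8859-2 (Latin-2) but has no Latin-1 code point at all.

```python
# ł (Polish) is absent from Latin-1 but present in Latin-2 at 0xB3
assert "ł".encode('iso-8859-2') == b'\xb3'

try:
    "ł".encode('latin-1')
except UnicodeEncodeError:
    print("ł is not representable in Latin-1")
```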

The critical property: every byte value is a valid Latin-1 character. Latin-1 can't produce an invalid byte sequence, which makes it useful as a binary-safe transport encoding — and also means encoding detection can never definitively rule it out.

# Every byte 0-255 is valid Latin-1
for i in range(256):
    bytes([i]).decode('latin-1')  # Never raises

Windows-1252: The Practical Replacement

Windows-1252 (also called CP1252 or "ANSI" in some Windows contexts) is a Microsoft extension of Latin-1. It's almost identical — same bytes in the 0x00–0x7F and 0xA0–0xFF ranges — but replaces the mostly-useless C1 control characters (0x80–0x9F) with useful printable characters:

Hex   Latin-1   Windows-1252
0x80  Control   € (Euro sign)
0x91  Control   ‘ (Left single quotation mark)
0x92  Control   ’ (Right single quotation mark)
0x93  Control   “ (Left double quotation mark)
0x94  Control   ” (Right double quotation mark)
0x96  Control   – (En dash)
0x97  Control   — (Em dash)
0x99  Control   ™ (Trade mark sign)

This is why mojibake involving curly quotes, em dashes, and the euro sign is so common. Word processors default to Windows-1252, insert "smart quotes" in the 0x91–0x94 range, and those bytes either get decoded as Latin-1 (control characters → invisible) or as UTF-8 (invalid sequence → replacement characters).

The WHATWG Encoding Standard, which HTML and browsers follow, explicitly maps the iso-8859-1 label to Windows-1252 — recognizing that no real content uses the C1 control characters and that "latin-1" in HTTP headers almost always means Windows-1252 in practice.

# Detecting the difference
data = b'\x93Hello\x94'  # CP1252 curly quotes around "Hello"

data.decode('latin-1')    # '\x93Hello\x94'  — control characters
data.decode('cp1252')     # '"Hello"'        — correct!
data.decode('utf-8')      # UnicodeDecodeError

Rule of thumb: When you see "latin-1" or "iso-8859-1" in a file or email from a Windows system, try cp1252 first. It will decode curly quotes and em dashes correctly where latin-1 won't.
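A useful corollary: because Latin-1 decoding is lossless at the byte level, text that was already mis-decoded as latin-1 can often be repaired after the fact by round-tripping back to bytes and re-decoding with the right codec:

```python
# Bytes wrongly decoded as latin-1 can be recovered exactly, since
# latin-1 maps every byte to a character and back.
garbled = b'\x93Hello\x94'.decode('latin-1')        # invisible C1 controls
repaired = garbled.encode('latin-1').decode('cp1252')
print(repaired)  # “Hello”
```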

The ISO-8859 Family

Latin-1 was the first in a series of 15 ISO-8859 encodings, each covering a different language group. They're all single-byte extensions of ASCII:

Encoding Languages Notes
ISO-8859-1 (Latin-1) Western European Foundation encoding
ISO-8859-2 (Latin-2) Central European (Polish, Czech, Hungarian) Different from Latin-1 in 0xA0–0xFF
ISO-8859-3 (Latin-3) Maltese, Esperanto Rare
ISO-8859-4 (Latin-4) Baltic Mostly superseded by 8859-13
ISO-8859-5 (Cyrillic) Russian, Bulgarian Competed with KOI8-R
ISO-8859-6 (Arabic) Arabic Right-to-left not handled
ISO-8859-7 (Greek) Greek
ISO-8859-8 (Hebrew) Hebrew
ISO-8859-9 (Latin-5) Turkish Latin-1 with six Turkish letters (Ğ, İ, Ş, ğ, ı, ş) replacing the Icelandic ones
ISO-8859-13 (Latin-7) Baltic Replaces 8859-4
ISO-8859-15 (Latin-9) Western European + € Latin-1 + Euro sign

ISO-8859-15 (Latin-9) was created specifically to add the Euro sign (€) and French OE ligature (Œ/œ), filling the gap that forced Western European content to use Windows-1252. It's uncommon but correct.
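The difference is a single byte: ISO-8859-15 puts € at 0xA4 (replacing Latin-1's ¤), while Latin-1 simply has no Euro sign at all.

```python
# The Euro sign encodes in iso-8859-15 (at 0xA4) but not in latin-1
assert "€".encode('iso-8859-15') == b'\xa4'

try:
    "€".encode('latin-1')
except UnicodeEncodeError:
    print("no € in Latin-1")
```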

Shift-JIS: Japanese Encoding

Shift-JIS is the dominant legacy encoding for Japanese text, developed by Microsoft and ASCII Corporation in the 1980s. It's a variable-length encoding where:

  • Single-byte characters: ASCII (0x00–0x7E) and half-width katakana (0xA1–0xDF)
  • Double-byte characters: Most kanji and full-width characters

The "shift" in Shift-JIS refers to how the lead-byte ranges were shifted to fit around the single-byte half-width katakana range. The trade-off is that the second byte of a double-byte character can fall in the ASCII range (0x40–0x7E, including 0x5C, the backslash) — a clever hack that created awkward edge cases for text processing:

import codecs

# Shift-JIS encoded text
sjis_bytes = "日本語テスト".encode('shift-jis')
print(sjis_bytes.hex())  # 93fa967b8cea836583588367

# Decode back
sjis_bytes.decode('shift_jis')  # Works

# Python codec aliases
# 'shift_jis', 'shift-jis', 'sjis', 'csshiftjis', 's_jis'
# 'shift_jis_2004' — extends Shift-JIS with JIS X 0213
# 'cp932' — Microsoft's variant (common on Windows)
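The most notorious of those edge cases is the "0x5C problem": some characters have a trail byte equal to the ASCII backslash, so byte-level code that scans for path separators or escape characters splits the character in half.

```python
# 表 (U+8868) encodes to 0x95 0x5C — the second byte is ASCII backslash.
b = "表".encode('shift-jis')
assert b == b'\x95\x5c'
assert b'\\' in b  # naive backslash handling would corrupt this text
```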

Shift-JIS vs CP932: Microsoft's CP932 (also called "Windows-31J") extends standard Shift-JIS with additional characters including the NEC special characters and IBM extension characters. Many Japanese files labeled "Shift-JIS" are actually CP932. When in doubt, try cp932 first — it's a superset of Shift-JIS.

The encoding detection problem for Shift-JIS: because it's variable-length and the byte ranges partially overlap with ASCII, some byte sequences that are valid ASCII are also valid Shift-JIS. Statistical detection works reasonably well for longer texts with kanji but can misidentify short texts.

EUC-JP: Unix Japanese Encoding

EUC-JP (Extended Unix Code for Japanese) was the standard Japanese encoding on Unix systems before UTF-8. It's cleaner than Shift-JIS in design but less common on Windows:

  • ASCII bytes (0x00–0x7F) are single-byte, same as ASCII
  • Japanese characters use 2 or 3 bytes, all with high bits set (0x80+)
  • No overlap between ASCII and multi-byte sequences

text = "日本語"
text.encode('euc-jp').hex()   # c6fccbdcb8ec
text.encode('shift-jis').hex() # 93fa967b8cea
text.encode('utf-8').hex()    # e697a5e69cace8aa9e

EUC-JP is less common today but appears in older Unix system logs, some Linux Japanese locale files, and older web archives.
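The "no overlap" property above is what makes EUC-JP pleasant to process: any byte below 0x80 in an EUC-JP stream is guaranteed to be a literal ASCII character, unlike Shift-JIS where trail bytes can land in the ASCII range.

```python
# Every byte of a multibyte EUC-JP character has its high bit set,
# so filtering for bytes < 0x80 recovers exactly the ASCII portion.
mixed = "log/日本語.txt".encode('euc-jp')
ascii_only = bytes(b for b in mixed if b < 0x80)
assert ascii_only == b'log/.txt'
```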

Big5: Traditional Chinese

Big5 is the dominant encoding for Traditional Chinese (used in Taiwan, Hong Kong, Macau, and older overseas Chinese communities). It's a double-byte encoding covering around 13,000 Chinese characters:

text = "繁體中文"
text.encode('big5').hex()       # bc54c5e4a4a4a4e5
text.encode('utf-8').hex()      # e7b981e9ab94e4b8ade69687

Multiple incompatible Big5 variants exist: original Big5, Microsoft CP950, Big5-HKSCS (Hong Kong Supplementary Character Set). When processing Traditional Chinese content from different sources, you may encounter all three. CP950 is the Windows default and extends Big5 with some additional characters.
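For the core repertoire the variants agree byte-for-byte; the differences are confined to the vendor and HKSCS extension ranges, which is why a file can round-trip fine for months until a Hong Kong-specific character shows up. A quick sanity check with Python's codecs:

```python
raw = "中文".encode('big5')

# The Big5 variants agree on core characters; divergence only appears
# in extension ranges, so the larger codec is the safer first guess.
assert raw.decode('big5') == "中文"
assert raw.decode('cp950') == "中文"
assert raw.decode('big5hkscs') == "中文"
```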

GB2312, GBK, GB18030: Simplified Chinese

GB2312 is the original Simplified Chinese national standard (GB 2312-1980), covering about 7,000 characters. GBK ("Guojia Biaozhun Kuozhan", national-standard extension) extends it to over 21,000 characters. GB18030 is the current Chinese national standard; it maps entirely to Unicode and can encode the full Unicode range:

Encoding Characters Notes
GB2312 ~7,000 Original national standard
GBK / CP936 ~21,000 Windows default, superset of GB2312
GB18030 Full Unicode Current standard, includes 4-byte sequences

text = "简体中文"
text.encode('gb2312').hex()   # bce2cce5d6d0cec4
text.encode('gbk').hex()      # same for basic CJK
text.encode('gb18030').hex()  # same for BMP characters
text.encode('utf-8').hex()    # e7ae80e4bd93e4b8ade69687

GB18030 is the safest choice for Simplified Chinese: it's backward-compatible with GBK and GB2312, and it can encode emoji and other supplementary characters. But most files labeled "GB2312" in the wild are actually GBK.
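The supplementary-character point is easy to verify: emoji live outside the Basic Multilingual Plane, which GB18030 reaches with 4-byte sequences and GBK cannot reach at all.

```python
# GB18030 encodes any Unicode character; astral characters use 4 bytes.
assert len("😀".encode('gb18030')) == 4

try:
    "😀".encode('gbk')
except UnicodeEncodeError:
    print("GBK has no mapping for emoji")
```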

EUC-KR: Korean Encoding

EUC-KR is the traditional Korean encoding, based on the KS X 1001 character set. It covers 2,350 precomposed Hangul syllables plus Hanja (Chinese characters used in Korean):

text = "한국어"
text.encode('euc-kr').hex()    # c7d1b1b9beee
text.encode('utf-8').hex()     # ed959ceab5adec96b4

# CP949 (Microsoft extension of EUC-KR) handles more characters
text.encode('cp949').hex()     # same for common Hangul

CP949 (also called "UHC" or "Windows-949") is the Microsoft extension of EUC-KR that adds 8,822 additional Hangul syllables. Korean Windows files labeled EUC-KR are often actually CP949. Like the Shift-JIS/CP932 situation, CP949 is safer as a first guess.

KOI8-R: Russian Encoding

KOI8-R was the de facto standard for Russian text on Unix systems before UTF-8. It's a single-byte encoding like Windows-1251, but with a clever design: the Cyrillic letters are placed so that stripping the high bit yields roughly phonetically similar Latin letters (К → k, О → o, Р → r). This made KOI8-R text partially readable on ASCII-only systems.

Windows-1251 is more common today on Russian Windows systems, but KOI8-R persists in email archives and older Unix configurations.

text = "Привет"
text.encode('koi8-r').hex()     # f0d2c9d7c5d4
text.encode('cp1251').hex()     # cff0e8e2e5f2
text.encode('utf-8').hex()      # d09fd180d0b8d0b2d0b5d182
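The high-bit trick can be demonstrated directly. Note the case flip: KOI8-R places uppercase Cyrillic where stripping the high bit gives lowercase Latin, and vice versa.

```python
raw = "Привет".encode('koi8-r')
# Clear the high bit of every byte and read the result as ASCII
stripped = bytes(b & 0x7F for b in raw)
print(stripped.decode('ascii'))  # pRIWET — readable-ish on ASCII terminals
```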

Converting Legacy Encodings: The Toolkit

iconv (command line)

# List all supported encodings
iconv --list

# Convert file from Shift-JIS to UTF-8
iconv -f shift-jis -t utf-8 japanese.txt > japanese_utf8.txt

# Convert with error handling (//IGNORE skips unrepresentable chars)
iconv -f cp1252 -t utf-8//IGNORE western.txt > western_utf8.txt

# Convert entire directory
for f in *.txt; do
    iconv -f cp1252 -t utf-8 "$f" > "utf8_$f"
done

# Detect encoding first
file -i *.txt
# or
python3 -c "
import chardet, sys
data = open(sys.argv[1], 'rb').read()
print(chardet.detect(data))
" mystery.txt

Python codecs

Python's codec library covers all major legacy encodings. Key codec names and aliases:

from encodings.aliases import aliases

# Canonical codec names known to Python (aliases maps alias -> canonical)
print(sorted(set(aliases.values())))

# Common codec names (Python is flexible with aliases)
encodings = {
    'western_european': ['latin-1', 'iso-8859-1', 'cp1252', 'iso-8859-15'],
    'japanese':         ['shift-jis', 'cp932', 'euc-jp', 'iso-2022-jp'],
    'chinese_simplified':  ['gb2312', 'gbk', 'cp936', 'gb18030'],
    'chinese_traditional': ['big5', 'cp950', 'big5-hkscs'],
    'korean':           ['euc-kr', 'cp949'],
    'russian':          ['koi8-r', 'cp1251', 'iso-8859-5'],
}

# Batch conversion with detection
def convert_to_utf8(filepath: str, source_encoding: str | None = None) -> str:
    with open(filepath, 'rb') as f:
        raw = f.read()

    if source_encoding is None:
        import chardet
        detected = chardet.detect(raw)
        source_encoding = detected['encoding'] or 'utf-8'

    text = raw.decode(source_encoding, errors='replace')

    output_path = filepath + '.utf8'
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)

    return output_path

JavaScript (Node.js)

const iconv = require('iconv-lite');  // npm install iconv-lite
const fs = require('fs');

// Read Shift-JIS file
const raw = fs.readFileSync('japanese.txt');
const text = iconv.decode(raw, 'Shift_JIS');
console.log(text);

// Convert to UTF-8
const utf8Buffer = iconv.encode(text, 'utf8');
fs.writeFileSync('japanese_utf8.txt', utf8Buffer);

// Check if encoding is supported
iconv.encodingExists('EUC-KR');  // true

Migration Strategy

When migrating a system from a legacy encoding to UTF-8, the order of operations matters:

  1. Audit and classify: Identify every storage system (files, databases, APIs) and its current encoding
  2. Fix the transport layer first: Update HTTP headers, database connection strings, and file-writing code to use UTF-8 before converting stored data
  3. Convert data in one transaction: For databases, use the BLOB trick (see the Mojibake article) to preserve raw bytes during schema migration
  4. Handle mixed states: During transition, some data is old encoding and some is new. Flag or timestamp rows to track conversion status
  5. Validate after migration: Spot-check converted data, especially for characters in the 0x80–0x9F range where Windows-1252 and Latin-1 differ

The most common mistake: changing the application to write UTF-8 without migrating existing data, resulting in a database with mixed encodings and no way to tell which rows are which.
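During the mixed state, a common read-path heuristic — a sketch, not a guarantee, since short legacy strings can occasionally be valid UTF-8 by coincidence — is to try UTF-8 first and fall back to the known legacy encoding:

```python
def decode_mixed(raw: bytes, legacy: str = 'cp1252') -> str:
    """Try UTF-8 first; fall back to the legacy encoding on failure."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode(legacy)

assert decode_mixed('café'.encode('utf-8')) == 'café'
assert decode_mixed('café'.encode('cp1252')) == 'café'
```

This keeps reads working while a background job converts rows in place; it should be removed once the migration is verified complete.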

Use our Encoding Converter to inspect suspicious bytes and see how a byte sequence decodes under different legacy encodings — helpful when diagnosing whether a file is Latin-1, Windows-1252, or something else.

Legacy encodings represent decades of computing history encoded in byte patterns. They're not going away, but with the right conversion tools and an understanding of which encoding was used where, they're manageable rather than mysterious.


Next in Series: Punycode and IDN: How Unicode Domain Names Work — how internationalized domain names encode Unicode characters as ASCII-compatible labels, and the security implications you need to know.
