SymbolFYI

Windows-1252

Encoding

Definition

A superset of Latin-1 used by default in legacy Windows applications, with extra characters in the 0x80–0x9F range.

Windows-1252 (also called CP1252 or WinLatin-1) is a character encoding developed by Microsoft as an extension of Latin-1 (ISO 8859-1). It assigns 27 printable characters to bytes in the range 0x80-0x9F, a region that Latin-1 reserves for C1 control characters; the remaining five bytes in that range (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left unassigned. Windows-1252 became ubiquitous in legacy web content produced on Windows systems and remains an important encoding to understand for handling real-world text.

The Key Difference from Latin-1

Latin-1 and Windows-1252 are identical for bytes 0x00-0x7F (ASCII) and 0xA0-0xFF. The critical difference is the 0x80-0x9F range:

Latin-1: These 32 bytes are C1 control characters (non-printable, mostly unused)
Windows-1252: 27 of these bytes are remapped to printable characters; the other five are left unassigned

Byte	Windows-1252 Character	Latin-1
0x80	Euro sign	C1 control
0x82	Single low-9 quotation mark	C1 control
0x83	Florin sign	C1 control
0x84	Double low-9 quotation mark	C1 control
0x85	Horizontal ellipsis	C1 control
0x91	Left single quotation mark	C1 control
0x92	Right single quotation mark	C1 control
0x93	Left double quotation mark	C1 control
0x94	Right double quotation mark	C1 control
0x96	En dash	C1 control
0x97	Em dash	C1 control
0x99	Trade mark sign	C1 control

The most practically important addition is the Euro sign at 0x80. Latin-1 has no Euro sign because it was standardized in 1987, well before the euro was introduced in 1999.
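The split between the two encodings can be checked directly. The following is a minimal sketch using Python's built-in cp1252 and latin-1 codecs; it relies on the fact that Python's cp1252 codec raises an error on the five bytes Windows-1252 leaves unassigned, and the variable names are illustrative only.

```python
# Compare how the two codecs treat each byte in 0x80-0x9F.
defined = {}      # byte value -> Windows-1252 character
unassigned = []   # byte values Python's cp1252 codec refuses to decode

for value in range(0x80, 0xA0):
    raw = bytes([value])
    try:
        defined[value] = raw.decode("cp1252")
    except UnicodeDecodeError:
        unassigned.append(value)

print(len(defined))                     # 27
print([hex(v) for v in unassigned])     # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
print(defined[0x80])                    # €

# Latin-1 maps the same byte to the C1 control character U+0080 instead.
print(repr(b"\x80".decode("latin-1")))  # '\x80'
```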

Browser Behavior and the HTML5 Specification

Because Windows-1252 was so prevalent on the early web, the HTML5 specification mandates that browsers treat a declared charset of iso-8859-1 (Latin-1) as Windows-1252. This means <meta charset='iso-8859-1'> is interpreted as Windows-1252 by all conformant browsers:

The WHATWG Encoding Standard lists iso-8859-1 (along with latin1, ascii, and several other names) as a label for the windows-1252 encoding, so a byte stream labeled iso-8859-1 is decoded with the windows-1252 decoder.

This de facto aliasing reflects the reality that almost no web content that declared Latin-1 was actually pure Latin-1; in practice it was Windows-1252.
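Code that processes web content outside a browser can mirror this aliasing. The sketch below is one possible approach, not the standard's algorithm: `WINDOWS_1252_LABELS`, `resolve_label`, and `decode_declared` are hypothetical names, and the label set is only a small subset of what the Encoding Standard lists.

```python
# Hypothetical subset of labels that browsers treat as windows-1252.
WINDOWS_1252_LABELS = {
    "iso-8859-1", "iso8859-1", "latin1", "l1",
    "ascii", "us-ascii", "cp1252", "windows-1252", "x-cp1252",
}

def resolve_label(label: str) -> str:
    """Map a declared charset label to a Python codec name."""
    normalized = label.strip().lower()
    if normalized in WINDOWS_1252_LABELS:
        return "cp1252"
    # Assumption: any other label is passed through unchanged.
    return normalized

def decode_declared(data: bytes, label: str) -> str:
    """Decode data using the codec the declared label resolves to."""
    return data.decode(resolve_label(label))

page = b"It\x92s 5\x80"  # 0x92 is a right single quote, 0x80 the Euro sign
print(decode_declared(page, "ISO-8859-1"))  # It’s 5€
```

One difference to keep in mind: Python's cp1252 codec raises an error on the five unassigned bytes, whereas the browser decoder maps them to the corresponding C1 control code points, so input containing those bytes needs separate handling.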

Common Mojibake from Windows-1252

The curly quotes and dashes added in 0x80-0x9F are a frequent source of mojibake when Windows-1252 content is decoded as UTF-8. In UTF-8, bytes in the 0x80-0x9F range are continuation bytes and cannot start a character, so a lone curly-quote byte either triggers a decode error or is replaced with U+FFFD. Decoding with the correct codec avoids the problem:

# Python: decoding and converting Windows-1252 content
bytes_cp1252 = b'He said \x93Hello\x94'  # 0x93 and 0x94 are curly double quotes
print(bytes_cp1252.decode('windows-1252'))
# He said “Hello”

# Converting to UTF-8
utf8_output = bytes_cp1252.decode('windows-1252').encode('utf-8')
print(utf8_output.hex())
# 4865207361696420e2809c48656c6c6fe2809d  (e2809c and e2809d are the curly quotes)

// JavaScript (browser): TextDecoder supports the 'windows-1252' label
const decoder = new TextDecoder('windows-1252');
const bytes = new Uint8Array([0x93, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x94]);
console.log(decoder.decode(bytes));  // “Hello”

When You Encounter Windows-1252

Windows-1252 appears frequently in:

Legacy HTML pages without a charset declaration or with charset=iso-8859-1
Text files created by older versions of Notepad or Word
CSV exports from older Windows software
Email messages with charset=windows-1252 or charset=iso-8859-1

Identifying Windows-1252 Content

The presence of bytes in the 0x80-0x9F range is a strong indicator that a file claiming to be Latin-1 is actually Windows-1252. A pure Latin-1 file should never use those byte values for printable content.
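This check can be expressed as a short heuristic. The sketch below assumes the input is already known to be a single-byte encoding rather than UTF-8 (valid multi-byte UTF-8 sequences also contain bytes in this range, so UTF-8 should be ruled out first); `c1_range_bytes` and `guess_single_byte_encoding` are hypothetical helper names.

```python
def c1_range_bytes(data: bytes) -> list[int]:
    """Return the distinct byte values in 0x80-0x9F found in data, sorted."""
    return sorted({b for b in data if 0x80 <= b <= 0x9F})

def guess_single_byte_encoding(data: bytes) -> str:
    """Pick a codec for data assumed to be Latin-1 or Windows-1252."""
    return "cp1252" if c1_range_bytes(data) else "latin-1"

sample = b"caf\xe9 \x96 \x93menu\x94"  # 0x96 is an en dash, 0x93/0x94 curly quotes
print([hex(b) for b in c1_range_bytes(sample)])  # ['0x93', '0x94', '0x96']

codec = guess_single_byte_encoding(sample)
print(codec)                 # cp1252
print(sample.decode(codec))  # café – “menu”
```

When no bytes fall in 0x80-0x9F, the two codecs produce identical output, so the choice between them only matters for files that trip this check.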

When processing legacy web content, default to trying Windows-1252 (rather than Latin-1) when the declared encoding is iso-8859-1, since this matches browser behavior and avoids misinterpreting curly quotes and dashes.

Related Terms