Windows-1252 (also called CP1252 or WinLatin-1) is a character encoding developed by Microsoft as an extension of Latin-1 (ISO 8859-1). It assigns 27 printable characters to bytes in the range 0x80-0x9F, a region that Latin-1 reserves for C1 control characters. Windows-1252 became ubiquitous in legacy web content produced on Windows systems and remains an important encoding to understand when handling real-world text.
The Key Difference from Latin-1
Latin-1 and Windows-1252 are identical for bytes 0x00-0x7F (ASCII) and 0xA0-0xFF. The critical difference is the 0x80-0x9F range:
- Latin-1: These 32 bytes are C1 control characters (non-printable, mostly unused)
- Windows-1252: 27 of these bytes are remapped to printable characters; the remaining five (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left unassigned
| Byte | Windows-1252 Character | Latin-1 |
|---|---|---|
| 0x80 | Euro sign | C1 control |
| 0x82 | Single low-9 quotation mark | C1 control |
| 0x83 | Florin sign | C1 control |
| 0x84 | Double low-9 quotation mark | C1 control |
| 0x85 | Horizontal ellipsis | C1 control |
| 0x91 | Left single quotation mark | C1 control |
| 0x92 | Right single quotation mark | C1 control |
| 0x93 | Left double quotation mark | C1 control |
| 0x94 | Right double quotation mark | C1 control |
| 0x96 | En dash | C1 control |
| 0x97 | Em dash | C1 control |
| 0x99 | Trade mark sign | C1 control |
The most practically important addition is the Euro sign at 0x80. Latin-1 has no Euro sign because it was standardized in 1987, more than a decade before the euro was introduced in 1999.
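The difference is easy to check in Python, whose standard `cp1252` and `latin-1` codecs implement the two encodings. The sketch below is illustrative only: the variable names are made up for this example, and the sample bytes are taken from the table above.

```python
# Bytes from the table above: euro sign, curly double quotes, em dash.
sample = bytes([0x80, 0x93, 0x94, 0x97])

as_cp1252 = sample.decode("cp1252")    # printable characters
as_latin1 = sample.decode("latin-1")   # invisible C1 control characters

# Show the code point each byte maps to under the two encodings.
for byte, win, lat in zip(sample, as_cp1252, as_latin1):
    print(f"0x{byte:02X}  cp1252=U+{ord(win):04X}  latin-1=U+{ord(lat):04X}")

# The euro sign (U+20AC) can be encoded in Windows-1252 but not in Latin-1.
euro_cp1252 = "\u20ac".encode("cp1252")
try:
    "\u20ac".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```

Under Latin-1 every byte maps to the code point with the same numeric value, which is why 0x80-0x9F land on U+0080-U+009F rather than on printable characters.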
Browser Behavior and the HTML5 Specification
Because Windows-1252 was so prevalent on the early web, the HTML specification, by way of the WHATWG Encoding Standard, requires browsers to treat a declared charset of iso-8859-1 (Latin-1) as Windows-1252: the standard lists iso-8859-1 as one of the labels for the windows-1252 encoding. This means <meta charset='iso-8859-1'> is interpreted as Windows-1252 by all conformant browsers:
When a user agent decodes a byte stream labeled as iso-8859-1, it must decode using the Windows-1252 decoder. -- WHATWG Encoding Standard
This de-facto aliasing reflects the reality that almost no web content that declared Latin-1 was actually pure Latin-1 -- it was Windows-1252.
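Python does not apply this aliasing on its own: its `iso-8859-1` codec is true Latin-1. A minimal sketch of mimicking the browser rule is shown below. The function name `web_codec` is hypothetical, and the sketch covers only the Latin-1 labels that Python itself recognizes, not the full label table of the Encoding Standard.

```python
import codecs

# Canonical name Python uses for its Latin-1 codec, looked up rather than hard-coded.
_LATIN1_NAME = codecs.lookup("iso-8859-1").name


def web_codec(declared_label: str) -> str:
    """Return the Python codec to use for a charset label declared by web content.

    Hypothetical helper: any label that Python resolves to Latin-1 is
    redirected to cp1252; every other label keeps its own codec.
    Unknown labels raise LookupError, as codecs.lookup() does.
    """
    canonical = codecs.lookup(declared_label.strip()).name
    return "cp1252" if canonical == _LATIN1_NAME else canonical


# A page declared as iso-8859-1 that actually contains curly quotes.
body = b"\x93Hi\x94"
text = body.decode(web_codec("ISO-8859-1"))
```

Resolving the label through `codecs.lookup()` first means spelling variants such as `latin1` or `ISO-8859-1` are all caught by the same comparison.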
Common Mojibake from Windows-1252
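The curly quotes and dashes added in 0x80-0x9F are a frequent source of mojibake when Windows-1252 content is decoded as UTF-8. In UTF-8, bytes in the 0x80-0x9F range are continuation bytes and can never start a character, so a lone 0x93 or 0x97 becomes a replacement character (U+FFFD) or a decode error. Decoding with the correct encoding and then re-encoding as UTF-8 avoids the problem: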
```python
# Decoding Windows-1252 content
bytes_cp1252 = b'He said \x93Hello\x94'  # 0x93 and 0x94 are curly double quotes
print(bytes_cp1252.decode('windows-1252'))
# He said “Hello”

# Converting to UTF-8
utf8_output = bytes_cp1252.decode('windows-1252').encode('utf-8')
print(utf8_output.hex())
# 4865207361696420e2809c48656c6c6fe2809d  (e2809c and e2809d are the curly quotes)
```

```javascript
// Browser: TextDecoder supports 'windows-1252'
const decoder = new TextDecoder('windows-1252');
const bytes = new Uint8Array([0x93, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x94]);
console.log(decoder.decode(bytes)); // “Hello”
```
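The failure mode itself can be reproduced in a few lines of Python. This is a minimal sketch, not a robust detector: the helper name `decode_utf8_or_cp1252` is hypothetical, and trying UTF-8 first with Windows-1252 as the fallback is an assumed heuristic rather than anything the encodings themselves prescribe.

```python
data = b"He said \x93Hello\x94"  # Windows-1252 bytes with curly quotes

# Strict UTF-8 decoding rejects the stray 0x93 byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD for each invalid byte.
lossy = data.decode("utf-8", errors="replace")


def decode_utf8_or_cp1252(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Windows-1252 (hypothetical helper).

    Note: Python's cp1252 codec can itself raise on the five unassigned
    bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D); that case is not handled here.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252")


recovered = decode_utf8_or_cp1252(data)
```

Trying UTF-8 first is reasonably safe because text containing Windows-1252 punctuation is very unlikely to form valid UTF-8 by accident.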
When You Encounter Windows-1252
Windows-1252 appears frequently in:
- Legacy HTML pages without a charset declaration or with charset=iso-8859-1
- Text files created by older versions of Notepad or Word
- CSV exports from older Windows software
- Email messages with charset=windows-1252 or charset=iso-8859-1
Identifying Windows-1252 Content
The presence of bytes in the 0x80-0x9F range is a strong indicator that a file claiming to be Latin-1 is actually Windows-1252. A pure Latin-1 file should never use those byte values for printable content.
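This check can be sketched as a scan for bytes in that range. The function name `c1_range_offsets` is hypothetical, and the check is only meaningful for data already believed to be in a single-byte encoding, because UTF-8 also uses 0x80-0x9F as continuation bytes.

```python
def c1_range_offsets(data: bytes) -> list:
    """Return the offsets of all bytes in the 0x80-0x9F range.

    Hypothetical helper: a non-empty result for a file labeled Latin-1
    suggests that it is really Windows-1252.
    """
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


pure_latin1 = b"caf\xe9"                   # 0xE9 (e-acute) is the same in both encodings
probably_cp1252 = b"\x93quoted\x94 \x96 text"

print(c1_range_offsets(pure_latin1))
print(c1_range_offsets(probably_cp1252))
```

Returning offsets rather than a boolean makes it easy to inspect the surrounding bytes before deciding how to decode the file.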
When processing legacy web content, default to trying Windows-1252 (rather than Latin-1) when the declared encoding is iso-8859-1, since this matches browser behavior and avoids misinterpreting curly quotes and dashes.
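One Python-specific wrinkle when following this advice: Python's `cp1252` codec raises an error on the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D), whereas the WHATWG windows-1252 decoder maps them to the C1 control code points of the same value. A sketch of approximating the browser behavior with a custom error handler follows; the handler name `c1-passthrough` and the function `decode_legacy` are made up for this example.

```python
import codecs


def _c1_passthrough(err):
    """Map undecodable bytes to the code points with the same numeric value."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which decoding resumes.
    return "".join(chr(b) for b in bad), err.end


# Register under a hypothetical name so it can be passed as errors=...
codecs.register_error("c1-passthrough", _c1_passthrough)


def decode_legacy(data: bytes) -> str:
    """Decode bytes declared as iso-8859-1 or windows-1252 without raising."""
    return data.decode("cp1252", errors="c1-passthrough")


print(repr(decode_legacy(b"\x93ok\x94 \x81")))
```

With this handler every possible byte value decodes to some character, so the function never fails on arbitrary legacy input.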