SymbolFYI

Windows-1252

Encoding
Definition

A superset of Latin-1 used by default in legacy Windows applications, with extra characters in the 0x80–0x9F range.

Windows-1252 (also called CP1252 or WinLatin-1) is a character encoding developed by Microsoft as an extension of Latin-1 (ISO 8859-1). It assigns 27 printable characters to bytes in the range 0x80-0x9F, a region that Latin-1 reserves for C1 control characters. The remaining five bytes in that range (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left unassigned. Windows-1252 became ubiquitous in legacy web content produced on Windows systems and remains an important encoding to understand for handling real-world text.

The Key Difference from Latin-1

Latin-1 and Windows-1252 are identical for bytes 0x00-0x7F (ASCII) and 0xA0-0xFF. The critical difference is the 0x80-0x9F range:

  • Latin-1: These 32 bytes are C1 control characters (non-printable, mostly unused)
  • Windows-1252: 27 of these bytes are remapped to printable characters such as curly quotes, dashes, and the euro sign

Byte   Windows-1252 character         Latin-1
0x80   Euro sign                      C1 control
0x82   Single low-9 quotation mark    C1 control
0x83   Florin sign                    C1 control
0x84   Double low-9 quotation mark    C1 control
0x85   Horizontal ellipsis            C1 control
0x91   Left single quotation mark     C1 control
0x92   Right single quotation mark    C1 control
0x93   Left double quotation mark     C1 control
0x94   Right double quotation mark    C1 control
0x96   En dash                        C1 control
0x97   Em dash                        C1 control
0x99   Trade mark sign                C1 control
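
The comparison above can be checked mechanically. The following is a minimal sketch that decodes every byte value with Python's built-in latin-1 and cp1252 codecs and sorts the bytes into three groups. The function name compare_codecs is illustrative and not part of any library. It relies on Python's cp1252 codec rejecting the five unassigned bytes; browser decoders handle those bytes differently, mapping them to the corresponding C1 controls.

```python
def compare_codecs():
    """Classify each byte by how cp1252 decodes it relative to latin-1."""
    same, differ, undefined = [], [], []
    for value in range(256):
        raw = bytes([value])
        latin1_char = raw.decode("latin-1")  # latin-1 maps every byte to U+0000..U+00FF
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec treats unassigned bytes as errors.
            undefined.append(value)
            continue
        if cp1252_char == latin1_char:
            same.append(value)
        else:
            differ.append(value)
    return same, differ, undefined


same, differ, undefined = compare_codecs()
print(len(same), len(differ), len(undefined))  # 224 27 5
print([hex(b) for b in undefined])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Every byte in the "differ" group falls inside 0x80-0x9F, which matches the description of the two encodings as identical everywhere else.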

The most practically important addition is the euro sign at 0x80. Latin-1 cannot represent it: ISO 8859-1 was standardized in 1987, well before the euro was introduced in 1999.
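
A short illustration of the euro sign difference, using only Python's built-in codecs. The variable names are arbitrary; the sketch simply shows that the character encodes to one byte under cp1252 and fails under latin-1.

```python
euro = "\u20ac"  # EURO SIGN

# Windows-1252 stores the euro sign in the single byte 0x80.
cp1252_bytes = euro.encode("cp1252")

# Latin-1 has no slot for the character, so encoding raises an error.
try:
    euro.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Read the other way, byte 0x80 is the C1 control U+0080 under Latin-1.
latin1_view = b"\x80".decode("latin-1")

print(cp1252_bytes, latin1_ok, hex(ord(latin1_view)))  # b'\x80' False 0x80
```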

Browser Behavior and the HTML5 Specification

Because Windows-1252 was so prevalent on the early web, the HTML Standard (through the WHATWG Encoding Standard) requires browsers to treat a declared charset of iso-8859-1 (Latin-1) as Windows-1252. This means <meta charset='iso-8859-1'> is interpreted as Windows-1252 by all conformant browsers:

In the WHATWG Encoding Standard, iso-8859-1 is listed as a label for the windows-1252 encoding (as are latin1 and us-ascii), so a byte stream labeled iso-8859-1 is decoded with the windows-1252 decoder.

This de facto aliasing reflects how the early web actually behaved: content that declared Latin-1 was very often produced on Windows and was in practice Windows-1252 rather than pure Latin-1.
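
Outside the browser, the same aliasing has to be applied by hand, because Python's latin-1 codec is a true ISO 8859-1 codec. The sketch below shows one way to do that. The helper names resolve_web_charset and decode_like_browser are hypothetical, and only the Latin-1 family of labels is remapped here; the Encoding Standard's label table covers more cases than this sketch handles.

```python
import codecs

# Canonical name Python uses for its Latin-1 codec, whatever alias was given.
_LATIN1_NAME = codecs.lookup("latin-1").name


def resolve_web_charset(label: str) -> str:
    """Map a declared charset label to the Python codec a browser-like decoder would use."""
    name = codecs.lookup(label.strip()).name
    # Browsers decode anything labeled Latin-1 as Windows-1252.
    return "cp1252" if name == _LATIN1_NAME else name


def decode_like_browser(data: bytes, declared: str) -> str:
    """Decode bytes using the remapped codec, substituting U+FFFD for undecodable bytes."""
    return data.decode(resolve_web_charset(declared), errors="replace")


print(resolve_web_charset("ISO-8859-1"))  # cp1252
print(decode_like_browser(b"\x93Hi\x94", "iso-8859-1"))  # “Hi”
```

One difference from browsers remains: Python's cp1252 codec has no mapping for the five unassigned bytes, so with errors="replace" they become U+FFFD here rather than C1 controls.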

Common Mojibake from Windows-1252

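The curly quotes and dashes added in 0x80-0x9F are a frequent source of mojibake when Windows-1252 content is decoded as UTF-8. In UTF-8, bytes 0x80-0xBF are valid only as continuation bytes, so a lone byte from the 0x80-0x9F range cannot start a character. A UTF-8 decoder either raises an error or substitutes the replacement character (U+FFFD). Decoding with the correct encoding avoids the problem: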

# Python: decoding Windows-1252 content and converting it to UTF-8
bytes_cp1252 = b'He said \x93Hello\x94'  # 0x93 / 0x94 are curly double quotes
print(bytes_cp1252.decode('windows-1252'))
# He said “Hello”

# Converting to UTF-8: each curly quote becomes a three-byte sequence
utf8_output = bytes_cp1252.decode('windows-1252').encode('utf-8')
print(utf8_output.hex())
# 4865207361696420e2809c48656c6c6fe2809d

// JavaScript (browser): TextDecoder supports 'windows-1252'
const decoder = new TextDecoder('windows-1252');
const bytes = new Uint8Array([0x93, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x94]);
console.log(decoder.decode(bytes));  // “Hello”
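
For contrast, the failure mode itself can be reproduced by feeding the same bytes to a UTF-8 decoder. This is a small sketch using Python's standard error handlers; the variable names are arbitrary.

```python
data = b"He said \x93Hello\x94"

# Strict decoding stops at the first byte that cannot start a UTF-8 sequence.
try:
    data.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = f"error at byte offset {exc.start}"

# Lenient decoding substitutes U+FFFD for each invalid byte instead of failing.
lenient = data.decode("utf-8", errors="replace")

print(strict_result)  # error at byte offset 8
print(lenient)        # He said �Hello�
```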

When You Encounter Windows-1252

Windows-1252 appears frequently in:

  • Legacy HTML pages without a charset declaration or with charset=iso-8859-1
  • Text files created by older versions of Notepad or Word
  • CSV exports from older Windows software
  • Email messages with charset=windows-1252 or charset=iso-8859-1
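
As an illustration of the CSV case, the sketch below writes a small made-up export to a temporary file and reads it back with an explicit encoding. The sample row is hypothetical; the point is only that passing encoding="cp1252" to open() yields the intended dash and euro sign.

```python
import csv
import os
import tempfile

# Hypothetical export: 0x96 is an en dash and 0x80 is the euro sign in Windows-1252.
sample = b"item,price\r\nCoffee \x96 large,\x804.50\r\n"

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # newline="" is what the csv module expects; the encoding is stated explicitly.
    with open(path, newline="", encoding="cp1252") as handle:
        rows = list(csv.reader(handle))
finally:
    os.remove(path)

print(rows)
```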

Identifying Windows-1252 Content

The presence of bytes in the 0x80-0x9F range is a strong indicator that a file claiming to be Latin-1 is actually Windows-1252. A pure Latin-1 file should never use those byte values for printable content.
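
The check described above amounts to scanning for byte values in 0x80-0x9F. A minimal sketch follows; the function names are illustrative. It assumes the input is already known to be in a single-byte encoding, since UTF-8 text also contains bytes in this range as continuation bytes and would trigger a false positive.

```python
def c1_range_offsets(data: bytes) -> list:
    """Return the offsets of all bytes that fall in the 0x80-0x9F range."""
    return [index for index, value in enumerate(data) if 0x80 <= value <= 0x9F]


def looks_like_cp1252(data: bytes) -> bool:
    """Heuristic: content labeled Latin-1 that uses 0x80-0x9F is probably Windows-1252."""
    return bool(c1_range_offsets(data))


print(looks_like_cp1252(b"caf\xe9"))        # False
print(looks_like_cp1252(b"\x93Hi\x94"))     # True
print(c1_range_offsets(b"\x93Hi\x94"))      # [0, 3]
```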

When processing legacy web content, default to trying Windows-1252 (rather than Latin-1) when the declared encoding is iso-8859-1, since this matches browser behavior and avoids misinterpreting curly quotes and dashes.
