SymbolFYI

Latin-1 (ISO 8859-1)

Encoding
Definition

A single-byte encoding for Western European languages covering 256 characters (U+0000–U+00FF).

Latin-1, formally ISO/IEC 8859-1, is an 8-bit character encoding for Western European languages. Used as a charset, it assigns a character to every one of the 256 possible byte values (0x00-0xFF), and it was the dominant encoding for Western European web content before UTF-8 took over. Latin-1 also holds a special place in Unicode: its 256 characters map one-to-one, in the same order, onto the first 256 Unicode code points (U+0000-U+00FF).

Character Layout

Latin-1 divides its 256 code points into four groups:

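| Range              | Decimal | Content                                                |
|--------------------|---------|--------------------------------------------------------|
| C0 controls        | 0-31    | Same as ASCII control characters                       |
| Basic Latin        | 32-127  | Identical to ASCII (127 is the DEL control character)  |
| C1 controls        | 128-159 | Non-printable control characters                       |
| Latin-1 supplement | 160-255 | Accented letters and symbols                           |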
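
The four ranges above can be expressed as a small lookup. This is a minimal sketch; the function name `latin1_group` is an illustrative assumption, not part of any library.

```python
def latin1_group(byte: int) -> str:
    """Return the Latin-1 group (per the table above) for a byte value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("Latin-1 byte values must be in the range 0-255")
    if byte <= 31:
        return "C0 controls"
    if byte <= 127:
        return "Basic Latin"
    if byte <= 159:
        return "C1 controls"
    return "Latin-1 supplement"


# Classify one sample byte from each group.
for value in (0x0A, 0x41, 0x85, 0xE9):
    print(f"0x{value:02X} -> {latin1_group(value)}")
```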

The Latin-1 supplement (0xA0-0xFF) adds characters needed for Western European languages:

| Hex       | Character | Name                                 |
|-----------|-----------|--------------------------------------|
| 0xA0      | (space)   | Non-breaking space                   |
| 0xA9      | ©         | Copyright sign                       |
| 0xAE      | ®         | Registered sign                      |
| 0xC0-0xD6 | À-Ö       | Uppercase accented letters           |
| 0xE0-0xF6 | à-ö       | Lowercase accented letters           |
| 0xFF      | ÿ         | Latin small letter y with diaeresis  |
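
To check entries like these without relying on a printed table, one option is to decode each byte and ask Python's standard `unicodedata` module for the official Unicode name. The helper `latin1_name` below is a hypothetical name used only for this sketch.

```python
import unicodedata


def latin1_name(byte: int) -> str:
    """Decode one Latin-1 byte and return its Unicode character name."""
    char = bytes([byte]).decode("latin-1")
    return unicodedata.name(char)


# Print a few bytes from the Latin-1 supplement with their names.
for byte in (0xA0, 0xA9, 0xAE, 0xC0, 0xD6, 0xE0, 0xF6, 0xFF):
    char = bytes([byte]).decode("latin-1")
    print(f"0x{byte:02X}  {char!r}  {latin1_name(byte)}")
```

Control characters (0x00-0x1F and 0x7F-0x9F) have no Unicode name, so `unicodedata.name` raises `ValueError` for them unless a default is passed.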

Latin-1 and Unicode

The first 256 Unicode code points are identical to Latin-1. U+00A9 is the copyright sign, U+00E9 is e-acute -- the same assignments as Latin-1 bytes 0xA9 and 0xE9. This deliberate alignment means that any Latin-1 byte can be interpreted as a Unicode code point without a lookup table.
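
The identity between byte values and code points can be verified directly. The following is a small sketch using only built-in functions; the variable names are illustrative.

```python
# Decode every possible byte value as Latin-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each decoded character's code point equals the byte it came from.
identity_holds = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))
print(identity_holds)  # True

# So decoding a single Latin-1 byte is equivalent to calling chr() on it.
print(bytes([0xE9]).decode("latin-1") == chr(0xE9))  # True
```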

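However, Latin-1 bytes and UTF-8 bytes differ for everything in the 0x80-0xFF range. In UTF-8, bytes in that range never stand alone: 0x80-0xBF are continuation bytes, 0xC2-0xF4 are lead bytes of multi-byte sequences, and the remaining values (0xC0, 0xC1, 0xF5-0xFF) never appear in valid UTF-8. The byte 0xE9 is e-acute in Latin-1, but in UTF-8 it is the lead byte of a 3-byte sequence, so on its own it cannot be decoded: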

# Latin-1 vs UTF-8 for e-acute (U+00E9)
print(b'\xe9'.decode('latin-1'))  # correct single-byte decode

try:
    b'\xe9'.decode('utf-8')  # raises UnicodeDecodeError
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data

# e-acute in UTF-8 requires two bytes
print('\u00e9'.encode('utf-8').hex())   # 'c3a9'
print('\u00e9'.encode('latin-1').hex()) # 'e9'
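
A practical consequence of this mismatch is garbled text when UTF-8 bytes are decoded as Latin-1. Because Latin-1 accepts every byte, no error is raised; each UTF-8 byte simply becomes its own character. The sketch below reproduces the effect and reverses it; the sample word is an arbitrary choice.

```python
original = "caf\u00e9"                 # 'café'
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# Wrong codec: the two UTF-8 bytes 0xC3 0xA9 become two separate characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # 'cafÃ©'

# The damage is reversible as long as the garbled text was not modified.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```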

Decoding Any Byte Sequence

Because Latin-1 maps every possible byte value to a character, it never raises a decoding error. This makes it a useful 'lossless' encoding for manipulating arbitrary binary data as text:

# Read arbitrary bytes as Latin-1 without errors
binary_data = bytes(range(256))
text = binary_data.decode('latin-1')  # always succeeds
back = text.encode('latin-1')         # lossless round-trip
print(back == binary_data)  # True

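This property is useful wherever bytes have to pass through a text interface. In Python, for example, http.client decodes HTTP header lines as ISO 8859-1, and the WSGI specification (PEP 3333) requires header strings to contain only code points that can be encoded as ISO 8859-1.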
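
As a rough illustration of that pass-through pattern, the sketch below carries UTF-8 bytes through a `str` value and recovers them afterwards. The helper names `to_text` and `from_text` are hypothetical and do not correspond to any library API.

```python
def to_text(raw: bytes) -> str:
    """Wrap arbitrary bytes in a str, one character per byte."""
    return raw.decode("latin-1")


def from_text(text: str) -> bytes:
    """Recover the original bytes; fails if text has code points above U+00FF."""
    return text.encode("latin-1")


# Bytes that are really UTF-8, carried through a str-only interface.
payload = "na\u00efve \u2713".encode("utf-8")
carried = to_text(payload)

print(len(carried) == len(payload))         # True: one character per byte
print(from_text(carried) == payload)        # True: nothing was lost
print(from_text(carried).decode("utf-8"))   # the original text again
```

The carried string is not meaningful text. It should be treated as an opaque container and decoded with the real encoding only after converting back to bytes.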

Comparing with UTF-8

# Same visible character, different bytes
char = '\u00e9'  # e with acute accent

utf8_bytes   = char.encode('utf-8')
latin1_bytes = char.encode('latin-1')

print(utf8_bytes.hex())   # 'c3a9' (2 bytes)
print(latin1_bytes.hex()) # 'e9'   (1 byte)
print(len(utf8_bytes))    # 2
print(len(latin1_bytes))  # 1

Limitations

Latin-1 cannot represent characters outside its 256-character range. Languages such as Polish, Czech, and Romanian need letters that are not in the set (for example ł, č, and ș). Even for Western European text there are gaps: the euro sign (U+20AC) is absent, because the euro was introduced in 1999, more than a decade after ISO 8859-1 was published in 1987. Windows-1252, which reuses most of the C1 control range (0x80-0x9F) for printable characters, places the euro sign at byte 0x80. For any new project, use UTF-8; Latin-1 appears today primarily in legacy systems, old email messages, and HTTP responses that declare charset=iso-8859-1.
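
These limits show up as encoding errors in practice. The following sketch uses Python's built-in `latin-1` and `cp1252` (Windows-1252) codecs; the helper `can_encode_latin1` is a hypothetical name introduced for illustration.

```python
def can_encode_latin1(text: str) -> bool:
    """Return True if every character in text fits in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20ac"
print(can_encode_latin1("\u00e9"))  # True:  e-acute is in Latin-1
print(can_encode_latin1(euro))      # False: euro sign is not
print(can_encode_latin1("\u0142"))  # False: Polish l with stroke is not

# Windows-1252 assigns the euro sign to byte 0x80.
print(euro.encode("cp1252"))                  # b'\x80'
# In Latin-1 the same byte is a C1 control character, not the euro sign.
print(b"\x80".decode("latin-1") == "\x80")    # True
# errors="replace" substitutes '?' and silently loses the character.
print(euro.encode("latin-1", errors="replace"))  # b'?'
```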
