SymbolFYI

UTF-32

Encoding
Definition

A fixed-width encoding using 4 bytes per character, simple but memory-intensive.

UTF-32 is a fixed-width Unicode encoding that represents every character using exactly 4 bytes (32 bits). Unlike UTF-8 and UTF-16, UTF-32 provides a direct, one-to-one mapping between code units and Unicode code points, making indexing and random access by character position extremely straightforward -- at the cost of space efficiency.

How UTF-32 Works

Every Unicode code point is stored as a 32-bit integer. Since the Unicode code point space spans U+000000 to U+10FFFF, the maximum value is 1,114,111 (0x10FFFF), which fits in 21 bits. The remaining 11 bits in the 32-bit word are always zero.
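The 21-bit claim can be checked directly; a minimal sketch:

```python
# U+10FFFF, the highest Unicode code point, fits in exactly 21 bits.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())  # 21

# In UTF-32 each code point occupies a full 32-bit word, so the top
# 11 bits of any valid code unit are always zero.
word = ord('\U0001F600')  # an emoji, code point 0x1F600
assert word <= MAX_CODE_POINT
print(f'{word:032b}')  # 32-bit binary; the leading 11 bits are all 0
```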

Character  Code Point  UTF-32 LE (hex)  UTF-32 BE (hex)
A          U+0041      41 00 00 00      00 00 00 41
é          U+00E9      E9 00 00 00      00 00 00 E9
中         U+4E2D      2D 4E 00 00      00 00 4E 2D
😀         U+1F600     00 F6 01 00      00 01 F6 00

Like UTF-16, UTF-32 comes in big-endian (UTF-32 BE) and little-endian (UTF-32 LE) variants, with a BOM (U+FEFF) used to indicate byte order.
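Python's codecs illustrate the BOM behavior: the endian-neutral 'utf-32' codec writes a BOM, while the explicit 'utf-32-le' and 'utf-32-be' codecs do not (a sketch; output shown for a little-endian machine):

```python
text = 'A'

# The endian-neutral codec prepends a BOM (U+FEFF). On a little-endian
# machine this serializes as FF FE 00 00.
with_bom = text.encode('utf-32')
print(with_bom.hex(' '))

# Explicit-endian codecs write no BOM.
print(text.encode('utf-32-le').hex(' '))  # 41 00 00 00
print(text.encode('utf-32-be').hex(' '))  # 00 00 00 41

# Decoding 'utf-32' consumes the BOM and uses it to pick the byte order.
print(with_bom.decode('utf-32'))  # A
```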

Advantages of Fixed Width

The primary advantage of UTF-32 is constant-time indexing. To find the nth code point in a UTF-32 string, multiply n by 4 and read 4 bytes. With UTF-8 or UTF-16, whose code points occupy a variable number of code units, the same lookup requires scanning from the beginning of the string:

import struct

text = 'Hello'
utf32 = text.encode('utf-32-le')

# Direct index to character 4 -- just multiply by 4
offset = 4 * 4
code_point = struct.unpack_from('<I', utf32, offset)[0]
print(hex(code_point))  # 0x6f ('o')
print(chr(code_point))  # 'o'

# Encoding comparison
text = 'Hello'
print(len(text.encode('utf-8')))     # 5 bytes
print(len(text.encode('utf-16-le'))) # 10 bytes
print(len(text.encode('utf-32-le'))) # 20 bytes

Space Cost

UTF-32's fixed width comes at a steep storage cost. An ASCII string that would be 1 byte per character in UTF-8 requires 4 bytes per character in UTF-32 -- a 4x overhead:

String            UTF-8    UTF-16    UTF-32
Hello (5 chars)   5 bytes  10 bytes  20 bytes
2 CJK characters  6 bytes  4 bytes   8 bytes
2 emoji           8 bytes  8 bytes   8 bytes
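The table above can be reproduced in a few lines; a sketch (the specific sample strings are assumptions chosen to match the row labels):

```python
samples = {
    'Hello (5 chars)':  'Hello',
    '2 CJK characters': '\u4e2d\u6587',          # 中文
    '2 emoji':          '\U0001F600\U0001F601',  # 😀😁
}

for label, s in samples.items():
    sizes = [len(s.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')]
    print(f'{label:<18} UTF-8: {sizes[0]:>2}  UTF-16: {sizes[1]:>2}  UTF-32: {sizes[2]:>2}')
```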

Python's Internal Encoding

Python 3 uses a compact internal representation for strings (PEP 393). Depending on the highest code point in the string, Python chooses 1 byte (Latin-1 range), 2 bytes (BMP), or 4 bytes (supplementary) per character. When all characters are in the ASCII range, Python uses a 1-byte-per-character Latin-1 buffer. This means Python strings behave like UTF-32 conceptually (O(1) indexing by code point) without always paying the 4-byte cost.

import sys
print(sys.getsizeof('A' * 10))          # smallest: 1 byte/char (ASCII/Latin-1 compact)
print(sys.getsizeof('\u0100' * 10))     # larger: 2 bytes/char (UCS-2 compact)
print(sys.getsizeof('\U0001F600' * 10)) # largest: 4 bytes/char (UCS-4)
# Exact totals vary by Python version; the per-character growth is what matters.

When UTF-32 Is Used

UTF-32 is rarely used for file storage or transmission due to its size. It appears in:

  • Internal processing buffers where random access by code point index is required
  • Unix/Linux wchar_t: typically 4 bytes (UTF-32 LE), while Windows uses 2 bytes (UTF-16 LE)
  • Text analysis tools that need to operate character-by-character without surrogate pair complexity
  • Regular expression engines that index by code point rather than by byte offset
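As a sketch of the last two points, a UTF-32 buffer can be walked one code point per 4-byte word, with no surrogate-pair handling:

```python
import struct

text = 'a\u00e9\u4e2d\U0001F600'
buf = text.encode('utf-32-le')

# Each iteration yields exactly one code point -- including the emoji,
# which would be a surrogate pair in UTF-16.
for (cp,) in struct.iter_unpack('<I', buf):
    print(f'U+{cp:04X} {chr(cp)}')
```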
