Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared
- ○ 1. What Is Unicode? The Universal Character Standard Explained
- ○ 2. Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized
- ● 3. Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared
- ○ 4. Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained
- ○ 5. Unicode Properties and Categories: Classifying Every Character
- ○ 6. Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist
- ○ 7. How Emoji Work in Unicode: From Code Points to Skin Tones
- ○ 8. CJK Unification: How Unicode Handles Chinese, Japanese, and Korean
- ○ 9. Unicode Version History: From 1.0 to 16.0 and Beyond
- ○ 10. Unicode CLDR: The Database Behind Every Localized App
Unicode defines what characters exist and assigns them code points. But a code point like U+00E9 (é) is an abstract number — before you can store it in a file or transmit it over a network, you need to decide how to represent that number as a sequence of bytes. That is what an encoding does.
Unicode comes with three standard encoding forms: UTF-8, UTF-16, and UTF-32. They all encode the same set of characters (all Unicode code points), but they make different trade-offs in terms of storage size, simplicity, and compatibility. Understanding these trade-offs is essential for any developer who handles text files, APIs, or database storage.
The Fundamental Problem
Any encoding must solve two problems:
- How many bytes per code point? Unicode has 1,114,112 possible code points. To represent the highest (U+10FFFF), you need at least 21 bits — which means a minimum of 3 bytes per character in a fixed-width scheme.
- How do you know where one character ends and the next begins? In a variable-length encoding, the decoder needs to figure out boundaries without external framing.
UTF-8, UTF-16, and UTF-32 answer these questions differently.
UTF-32: The Simplest Approach
UTF-32 takes the simplest possible approach: every code point uses exactly 4 bytes (32 bits).
```
U+0041  (A)  → 0x00000041
U+00E9  (é)  → 0x000000E9
U+4E2D  (中) → 0x00004E2D
U+1F600 (😀) → 0x0001F600
```
There are no variable-length complications. The byte offset of the N-th character is always N × 4. Random access is O(1).
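You can see the fixed width directly from Python. A small sketch — the `'utf-32-be'` codec is used here because the plain `'utf-32'` codec prepends a BOM:

```python
# Every code point occupies exactly 4 bytes in UTF-32.
# The big-endian codec avoids the BOM the plain 'utf-32' codec adds.
for ch in "Aé中😀":
    encoded = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X} -> 0x{encoded.hex().upper()} ({len(encoded)} bytes)")
```

Each line prints a 4-byte value whose hex digits are simply the zero-padded code point.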
Pros:
- Trivially simple to implement
- O(1) character access by index
- No surrogate pairs, no multi-byte sequences to decode

Cons:
- 4 bytes per character, always — even for ASCII
- A plain English text file in UTF-32 uses 4× the space of ASCII or UTF-8
- Almost no adoption on the web or in file formats
UTF-32 is occasionally used internally — CPython's flexible string representation (PEP 393) stores a string as 4 bytes per code point when it contains supplementary characters, and wchar_t is a 32-bit UTF-32 code unit on most Unix systems. For data exchange or file storage, it is rarely the right choice.
UTF-16: The Middle Ground
UTF-16 uses 2 bytes (16 bits) for code points in the BMP (U+0000–U+FFFF), and 4 bytes (two 16-bit code units) for supplementary characters (U+10000–U+10FFFF).
BMP Characters
For code points in the Basic Multilingual Plane (outside the surrogate range), the 16-bit code unit value equals the code point directly:
```
U+0041 (A)  → 0x0041
U+00E9 (é)  → 0x00E9
U+4E2D (中) → 0x4E2D
```
Surrogate Pairs for Supplementary Characters
For the 1,048,576 supplementary code points (planes 1–16), UTF-16 uses a mechanism called surrogate pairs. The BMP reserves two ranges specifically for this purpose:
- High surrogates: U+D800–U+DBFF (1,024 code points)
- Low surrogates: U+DC00–U+DFFF (1,024 code points)
A supplementary character is encoded as one high surrogate followed by one low surrogate. The code point is decoded from the pair using this formula:
```
code_point = 0x10000 + (high - 0xD800) × 0x400 + (low - 0xDC00)
```
For emoji 😀 (U+1F600):
```
U+1F600 = 0x1F600
offset = 0x1F600 - 0x10000 = 0xF600
high   = 0xD800 + (0xF600 >> 10)  = 0xD800 + 0x3D  = 0xD83D
low    = 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
Result: 0xD83D 0xDE00
```
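The same arithmetic is easy to write as a pair of helper functions. A sketch (the function names `to_surrogates` and `from_surrogates` are illustrative, not from any standard library):

```python
# Encode a supplementary code point as a UTF-16 surrogate pair,
# and decode the pair back, following the formula above.
def to_surrogates(cp):
    offset = cp - 0x10000          # 20-bit offset into planes 1-16
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))              # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))  # 0x1f600
```

Round-tripping any code point in U+10000–U+10FFFF through these two functions returns the original value.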
You can verify this in JavaScript:
```javascript
// JavaScript uses UTF-16 internally
const emoji = '😀';
console.log(emoji.length); // 2 (two UTF-16 code units!)
console.log(emoji.charCodeAt(0).toString(16)); // 'd83d' (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // 'de00' (low surrogate)

// Correct: use codePointAt() instead
console.log(emoji.codePointAt(0).toString(16)); // '1f600'
console.log(emoji.codePointAt(0)); // 128512

// Spreading to an array iterates by code point, not code unit
console.log([...emoji].length); // 1 (one character)
```
Pros:
- Good balance for CJK-heavy text (2 bytes per ideograph vs 3 bytes in UTF-8)
- Used internally by Windows, JavaScript, Java, .NET, and Objective-C

Cons:
- Surrogate pairs make supplementary character handling tricky
- Not backward-compatible with ASCII
- String length and character index operations can silently produce wrong results when supplementary characters are involved
When to use UTF-16: When you must interoperate with Windows APIs, Java's String type, or JavaScript without conversion overhead. For file storage and data exchange, UTF-8 is usually preferred.
UTF-8: The Web's Encoding
UTF-8 uses 1 to 4 bytes per code point — a variable-length scheme that is most space-efficient for text that is predominantly ASCII:
| Code Point Range | Byte Length | Byte Pattern |
|---|---|---|
| U+0000–U+007F | 1 byte | 0xxxxxxx |
| U+0080–U+07FF | 2 bytes | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The x bits carry the actual code point value, packed in from the most significant bit.
Encoding Examples
U+0041 (A) — 1 byte:

```
Binary: 01000001
UTF-8:  0x41
```

U+00E9 (é) — 2 bytes:

```
Binary:  11101001
Pattern: 110xxxxx 10xxxxxx
Packed:  110 00011  10 101001
UTF-8:   0xC3 0xA9
```

U+4E2D (中) — 3 bytes:

```
Binary:  0100 111000 101101
Pattern: 1110xxxx 10xxxxxx 10xxxxxx
Packed:  1110 0100  10 111000  10 101101
UTF-8:   0xE4 0xB8 0xAD
```

U+1F600 (😀) — 4 bytes:

```
Binary:  000 011111 011000 000000
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Packed:  11110 000  10 011111  10 011000  10 000000
UTF-8:   0xF0 0x9F 0x98 0x80
```
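The bit-packing above can be written as a short from-scratch encoder. This is a sketch for illustration only — it skips the validation a real encoder needs (rejecting surrogates and out-of-range values) — and the `utf8_encode` name is our own:

```python
# Manual UTF-8 encoding of one code point, mirroring the table above.
# No input validation (surrogates, range) -- illustration only.
def utf8_encode(cp):
    if cp < 0x80:                                    # 1 byte: 0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:                                 # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                               # 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                                            # 4 bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")  # matches the built-in encoder
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex().upper()}")
```

The assertions confirm that the hand-packed bytes agree with Python's built-in UTF-8 codec for all four worked examples.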
In Python:
```python
# Encode and decode
text = "Hello, 中文 😀"

# Encoding to bytes
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')
utf32_bytes = text.encode('utf-32')

print(len(utf8_bytes))   # 18
print(len(utf16_bytes))  # 26 (includes 2-byte BOM)
print(len(utf32_bytes))  # 48 (includes 4-byte BOM)

# See the actual bytes
print(utf8_bytes)
# b'Hello, \xe4\xb8\xad\xe6\x96\x87 \xf0\x9f\x98\x80'

# Decoding back
decoded = utf8_bytes.decode('utf-8')
print(decoded)  # Hello, 中文 😀
```
Why UTF-8 Won the Web
UTF-8's dominance — it is used by over 98% of websites as of 2024 — is the result of several key properties:
ASCII compatibility: Any valid ASCII byte sequence is valid UTF-8 with identical meaning. This meant UTF-8 files could be processed by legacy ASCII software without modification for the common case of English text. No other Unicode encoding has this property.
Self-synchronizing: The byte patterns are designed so you can always tell whether any given byte is a leading byte or a continuation byte. If you lose sync (say, you start reading in the middle of a stream), you can re-sync at the next leading byte. UTF-16 is only partially self-synchronizing: high and low surrogates are distinguishable at the 16-bit code-unit level, but a raw byte stream gives you no way to find code-unit boundaries.
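A quick demonstration of re-synchronization: drop the first byte of a UTF-8 stream, landing mid-character, and only that one damaged character is lost:

```python
# Start decoding one byte into a 3-byte sequence: the decoder flags the
# orphaned continuation bytes, then recovers at the next leading byte.
data = "中文字".encode("utf-8")       # 9 bytes: three 3-byte sequences
broken = data[1:]                     # lose sync inside the first character
recovered = broken.decode("utf-8", errors="replace")
print(recovered)                      # replacement characters, then 文字 intact
```

Only the truncated first character turns into U+FFFD replacement characters; everything after the next leading byte decodes cleanly.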
No byte order issues: UTF-8 has a fixed byte order — it is just a sequence of bytes. UTF-16 and UTF-32 must specify endianness.
Space efficiency for Latin text: For text that is predominantly ASCII (English, most source code, HTML/XML), UTF-8 uses 1 byte per character — identical to ASCII. CJK-heavy text actually takes 3 bytes per ideograph in UTF-8 vs 2 in UTF-16, but this trade-off is worth it for the other benefits.
The Byte Order Mark (BOM)
When you use UTF-16 or UTF-32, you must specify whether the bytes are in big-endian or little-endian order. For UTF-16:
- Big-endian (UTF-16 BE): High byte first. U+0041 → 0x00 0x41
- Little-endian (UTF-16 LE): Low byte first. U+0041 → 0x41 0x00
The Byte Order Mark (BOM) is the character U+FEFF placed at the very beginning of a file or stream. Its byte representation reveals the endianness to the reader:
| Encoding | BOM Bytes |
|---|---|
| UTF-8 | EF BB BF |
| UTF-16 BE | FE FF |
| UTF-16 LE | FF FE |
| UTF-32 BE | 00 00 FE FF |
| UTF-32 LE | FF FE 00 00 |
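Python's standard library exposes each BOM as a constant in the codecs module, so the table above can be verified directly:

```python
import codecs

# Each BOM is just U+FEFF serialized in the corresponding encoding.
for name, bom in [("UTF-8",     codecs.BOM_UTF8),
                  ("UTF-16 BE", codecs.BOM_UTF16_BE),
                  ("UTF-16 LE", codecs.BOM_UTF16_LE),
                  ("UTF-32 BE", codecs.BOM_UTF32_BE),
                  ("UTF-32 LE", codecs.BOM_UTF32_LE)]:
    print(f"{name:9} {bom.hex(' ').upper()}")

# Equivalently: encode U+FEFF itself with a BOM-free codec
print("\ufeff".encode("utf-16-be").hex(' ').upper())  # FE FF
```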
UTF-8 BOM is technically unnecessary — UTF-8 has no byte order — but some Windows tools (notably older versions of Notepad) write a UTF-8 BOM anyway. This causes problems: a UTF-8 BOM is invisible in text editors but breaks scripts that check the first bytes of a file, HTML parsers, and CSV imports.
Best practice: For UTF-8, do not write a BOM. If you must read files that might have one, strip it explicitly:
```python
# Strip UTF-8 BOM if present
with open('file.txt', 'rb') as f:
    data = f.read()
if data.startswith(b'\xef\xbb\xbf'):
    data = data[3:]
text = data.decode('utf-8')

# Or let Python handle it:
with open('file.txt', encoding='utf-8-sig') as f:
    text = f.read()  # 'utf-8-sig' codec strips the BOM automatically
```
In JavaScript:
```javascript
// Node.js
const fs = require('fs');
let text = fs.readFileSync('file.txt', 'utf8');

// Strip BOM if present
if (text.charCodeAt(0) === 0xFEFF) {
  text = text.slice(1);
}
```
Encoding Detection
One persistent problem is that a byte sequence alone does not always tell you its encoding. A file containing only ASCII bytes is simultaneously valid UTF-8 and valid Latin-1 (with identical meaning), and if its length is even it will even decode as UTF-16 — to completely different characters. You need external information to know which encoding was intended.
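The classic symptom of guessing wrong is mojibake — the bytes decode without error, just to the wrong characters:

```python
# UTF-8 bytes for "café" decode "successfully" as Latin-1, too --
# no exception is raised, you just get mojibake.
data = "café".encode("utf-8")      # b'caf\xc3\xa9'
print(data.decode("utf-8"))        # café
print(data.decode("latin-1"))      # cafÃ© (the two bytes of é read separately)
```

Because Latin-1 assigns a character to every byte value, it never raises a decoding error, which is exactly why the wrong guess goes unnoticed.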
Strategies for determining encoding:
1. HTTP Content-Type header: Content-Type: text/html; charset=UTF-8
2. HTML meta tag: <meta charset="UTF-8">
3. XML declaration: <?xml version="1.0" encoding="UTF-8"?>
4. File BOM: If present, indicates encoding and endianness
5. Heuristic detection: Libraries like chardet (Python) or uchardet analyze byte patterns to guess the encoding — useful for legacy files but not reliable
The lesson: always declare your encoding explicitly. Never make users (or programs) guess.
Comparison Table
| Feature | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Bytes per ASCII char | 1 | 2 | 4 |
| Bytes per BMP char | 1–3 | 2 | 4 |
| Bytes per supplementary char | 4 | 4 | 4 |
| ASCII backward-compatible | Yes | No | No |
| Variable-length | Yes | Yes (surrogates) | No |
| Byte order issues | No | Yes | Yes |
| Self-synchronizing | Yes | Partial | Yes |
| Web adoption | ~98% | Rare | Rare |
| Common uses | Web, files, APIs | Windows, Java, JS internals | Unix locales, internal processing |
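The per-character rows of the table translate directly into total file sizes. A small sketch comparing the same strings under BOM-free codecs:

```python
# Byte counts for the same text in each encoding (LE codecs avoid the BOM).
samples = {"English": "Hello world", "Chinese": "你好世界", "Emoji": "😀😀"}
for name, text in samples.items():
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"{name:8} {sizes}")
```

English text is smallest in UTF-8 (1 byte/char), CJK text is smallest in UTF-16 (2 vs 3 bytes/char), and emoji cost 4 bytes everywhere.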
Practical Recommendations
For web development: Always use UTF-8. Declare it in your HTTP headers and HTML meta tags. Configure your database to use UTF-8 (specifically utf8mb4 in MySQL, which supports the full Unicode range including emoji — the original utf8 type in MySQL only supports the BMP).
For file I/O: UTF-8, with no BOM. Use explicit encoding declarations rather than relying on system defaults.
For APIs: UTF-8 in JSON (the JSON spec requires Unicode; UTF-8 is the default and most common in practice).
For Windows interoperability: Use UTF-16 LE when calling Windows APIs directly, but convert to UTF-8 at API boundaries when exchanging data with other systems.
For in-memory processing: Use your language's native string type. Python 3 str, Java String, JavaScript string, and C# string all handle the encoding details internally — just be aware of their internal representations when measuring string lengths or iterating characters.
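In Python, for instance, len() counts code points — which matches neither the UTF-8 byte count nor the UTF-16 code-unit count:

```python
# Python's str counts code points, not bytes or UTF-16 code units.
s = "é😀"
print(len(s))                      # 2 code points
print(len(s.encode("utf-8")))      # 6 bytes (2 + 4)
print(len(s.encode("utf-16-le")))  # 6 bytes (1 + 2 code units, 2 bytes each)
```

Contrast this with JavaScript, where `'é😀'.length` is 3 because length counts UTF-16 code units.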
Next in Series: Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained — Learn why the same visible character can be encoded multiple ways, and how normalization prevents subtle bugs in string comparison and storage.