UTF-8 is the dominant character encoding on the web, used by over 98% of all websites. It is a variable-width encoding capable of representing every character in the Unicode standard while remaining fully backward-compatible with ASCII. Its design makes it the default choice for nearly all modern software, file formats, and network protocols.
How UTF-8 Works
UTF-8 encodes each Unicode code point using 1 to 4 bytes. The number of bytes required depends on the code point's value:
| Code Point Range | Bytes Used | Example Character |
|---|---|---|
| U+0000 - U+007F | 1 byte | A (U+0041) |
| U+0080 - U+07FF | 2 bytes | é (U+00E9) |
| U+0800 - U+FFFF | 3 bytes | 中 (U+4E2D) |
| U+10000 - U+10FFFF | 4 bytes | 😀 (U+1F600) |
For ASCII characters (U+0000-U+007F), UTF-8 uses a single byte identical to the ASCII value. This backward-compatibility means that any ASCII text is also valid UTF-8.
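The byte counts in the table can be checked with a short sketch like the one below. It relies only on Python's built-in codec; the helper name `utf8_length` is hypothetical and introduced purely for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes Python's UTF-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# One sample code point from each row of the table
samples = [0x0041, 0x00E9, 0x4E2D, 0x1F600]
lengths = [utf8_length(cp) for cp in samples]

for cp, n in zip(samples, lengths):
    print(f"U+{cp:04X} -> {n} byte(s)")

# ASCII-only text yields identical bytes under both encodings
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```

The last comparison illustrates the backward-compatibility point: for text limited to U+0000-U+007F, the ASCII and UTF-8 byte strings are the same.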
Encoding Structure
The leading byte of a multi-byte sequence signals how many bytes follow:
- 0xxxxxxx: 1-byte sequence (ASCII)
- 110xxxxx 10xxxxxx: 2-byte sequence
- 1110xxxx 10xxxxxx 10xxxxxx: 3-byte sequence
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4-byte sequence
Continuation bytes always start with 10, making it easy to re-synchronize after a read error.
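To make the bit patterns concrete, here is a minimal hand-written encoder, offered as a sketch rather than a replacement for the built-in codec. The function names `encode_code_point` and `is_continuation` are hypothetical. The rejection of the surrogate range U+D800-U+DFFF is an addition not covered in the text above; it mirrors what Python's own UTF-8 codec does.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand using the UTF-8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


print(encode_code_point(0x00E9).hex())  # c3a9
```

A decoder that lands in the middle of a stream can call `is_continuation` on each byte and skip forward until it returns False, which is the re-synchronization property in practice.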
Encoding and Decoding in Code
# Python: encoding and decoding UTF-8
text = 'Hello, world'
bytes_data = text.encode('utf-8')
print(bytes_data) # b'Hello, ...'
print(bytes_data.decode('utf-8'))
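# Compare character count with byte count.
# The two match only for ASCII-only text; non-ASCII characters take 2 to 4 bytes each.
print(len(text)) # character count (code points)
print(len(bytes_data)) # byte count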
// JavaScript: using TextEncoder / TextDecoder
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode('Hello');
console.log(bytes.length);
const decoder = new TextDecoder('utf-8');
console.log(decoder.decode(bytes));
Why UTF-8 Won
UTF-8 succeeded over other encodings for several reasons. Its ASCII compatibility meant existing systems and tools continued to work without modification. Its self-synchronizing byte structure makes it robust: a corrupted or missing byte typically damages only the character it belongs to, because a decoder can resume at the next leading byte instead of misreading the rest of the stream. It is also space-efficient for Latin-script content: English text takes the same space as ASCII.
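The robustness claim can be illustrated with a small experiment. The sketch below simulates a lost byte and decodes with `errors="replace"`, which substitutes U+FFFD for undecodable input instead of raising. The sample string is an arbitrary choice for this example.

```python
text = "naïve café 中文"
data = text.encode("utf-8")

# Simulate corruption: drop the leading byte of the two-byte sequence for "ï"
start = data.index("ï".encode("utf-8"))
damaged = data[:start] + data[start + 1:]

# The orphaned continuation byte is replaced; decoding resumes at the next character
recovered = damaged.decode("utf-8", errors="replace")
print(recovered)
```

Only the damaged character is lost; everything after it, including the later multi-byte characters, decodes as before.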
IETF policy (RFC 2277) requires new internet protocols to support UTF-8. HTML5 specifies UTF-8 as the recommended encoding for all web pages. The <meta charset='utf-8'> tag in an HTML document's <head> instructs browsers to interpret the page as UTF-8.
Common Pitfalls
Mixing encodings is the most frequent source of text corruption. If a file saved as UTF-8 is opened with a Latin-1 decoder, each byte of a multi-byte sequence is read as a separate character, so é appears as Ã©. Always declare and consistently use UTF-8 throughout your stack, from the database connection charset to the HTTP Content-Type header.
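The mismatch described above is easy to reproduce. The following sketch decodes UTF-8 bytes with the wrong codec, reverses the mistake, and then shows one way to avoid it in file I/O by always passing `encoding="utf-8"` explicitly. The file name `sample.txt` is a placeholder for illustration.

```python
import os
import tempfile

original = "café"
utf8_bytes = original.encode("utf-8")

# Wrong decoder: each byte of the two-byte sequence for "é" becomes its own character
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# If the text has not been modified since, reversing the mistake restores it
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# Explicit encodings on both write and read avoid relying on platform defaults
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(original)
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == original)  # True
```

The repair step only works when the garbled text has not been re-saved or edited in a lossy way, so it is better treated as a diagnostic than as a fix.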