SymbolFYI

UTF-8

Encoding

Definition

A variable-width character encoding that uses 1 to 4 bytes to represent Unicode code points. The dominant encoding on the web.

UTF-8 is the dominant character encoding on the web, used by over 98% of all websites. It is a variable-width encoding capable of representing every character in the Unicode standard while remaining fully backward-compatible with ASCII. Its design makes it the default choice for nearly all modern software, file formats, and network protocols.

How UTF-8 Works

UTF-8 encodes each Unicode code point using 1 to 4 bytes. The number of bytes required depends on the code point's value:

Code Point Range	Bytes Used	Example Character
U+0000 - U+007F	1 byte	`A` (U+0041)
U+0080 - U+07FF	2 bytes	`é` (U+00E9)
U+0800 - U+FFFF	3 bytes	`中` (U+4E2D)
U+10000 - U+10FFFF	4 bytes	`😀` (U+1F600)

For ASCII characters (U+0000-U+007F), UTF-8 uses a single byte identical to the ASCII value. This backward-compatibility means that any ASCII text is also valid UTF-8.
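The byte counts per range can be checked with a short Python sketch. The sample characters are one per code point range, and the helper name `utf8_length` is only illustrative.

```python
# One sample character per code point range, written as escapes so the
# result does not depend on how this source file itself is saved.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# ASCII-only text produces identical bytes under ASCII and UTF-8.
ascii_text = "plain ASCII"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
```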

Encoding Structure

The leading byte of a multi-byte sequence signals how many bytes follow:

0xxxxxxx - 1-byte sequence (ASCII)
110xxxxx 10xxxxxx - 2-byte sequence
1110xxxx 10xxxxxx 10xxxxxx - 3-byte sequence
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - 4-byte sequence
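To make the bit patterns concrete, here is a minimal hand-written encoder in Python. It is a sketch for illustration only: `encode_code_point` is a hypothetical helper, and the surrogate check (U+D800 to U+DFFF) is an addition based on the fact that UTF-8 does not encode surrogates. Real code should simply call `str.encode('utf-8')`, which the loop uses as a cross-check.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand using the UTF-8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),       # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),        # 10xxxxxx
                  0x80 | (cp & 0x3F)])              # 10xxxxxx

# Compare against Python's built-in encoder and show the raw bits.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    manual = encode_code_point(cp)
    assert manual == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: {' '.join(f'{b:08b}' for b in manual)}")
```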

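Continuation bytes always start with the bits 10, so a decoder that lands in the middle of a character can skip forward to the next byte that does not match this pattern and re-synchronize after a read error.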
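Re-synchronization can be sketched in a few lines of Python. The helper names `is_continuation` and `next_boundary` are hypothetical, and the sample string is chosen only for illustration.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

# Byte layout: 61 | C3 A9 | E4 B8 AD | 62
data = "a\u00e9\u4e2db".encode("utf-8")

# Pretend a reader landed inside the 3-byte sequence (index 4 is B8).
start = next_boundary(data, 4)
print(start)
print(data[start:].decode("utf-8"))
```

Starting from any offset, the reader skips at most three continuation bytes before reaching a byte that begins a character.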

Encoding and Decoding in Code

# Python: encoding and decoding UTF-8
text = 'Hello, world'
bytes_data = text.encode('utf-8')
print(bytes_data)  # b'Hello, ...'
print(bytes_data.decode('utf-8'))

# Check byte length vs character length
print(len(text))        # character count: 12
print(len(bytes_data))  # byte count: 12 (ASCII-only text uses one byte per character)

// JavaScript: using TextEncoder / TextDecoder
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode('Hello');
console.log(bytes.length);

const decoder = new TextDecoder('utf-8');
console.log(decoder.decode(bytes));

Why UTF-8 Won

UTF-8 succeeded over other encodings for several reasons. Its ASCII compatibility meant existing systems and tools continued to work without modification. Its self-synchronizing byte structure makes it robust -- a corrupted or missing byte does not cause the rest of the stream to be misread. It is also space-efficient for Latin-script content: English text takes the same space as ASCII.
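The robustness claim can be illustrated with a small Python experiment. It assumes a single byte is lost in transit and uses the standard `errors="replace"` option, which substitutes U+FFFD for undecodable bytes. The sample text is arbitrary.

```python
original = "caf\u00e9 na\u00efve r\u00e9sum\u00e9"
data = bytearray(original.encode("utf-8"))

# Simulate corruption: drop the leading byte (C3) of the first
# two-byte character, leaving a stray continuation byte behind.
del data[3]

# errors="replace" emits U+FFFD for bytes that cannot be decoded
# instead of raising UnicodeDecodeError.
recovered = bytes(data).decode("utf-8", errors="replace")
print(recovered)
```

Only the damaged character is replaced. The accented characters later in the string still decode correctly.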

IETF policy (RFC 2277) requires that new internet protocols be able to use UTF-8. HTML5 specifies UTF-8 as the recommended encoding for all web pages. The <meta charset='utf-8'> tag in an HTML document's <head> tells browsers to decode the page as UTF-8.

Common Pitfalls

Mixing encodings is the most frequent source of text corruption. If a file saved as UTF-8 is opened with a Latin-1 decoder, each byte of a multi-byte sequence is shown as a separate character, so `é` appears as `Ã©`. Always declare and consistently use UTF-8 throughout your stack -- from the database connection charset to the HTTP Content-Type header.
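The mismatch is easy to reproduce in Python. This sketch assumes the bytes themselves are intact and only the decoder was wrong, and the sample word is arbitrary.

```python
text = "caf\u00e9"
utf8_bytes = text.encode("utf-8")   # b'caf\xc3\xa9'

# Wrong decoder: Latin-1 maps every byte to one character, so the
# two-byte sequence C3 A9 turns into two unrelated characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# If no bytes were lost, reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == text)
```

The repair step only works while the garbled string still round-trips through Latin-1. Once the text has been re-saved or truncated, the original bytes may be unrecoverable.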

Related Terms