SymbolFYI

Unicode in URLs & IRIs

Definition

How Unicode characters in URLs are handled: IRI (RFC 3987), percent-encoding of UTF-8 bytes, and browser display.

Unicode in URLs

URLs were originally defined to use only a limited subset of ASCII characters. Handling Unicode in URLs requires a layered system involving Internationalized Resource Identifiers (IRIs), Punycode for domain names, and percent-encoding for path and query components.

The IRI Standard (RFC 3987)

An Internationalized Resource Identifier (IRI) is a Unicode-aware superset of the URI syntax that URLs follow. Where a URL permits only ASCII, an IRI permits Unicode characters in most components (the scheme remains ASCII). Browsers and most modern HTTP clients accept IRIs, displaying the human-readable form while converting it to an ASCII-only URL for transmission.

# IRI (displayed in browser address bar)
https://de.wikipedia.org/wiki/München

# URL (transmitted over the network)
https://de.wikipedia.org/wiki/M%C3%BCnchen

Percent-Encoding

Non-ASCII characters in URL paths, query strings, and fragments are represented using percent-encoding: each byte of the character's UTF-8 representation is written as %XX where XX is the hexadecimal byte value.

# 'ü' in UTF-8 is two bytes: 0xC3 0xBC
ü → %C3%BC

# '中' in UTF-8 is three bytes: 0xE4 0xB8 0xAD
中 → %E4%B8%AD

# '😀' in UTF-8 is four bytes: 0xF0 0x9F 0x98 0x80
😀 → %F0%9F%98%80
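The byte-by-byte rule above can be reproduced by hand. The following is a minimal sketch rather than how any particular library implements it; the helper name `percent_encode_utf8` is made up for illustration, and it deliberately encodes every byte, including ASCII bytes that a real encoder would leave literal.

```python
def percent_encode_utf8(text: str) -> str:
    """Percent-encode every UTF-8 byte of text (hypothetical helper)."""
    # Encode to UTF-8 first, then write each byte as %XX in uppercase hex.
    return ''.join(f'%{byte:02X}' for byte in text.encode('utf-8'))


for ch in ('ü', '中', '😀'):
    utf8_length = len(ch.encode('utf-8'))
    print(ch, utf8_length, percent_encode_utf8(ch))
```

The printed byte counts (2, 3, 4) match the number of %XX groups in each result.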

from urllib.parse import quote, unquote, urlencode

# Encode a Unicode path segment
print(quote('München'))       # M%C3%BCnchen
print(quote('中文搜索'))       # %E4%B8%AD%E6%96%87%E6%90%9C%E7%B4%A2

# Decode percent-encoded URL back to Unicode
print(unquote('M%C3%BCnchen'))  # München

# Encode query parameters (spaces become +)
params = {'q': 'café résumé', 'lang': 'fr'}
print(urlencode(params))  # q=caf%C3%A9+r%C3%A9sum%C3%A9&lang=fr
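The "spaces become +" behaviour is specific to form-style query encoding, and it matters when decoding. The sketch below compares the two styles using standard `urllib.parse` functions; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, unquote_plus

text = 'café résumé'

path_style = quote(text)       # space is written as %20
form_style = quote_plus(text)  # space is written as +
print(path_style)
print(form_style)

# Decoding must match the style that was used for encoding:
# unquote leaves '+' untouched, unquote_plus turns it back into a space.
print(unquote(form_style))
print(unquote_plus(form_style))

# parse_qs applies the form-style rule and returns a list per key.
decoded = parse_qs('q=caf%C3%A9+r%C3%A9sum%C3%A9&lang=fr')
print(decoded)
```

Decoding a form-encoded value with `unquote` instead of `unquote_plus` leaves a literal `+` in the text, which is a common source of subtle bugs.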

Domain Names: Punycode

The hostname component of a URL does not use percent-encoding. Instead, each label that contains non-ASCII characters is converted to Punycode and given the xn-- prefix (see IDN):

https://münchen.de/  →  https://xn--mnchen-3ya.de/

Path and query components use percent-encoding; the host uses Punycode. They are distinct systems.
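To show the two systems working side by side, here is a rough sketch of an IRI-to-URL conversion in Python. The function name `iri_to_uri` is hypothetical, and the code makes several simplifying assumptions: it drops any userinfo, it treats `%` as already-encoded input, and it relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules and may differ from current browsers for some characters.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URL (simplified, hypothetical helper)."""
    parts = urlsplit(iri)

    # Host: Punycode via the built-in 'idna' codec (IDNA 2003 rules).
    host = parts.hostname or ''
    ascii_host = host.encode('idna').decode('ascii') if host else ''
    netloc = ascii_host if parts.port is None else f'{ascii_host}:{parts.port}'

    # Path, query, fragment: percent-encode the UTF-8 bytes.
    # '%' is kept so that already-encoded input is not encoded twice.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&+%')
    fragment = quote(parts.fragment, safe='%')

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri('https://münchen.de/'))
print(iri_to_uri('https://de.wikipedia.org/wiki/München'))
```

For production use, a maintained URL library is a safer choice than a hand-rolled converter like this one.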

Reserved vs. Unreserved Characters

Not all ASCII characters can appear literally in URLs. RFC 3986 defines:

Unreserved (never need percent-encoding): A-Z a-z 0-9 - _ . ~
Reserved (have special meaning as delimiters; encode when used as data): : / ? # [ ] @ ! $ & ' ( ) * + , ; =
Everything else, including spaces and all non-ASCII characters (must be percent-encoded)

from urllib.parse import quote

# safe='' encodes even slashes; safe='/' (the default) preserves them
print(quote('/path/to/München', safe='/'))  # /path/to/M%C3%BCnchen
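The difference between the character classes shows up when a reserved character is meant as data rather than as a delimiter. A small sketch, with an invented sample value:

```python
from urllib.parse import quote

# Reserved characters used as data: with safe='' every one is encoded.
value = 'Tom & Jerry/S1=E1?'
encoded_value = quote(value, safe='')
print(encoded_value)

# Unreserved characters pass through unchanged, even with safe=''.
unreserved = 'A-Za-z0-9-_.~'
encoded_unreserved = quote(unreserved, safe='')
print(encoded_unreserved)
```

Passing `safe=''` is the usual choice for a single path segment or query value, since a literal `/` or `&` in the data would otherwise be read as structure.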

JavaScript

// encodeURIComponent: encode a value for use inside a URL
console.log(encodeURIComponent('München'));
// M%C3%BCnchen

// decodeURIComponent: decode back to Unicode
console.log(decodeURIComponent('M%C3%BCnchen'));
// München

// URL API handles IRI → percent-encoded conversion automatically
const url = new URL('https://example.com/search?q=中文');
console.log(url.href);  // https://example.com/search?q=%E4%B8%AD%E6%96%87

Practical Guidelines

Always use UTF-8 as the basis for percent-encoding (RFC 3987 requirement)
Use urllib.parse.urlencode or urllib.parse.quote in Python; never construct URLs by string concatenation
When parsing URLs from user input or external sources, normalize to NFC before encoding to ensure consistent representation
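The NFC guideline matters because visually identical strings can be made of different code points and therefore percent-encode differently. The sketch below illustrates this with the standard `unicodedata` module; the helper name `quote_nfc` is an assumption introduced for the example.

```python
import unicodedata
from urllib.parse import quote

composed = 'caf\u00e9'     # 'é' as a single code point (U+00E9)
decomposed = 'cafe\u0301'  # 'e' followed by combining acute accent (U+0301)

# The two strings look the same on screen but encode to different bytes.
print(quote(composed))
print(quote(decomposed))


def quote_nfc(text: str) -> str:
    """Normalize to NFC, then percent-encode (hypothetical helper)."""
    return quote(unicodedata.normalize('NFC', text))


# After normalization both inputs produce the same encoded form.
print(quote_nfc(composed) == quote_nfc(decomposed))
```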
