Unicode in URLs
URLs were originally defined to use only a limited subset of ASCII characters. Handling Unicode in URLs requires a layered system involving Internationalized Resource Identifiers (IRIs), Punycode for domain names, and percent-encoding for path and query components.
The IRI Standard (RFC 3987)
An Internationalized Resource Identifier (IRI) is the Unicode-aware superset of URLs. Where a URL permits only ASCII, an IRI permits Unicode characters in most components. Browsers and modern HTTP clients handle IRIs natively, displaying the human-readable form while converting to ASCII for transmission.
# IRI (displayed in browser address bar)
https://de.wikipedia.org/wiki/München
# URL (transmitted over the network)
https://de.wikipedia.org/wiki/M%C3%BCnchen
Percent-Encoding
Non-ASCII characters in URL paths, query strings, and fragments are represented using percent-encoding: each byte of the character's UTF-8 representation is written as %XX where XX is the hexadecimal byte value.
# 'ü' in UTF-8 is two bytes: 0xC3 0xBC
ü → %C3%BC
# '中' in UTF-8 is three bytes: 0xE4 0xB8 0xAD
中 → %E4%B8%AD
# '😀' in UTF-8 is four bytes: 0xF0 0x9F 0x98 0x80
😀 → %F0%9F%98%80
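The byte-by-byte mapping above can be reproduced in a few lines of Python. This is a minimal sketch, not a replacement for a library encoder. The helper name percent_encode_char is hypothetical, and it encodes every byte it is given, including ASCII that a real encoder would leave alone.

```python
def percent_encode_char(ch: str) -> str:
    """Hypothetical helper: write each UTF-8 byte of `ch` as %XX."""
    return ''.join(f'%{byte:02X}' for byte in ch.encode('utf-8'))

# Reproduce the three mappings listed above
for ch in ('ü', '中', '😀'):
    print(ch, '→', percent_encode_char(ch))
```

The number of %XX groups always equals the length of the character's UTF-8 encoding, which is why 'ü' yields two groups and '😀' yields four.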
from urllib.parse import quote, unquote, urlencode
# Encode a Unicode path segment
print(quote('München')) # M%C3%BCnchen
print(quote('中文搜索')) # %E4%B8%AD%E6%96%87%E6%90%9C%E7%B4%A2
# Decode percent-encoded URL back to Unicode
print(unquote('M%C3%BCnchen')) # München
# Encode query parameters (spaces become +)
params = {'q': 'café résumé', 'lang': 'fr'}
print(urlencode(params)) # q=caf%C3%A9+r%C3%A9sum%C3%A9&lang=fr
Domain Names: Punycode
The hostname component of a URL encodes non-ASCII characters with Punycode (see IDN), not percent-encoding. Each label that contains non-ASCII characters is converted separately and given the xn-- prefix:
https://münchen.de/ → https://xn--mnchen-3ya.de/
Path and query components use percent-encoding; the host uses Punycode. They are distinct systems.
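To illustrate how the two systems combine, here is a rough sketch of an IRI to URL conversion in Python. The function name iri_to_uri and the safe-character sets are assumptions made for this example. It relies on Python's built-in idna codec, which implements the older IDNA 2003 rules and may differ from what current browsers do for some characters. It also ignores any userinfo in the authority.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component. '%' is kept so that
# already-encoded input is not encoded twice (an assumption of this sketch).
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"

def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: Punycode for the host, percent-encoding elsewhere."""
    parts = urlsplit(iri)
    if parts.hostname is None:
        raise ValueError(f"no host in {iri!r}")
    # Host: built-in 'idna' codec (IDNA 2003), applied label by label
    host = parts.hostname.encode("idna").decode("ascii")
    # Userinfo is dropped here; only an explicit port is carried over
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=PATH_SAFE),       # UTF-8 percent-encoding
        quote(parts.query, safe=QUERY_SAFE),
        quote(parts.fragment, safe=QUERY_SAFE),
    ))

print(iri_to_uri("https://münchen.de/"))
print(iri_to_uri("https://de.wikipedia.org/wiki/München"))
```

Production code would more likely use a dedicated IDNA library and a full URL parser. The sketch only shows that the host and the rest of the URL go through different encoders.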
Reserved vs. Unreserved Characters
Not all ASCII characters can appear literally in URLs. RFC 3986 defines:
- Unreserved (never percent-encoded): A-Z a-z 0-9 - _ . ~
- Reserved (have special meaning, encode if used as data): : / ? # [ ] @ ! $ & ' ( ) * + , ; =
- Everything else (must be percent-encoded)
from urllib.parse import quote
# safe='' encodes even slashes; safe='/' (the default) preserves them
print(quote('/path/to/München', safe='/')) # /path/to/M%C3%BCnchen
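The effect of the three categories shows up when the same string is encoded with different safe settings. The sample values below are made up for illustration, and the behaviour shown for '~' assumes Python 3.7 or later.

```python
from urllib.parse import quote

value = 'a b&c=d/e'      # made-up data containing reserved characters
unreserved = 'Az09-_.~'  # sample of unreserved characters

# safe='' treats every reserved character as data and encodes it
as_data = quote(value, safe='')
# The default safe='/' keeps the slash as a path delimiter
as_path = quote(value)
# Unreserved characters pass through unchanged either way
untouched = quote(unreserved, safe='')

print(as_data)    # a%20b%26c%3Dd%2Fe
print(as_path)    # a%20b%26c%3Dd/e
print(untouched)  # Az09-_.~
```

Using safe='' is the usual choice when the string is a single value, such as one query parameter or one path segment, because any delimiter inside it is data rather than structure.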
JavaScript
// encodeURIComponent: encode a value for use inside a URL
console.log(encodeURIComponent('München'));
// M%C3%BCnchen
// decodeURIComponent: decode back to Unicode
console.log(decodeURIComponent('M%C3%BCnchen'));
// München
// URL API handles IRI → percent-encoded conversion automatically
const url = new URL('https://example.com/search?q=中文');
console.log(url.href); // https://example.com/search?q=%E4%B8%AD%E6%96%87
Practical Guidelines
- Always use UTF-8 as the basis for percent-encoding (RFC 3987 requirement)
- Use urllib.parse.urlencode or urllib.parse.quote in Python; never construct URLs by string concatenation
- When parsing URLs from user input or external sources, normalize to NFC before encoding to ensure consistent representation
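The NFC guideline matters because the same visible text can arrive as different code point sequences, which then percent-encode differently. A minimal sketch, assuming a hypothetical helper named encode_nfc:

```python
import unicodedata
from urllib.parse import quote

composed = 'caf\u00e9'     # 'é' as one code point (U+00E9)
decomposed = 'cafe\u0301'  # 'e' followed by combining acute accent (U+0301)

# Without normalization the two spellings produce different URLs
print(quote(composed))    # caf%C3%A9
print(quote(decomposed))  # cafe%CC%81

def encode_nfc(text: str) -> str:
    """Hypothetical helper: normalize to NFC, then percent-encode."""
    return quote(unicodedata.normalize('NFC', text), safe='')

# After NFC normalization both inputs encode identically
print(encode_nfc(composed) == encode_nfc(decomposed))  # True
```

Whether to normalize depends on the receiving server: one that stores resource names in a decomposed form would stop matching after normalization, so this is a default rather than a universal rule.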