SymbolFYI

Unicode in URLs & IRIs

Programming & Dev
Definition

How Unicode characters in URLs are handled: IRI (RFC 3987), percent-encoding of UTF-8 bytes, and browser display.

Unicode in URLs

URLs were originally defined to use only a limited subset of ASCII characters. Handling Unicode in URLs requires a layered system involving Internationalized Resource Identifiers (IRIs), Punycode for domain names, and percent-encoding for path and query components.

The IRI Standard (RFC 3987)

An Internationalized Resource Identifier (IRI) is the Unicode-aware generalization of the URI. Where a URI permits only a subset of ASCII, an IRI permits Unicode characters in most components. Browsers and many modern HTTP clients accept IRIs directly, displaying the human-readable form while converting it to ASCII for transmission.

# IRI (displayed in browser address bar)
https://de.wikipedia.org/wiki/München

# URL (transmitted over the network)
https://de.wikipedia.org/wiki/M%C3%BCnchen

Percent-Encoding

Non-ASCII characters in URL paths, query strings, and fragments are represented using percent-encoding: each byte of the character's UTF-8 representation is written as %XX where XX is the hexadecimal byte value.

# 'ü' in UTF-8 is two bytes: 0xC3 0xBC
ü → %C3%BC

# '中' in UTF-8 is three bytes: 0xE4 0xB8 0xAD
中 → %E4%B8%AD

# '😀' in UTF-8 is four bytes: 0xF0 0x9F 0x98 0x80
😀 → %F0%9F%98%80

from urllib.parse import quote, unquote, urlencode

# Encode a Unicode path segment
print(quote('München'))       # M%C3%BCnchen
print(quote('中文搜索'))       # %E4%B8%AD%E6%96%87%E6%90%9C%E7%B4%A2

# Decode percent-encoded URL back to Unicode
print(unquote('M%C3%BCnchen'))  # München

# Encode query parameters (spaces become +)
params = {'q': 'café résumé', 'lang': 'fr'}
print(urlencode(params))  # q=caf%C3%A9+r%C3%A9sum%C3%A9&lang=fr
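
To make the byte-by-byte rule concrete, here is a minimal sketch of percent-encoding written by hand. The function name percent_encode is hypothetical and exists only for illustration. It leaves the RFC 3986 unreserved characters (letters, digits, - _ . ~) untouched and encodes everything else, which roughly matches quote(text, safe='').

```python
import string

# RFC 3986 unreserved characters: these never need percent-encoding.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

def percent_encode(text: str) -> str:
    """Percent-encode every character outside the unreserved set.

    Hypothetical helper for illustration; urllib.parse.quote is the
    tool to use in real code.
    """
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # One %XX triplet per byte of the character's UTF-8 encoding.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

print(percent_encode("ü"))        # %C3%BC
print(percent_encode("中"))       # %E4%B8%AD
print(percent_encode("😀"))       # %F0%9F%98%80
print(percent_encode("München"))  # M%C3%BCnchen
```

The number of %XX triplets always equals the number of UTF-8 bytes, so a single visible character can expand to as many as twelve ASCII characters.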

Domain Names: Punycode

Non-ASCII characters in the hostname component of a URL are not percent-encoded. Each label that contains them is converted with Punycode and given the xn-- prefix (see IDN):

https://münchen.de/  →  https://xn--mnchen-3ya.de/
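
The same conversion can be tried with Python's built-in codecs, as a rough sketch. One caveat: the standard library's idna codec implements the older IDNA 2003 rules, so results for some characters may differ from what current browsers and registries produce. For this hostname the output matches the example above.

```python
host = "münchen.de"

# The 'idna' codec converts each non-ASCII label and adds the xn-- prefix.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--mnchen-3ya.de

# Decoding reverses the conversion for display.
print(ascii_host.encode("ascii").decode("idna"))  # münchen.de

# The raw 'punycode' codec works on a single label and adds no prefix.
print("münchen".encode("punycode"))  # b'mnchen-3ya'
```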

Path and query components use percent-encoding; the host uses Punycode. They are distinct systems.
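
Putting the two systems together, an IRI can be turned into an ASCII-only URL by treating each component separately. The following is a simplified sketch, not a complete RFC 3987 implementation: iri_to_uri is a hypothetical helper, it ignores userinfo, it relies on Python's IDNA 2003 codec for the host, and it leaves existing % signs alone on the assumption that they already belong to valid escapes.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII URL (simplified, illustrative only)."""
    parts = urlsplit(iri)

    # Host: Punycode via the built-in 'idna' codec (IDNA 2003 rules).
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # Path, query, fragment: percent-encode the UTF-8 bytes, keeping the
    # delimiters that give each component its structure.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&+%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))

print(iri_to_uri("https://münchen.de/wiki/München?q=中文"))
# https://xn--mnchen-3ya.de/wiki/M%C3%BCnchen?q=%E4%B8%AD%E6%96%87
```

Note that the ü in the host and the ü in the path end up encoded in two completely different ways within the same URL.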

Reserved vs. Unreserved Characters

Not all ASCII characters can appear literally in URLs. RFC 3986 defines:

  • Unreserved (never percent-encoded): A-Z a-z 0-9 - _ . ~
  • Reserved (have special meaning, encode if used as data): : / ? # [ ] @ ! $ & ' ( ) * + , ; =
  • Everything else (must be percent-encoded)
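
from urllib.parse import quote

# safe='/' (the default) preserves slashes; safe='' encodes them too
print(quote('/path/to/München', safe='/'))  # /path/to/M%C3%BCnchen
print(quote('/path/to/München', safe=''))   # %2Fpath%2Fto%2FM%C3%BCnchen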
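
The distinction matters most when a reserved character is part of the data rather than a delimiter. The sketch below uses made-up values (an artist name containing a slash, a query value containing & and =) and an example.com URL purely for illustration.

```python
from urllib.parse import quote

artist = "AC/DC"   # the slash is data here, not a path separator
value = "a&b=c"    # '&' and '=' are data, not query delimiters

# With the default safe='/', the slash survives and would split the segment.
print(quote(artist))            # AC/DC

# With safe='', reserved characters used as data are encoded.
print(quote(artist, safe=""))   # AC%2FDC
print(quote(value, safe=""))    # a%26b%3Dc

# Unreserved characters pass through unchanged either way.
print(quote("abc-_.~123", safe=""))  # abc-_.~123

url = f"https://example.com/artists/{quote(artist, safe='')}"
print(url)  # https://example.com/artists/AC%2FDC
```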

JavaScript

// encodeURIComponent: encode a value for use inside a URL
console.log(encodeURIComponent('München'));
// M%C3%BCnchen

// decodeURIComponent: decode back to Unicode
console.log(decodeURIComponent('M%C3%BCnchen'));
// München

// URL API handles IRI → percent-encoded conversion automatically
const url = new URL('https://example.com/search?q=中文');
console.log(url.href);  // https://example.com/search?q=%E4%B8%AD%E6%96%87

Practical Guidelines

  • Always use UTF-8 as the basis for percent-encoding (RFC 3987 requirement)
  • Use urllib.parse.urlencode or urllib.parse.quote in Python; never concatenate unescaped strings into a URL
  • When parsing URLs from user input or external sources, normalize to NFC before encoding to ensure consistent representation
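
The NFC guideline is easiest to see with a character that has two Unicode spellings. In this sketch, encode_segment is a hypothetical helper and the example.com URL is a placeholder. The same visible word produces two different percent-encoded strings unless it is normalized first.

```python
import unicodedata
from urllib.parse import quote, urlencode

def encode_segment(text: str) -> str:
    """NFC-normalize, then percent-encode as a single path segment."""
    return quote(unicodedata.normalize("NFC", text), safe="")

composed = "caf\u00e9"     # 'é' as one code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' plus combining acute accent (U+0301)

# Without normalization the two spellings encode differently.
print(quote(composed))     # caf%C3%A9
print(quote(decomposed))   # cafe%CC%81

# After NFC normalization both give the same result.
print(encode_segment(composed))    # caf%C3%A9
print(encode_segment(decomposed))  # caf%C3%A9

# Building the query with urlencode instead of pasting raw text into the URL.
query = urlencode({"q": unicodedata.normalize("NFC", decomposed)})
url = f"https://example.com/search?{query}"
print(url)  # https://example.com/search?q=caf%C3%A9
```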
