SymbolFYI

Unicode in URLs: Percent-Encoding, Punycode, and IRIs

URLs were designed for ASCII. When the web went global, two separate mechanisms emerged to carry Unicode through ASCII-only infrastructure: percent-encoding for paths and query strings, and Punycode for domain names. Understanding both — and knowing which encoding applies where — prevents some of the most confusing bugs in web development.

The Anatomy of a URL and Where Unicode Can Appear

A URL has several distinct components, each with its own encoding rules:

https://münchen.de/straße?q=über+alles&lang=de#einführung

  • domain: münchen.de (Punycode)
  • path: /straße (percent-encoded)
  • query: q=über+alles&lang=de (percent-encoded)
  • fragment: einführung (percent-encoded)

The scheme (https), port, and the structural characters (:, /, ?, #, &, =) are always ASCII. Everything else may contain non-ASCII characters and must be encoded for safe transport.
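These components map directly onto Python's urllib.parse.urlparse, which splits a URL structurally without decoding anything (a quick stdlib illustration):

```python
from urllib.parse import urlparse

# urlparse separates the structural components; it does not decode them
parts = urlparse("https://example.com:8443/a/b?q=1&lang=de#intro")
parts.scheme    # 'https'
parts.hostname  # 'example.com'
parts.port      # 8443
parts.path      # '/a/b'
parts.query     # 'q=1&lang=de'
parts.fragment  # 'intro'
```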

Percent-Encoding (URL Encoding)

Percent-encoding replaces a byte with % followed by two uppercase hex digits. Because URLs are transmitted as bytes, non-ASCII characters must first be encoded as UTF-8, then each byte is percent-encoded:

€  →  UTF-8 bytes: 0xE2 0x82 0xAC  →  %E2%82%AC
字  →  UTF-8 bytes: 0xE5 0xAD 0x97  →  %E5%AD%97
😀 →  UTF-8 bytes: 0xF0 0x9F 0x98 0x80  →  %F0%9F%98%80
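The two steps can be sketched by hand in a few lines (an illustration only; real encoders like urllib.parse.quote also pass unreserved ASCII through unencoded):

```python
def percent_encode(s: str) -> str:
    """UTF-8-encode a string, then percent-encode every byte."""
    return ''.join(f'%{b:02X}' for b in s.encode('utf-8'))

percent_encode('€')   # '%E2%82%AC'
percent_encode('字')  # '%E5%AD%97'
percent_encode('😀')  # '%F0%9F%98%80'
```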

Characters that must be percent-encoded

RFC 3986 defines two categories:

Reserved characters — have special meaning in URLs and must be encoded when used as data:

: / ? # [ ] @ ! $ & ' ( ) * + , ; =

Unreserved characters — safe to use unencoded anywhere:

A-Z a-z 0-9 - _ . ~

Everything else — spaces, non-ASCII characters, and most symbols — must be encoded.
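urllib.parse.quote applies exactly this split, which is easy to verify:

```python
from urllib.parse import quote

# Unreserved characters pass through untouched:
quote("AZaz09-_.~", safe='')   # 'AZaz09-_.~'

# Reserved characters and spaces are encoded when used as data:
quote("a&b=c d", safe='')      # 'a%26b%3Dc%20d'
```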

Spaces: %20 vs. +

This is a persistent source of confusion:

  • %20 — the correct percent-encoding of U+0020 SPACE in any URL component
  • + — shorthand for space in application/x-www-form-urlencoded format (HTML form submissions) only
from urllib.parse import quote, quote_plus, urlencode

# quote: for path and query value encoding (space → %20)
quote("hello world")         # 'hello%20world'
quote("café")                # 'caf%C3%A9'
quote("/path/to/file", safe='/') # '/path/to/file' — / is safe

# quote_plus: for HTML form encoding (space → +)
quote_plus("hello world")    # 'hello+world'
quote_plus("café")           # 'caf%C3%A9'

# urlencode: for building query strings (uses quote_plus)
urlencode({'q': 'über alles', 'lang': 'de'})
# 'q=%C3%BCber+alles&lang=de'
// JavaScript equivalents:
// encodeURIComponent: encodes everything except A-Z a-z 0-9 - _ . ! ~ * ' ( )
encodeURIComponent("hello world")   // 'hello%20world'
encodeURIComponent("café")          // 'caf%C3%A9'
encodeURIComponent("a=b&c=d")       // 'a%3Db%26c%3Dd' — encodes = and &

// encodeURI: does NOT encode structural characters (: / ? # @ etc.)
encodeURI("https://example.com/path with spaces")
// 'https://example.com/path%20with%20spaces'

// URLSearchParams: correct form encoding
const params = new URLSearchParams({ q: 'über alles', lang: 'de' });
params.toString()  // 'q=%C3%BCber+alles&lang=de'

// The URL constructor handles encoding automatically:
const url = new URL("https://example.com");
url.pathname = "/über/alles";  // stores internally, encodes on output
url.href  // 'https://example.com/%C3%BCber/alles'

Do not use encodeURI on components

// encodeURI leaves URL-structural characters (? & # =) alone:
encodeURI("https://example.com?q=a&b=c#frag")
// 'https://example.com?q=a&b=c#frag' — unchanged; acceptable only because this is a complete URL

// Wrong for individual components:
const query = encodeURI("rock & roll")
// 'rock%20&%20roll' — & is NOT encoded, breaks query string parsing

// Correct for components:
const query2 = encodeURIComponent("rock & roll")
// 'rock%20%26%20roll' — & is encoded correctly

The rule: use encodeURIComponent for individual URL components (path segments, query values, fragment). Use encodeURI only for encoding a complete URL that you know is structurally correct and only needs path/query-level encoding.

IRIs: Internationalized Resource Identifiers

An IRI (RFC 3987) extends URI syntax to allow non-ASCII characters directly:

https://münchen.de/straße?q=über

IRIs allow Unicode characters in most positions without percent-encoding. However, IRIs must be converted to URIs (ASCII-only) before transmission over legacy infrastructure. Modern browsers display the IRI form in the address bar but transmit the URI form.

The browser's address bar shows you an IRI; the actual network request uses the percent-encoded URI. This distinction matters when you are parsing URLs from user input or building URLs that will be displayed.

Punycode and Internationalized Domain Names (IDN)

Domain names have a stricter requirement: they must be pure ASCII in DNS queries. Internationalized domain names (IDN) use Punycode (RFC 3492) to encode Unicode domain labels as ASCII.

The ACE (ASCII Compatible Encoding) prefix is xn--. A Punycode-encoded label is xn-- followed by the label's ASCII characters, then a single hyphen delimiter (present only when the label contains ASCII characters), then a compact base-36 encoding of the non-ASCII characters and their positions:

münchen.de  →  xn--mnchen-3ya.de
日本語.jp     →  xn--wgv71a119e.jp
中文.com     →  xn--fiq228c.com
bücher.de   →  xn--bcher-kva.de
# Python: the built-in 'idna' codec implements IDNA 2003
"münchen.de".encode("idna")            # b'xn--mnchen-3ya.de'
b"xn--mnchen-3ya.de".decode("idna")    # 'münchen.de'

# Use the 'idna' package for full IDNA 2008 compliance:
# pip install idna
import idna

idna.encode("münchen.de")              # b'xn--mnchen-3ya.de'
idna.decode("xn--mnchen-3ya.de")       # 'münchen.de'
idna.encode("日本語.jp")               # b'xn--wgv71a119e.jp'

# Encode a full URL: Punycode the host, percent-encode the path
from urllib.parse import urlparse, urlunparse, quote

def encode_url_host(url: str) -> str:
    parsed = urlparse(url)
    try:
        host = idna.encode(parsed.hostname).decode('ascii')
    except idna.IDNAError:
        host = parsed.hostname  # fallback if the host is not a valid IDN
    path = quote(parsed.path)   # percent-encode non-ASCII path characters
    # Note: a production version would also preserve port and userinfo
    return urlunparse(parsed._replace(netloc=host, path=path))

encode_url_host("https://münchen.de/straße")
# 'https://xn--mnchen-3ya.de/stra%C3%9Fe'
// JavaScript: URL constructor handles IDN automatically
new URL("https://münchen.de/straße").href
// 'https://xn--mnchen-3ya.de/stra%C3%9Fe'

// Note: the URL API always exposes the ASCII (Punycode) form of the host:
new URL("https://xn--mnchen-3ya.de").hostname
// 'xn--mnchen-3ya.de' — browsers decode only for address-bar display
// In Node.js, url.domainToUnicode('xn--mnchen-3ya.de') returns 'münchen.de'

IDNA 2003 vs. IDNA 2008

There are two versions of the IDN standard:

  • IDNA 2003 (RFC 3490) — used by older systems; maps some characters using Unicode 3.2 compatibility mappings
  • IDNA 2008 (RFC 5891) — stricter; rejects some mappings IDNA 2003 allowed; what most registrars use now

The differences are subtle but matter for specific characters like ß (sharp s), which IDNA 2003 maps to ss but IDNA 2008 preserves, encoding it as xn--zca. Use the idna package's default (IDNA 2008) for new code.
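The ß difference is visible with the standard library alone (a stdlib-only sketch; the IDNA 2008 behavior shown above requires the third-party idna package):

```python
# The built-in 'idna' codec implements IDNA 2003, so ß is folded to ss:
"straße.de".encode("idna")    # b'strasse.de'

# The original spelling is lost on round trip — decoding gives 'strasse.de',
# whereas the idna package (IDNA 2008) would preserve ß as xn--zca:
b"strasse.de".decode("idna")  # 'strasse.de'
```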

Building URLs Correctly

Always use URL builder APIs rather than string concatenation:

from urllib.parse import urljoin, urlparse, urlunparse, urlencode, quote

# Building a URL from components:
def build_url(
    base: str,
    path: str,
    query_params: dict[str, str] | None = None,
    fragment: str = "",
) -> str:
    parsed = urlparse(base)
    safe_path = '/'.join(quote(segment, safe='') for segment in path.split('/'))
    query = urlencode(query_params or {}, quote_via=quote)
    return urlunparse((
        parsed.scheme,
        parsed.netloc,
        safe_path,
        '',           # params (semicolon-separated, rarely used)
        query,
        quote(fragment, safe='')
    ))

build_url(
    "https://example.com",
    "/search/über alles",
    {"q": "café & résumé", "lang": "de"},
    "section 1"
)
# 'https://example.com/search/%C3%BCber%20alles?q=caf%C3%A9%20%26%20r%C3%A9sum%C3%A9&lang=de#section%201'
// JavaScript: compose URLs with the URL API
function buildUrl(base, path, params = {}, fragment = '') {
  const url = new URL(base);
  url.pathname = path;  // URL API handles encoding
  Object.entries(params).forEach(([k, v]) => url.searchParams.set(k, v));
  url.hash = fragment;
  return url.href;
}

buildUrl(
  'https://example.com',
  '/über/alles',
  { q: 'café & résumé', lang: 'de' },
  'section 1'
);
// 'https://example.com/%C3%BCber/alles?q=caf%C3%A9+%26+r%C3%A9sum%C3%A9&lang=de#section%201'

Parsing and Decoding URLs

When consuming URLs, always decode before displaying to users:

from urllib.parse import urlparse, unquote, parse_qs

raw_url = "https://example.com/search/%C3%BCber%20alles?q=caf%C3%A9&lang=de"
parsed = urlparse(raw_url)

# Decode for display
display_path = unquote(parsed.path)     # '/search/über alles'
query = parse_qs(parsed.query)
# {'q': ['café'], 'lang': ['de']} — parse_qs decodes automatically
// JavaScript: URL API decodes automatically
const url = new URL("https://example.com/search/%C3%BCber%20alles?q=caf%C3%A9");
url.pathname          // '/search/über alles' — decoded
url.searchParams.get('q')  // 'café' — decoded

// Manual decoding:
decodeURIComponent('%C3%BCber%20alles')  // 'über alles'
decodeURI('https://example.com/%C3%BCber')  // 'https://example.com/über'

Common Pitfalls

Double-encoding: Encoding an already-encoded URL component turns %20 into %2520, because the % itself is re-encoded as %25. Always encode raw values, never already-encoded strings.

# Wrong
encoded = quote("über")     # '%C3%BCber'
double_encoded = quote(encoded)  # '%25C3%25BCber' — wrong!

# Correct: encode once, at the boundary
raw_value = "über"
url = f"/search/{quote(raw_value, safe='')}"

Query string + vs %20: A + in a URL query string means space in application/x-www-form-urlencoded. A literal + character must be encoded as %2B. When parsing query strings with your own code (rather than standard library functions), be sure to convert + to space before decoding.
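Python's stdlib makes the distinction explicit; compare unquote with unquote_plus, and note that parse_qs applies form decoding:

```python
from urllib.parse import parse_qs, unquote, unquote_plus

unquote("a+b")        # 'a+b'  — unquote leaves + alone
unquote_plus("a+b")   # 'a b'  — form decoding turns + into space
parse_qs("q=a+b")     # {'q': ['a b']}  — + meant space
parse_qs("q=a%2Bb")   # {'q': ['a+b']}  — %2B meant a literal +
```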

Relative URL resolution: urljoin and the JavaScript URL constructor implement RFC 3986 resolution. The result depends on whether the base path ends in a slash, a common source of off-by-one-segment bugs, and percent-encoded components in base URLs are resolved as-is, without being decoded first.
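The trailing-slash behavior is worth memorizing (stdlib urljoin examples):

```python
from urllib.parse import urljoin

# Without a trailing slash, the base's last segment is replaced:
urljoin("https://example.com/a/b", "c")    # 'https://example.com/a/c'

# With a trailing slash, the relative path is appended:
urljoin("https://example.com/a/b/", "c")   # 'https://example.com/a/b/c'

# A leading slash resolves from the root:
urljoin("https://example.com/a/b/", "/c")  # 'https://example.com/c'
```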

Handling Unicode in Redirects and HTTP Headers

HTTP headers are ASCII-only by specification. URLs in Location redirect headers must be ASCII, which means percent-encoding all non-ASCII path and query characters:

from django.http import HttpResponseRedirect
from urllib.parse import quote

def redirect_to_search(request):
    query = request.GET.get('q', '')
    # Correct: encode before placing in redirect URL
    safe_query = quote(query, safe='')
    return HttpResponseRedirect(f'/search/?q={safe_query}')
// Express.js: build the Location URL with the URL API so encoding is handled for you
app.get('/go', (req, res) => {
  const url = new URL('https://example.com/search');
  url.searchParams.set('q', req.query.q || '');  // URL API encodes automatically
  res.redirect(url.href);
});

For Content-Disposition headers (file downloads), filenames with non-ASCII characters require the filename* parameter with RFC 5987 encoding:

# RFC 5987 encoding for non-ASCII filenames
import urllib.parse

def download_response(response, filename: str):
    ascii_fallback = filename.encode('ascii', errors='replace').decode('ascii')
    encoded_name = urllib.parse.quote(filename, safe='')
    response['Content-Disposition'] = (
        f"attachment; filename=\"{ascii_fallback}\"; "
        f"filename*=UTF-8''{encoded_name}"
    )
    return response

download_response(response, "报告_2024.pdf")
# Content-Disposition: attachment; filename="??_2024.pdf";
#   filename*=UTF-8''%E6%8A%A5%E5%91%8A_2024.pdf

Fragment Identifiers and Non-ASCII

The URL fragment (#section) is percent-encoded in the same way as paths and query values, but it is handled differently: the fragment is never sent to the server — it exists only in the browser. This means non-ASCII fragments are safe to use for in-page navigation but cannot be processed server-side:

<!-- Valid: browser percent-encodes automatically when navigating -->
<a href="#über-uns">About Us</a>

<!-- The URL bar shows: example.com/page#über-uns -->
<!-- The actual request to the server is: example.com/page -->
<!-- No fragment is transmitted -->
// Reading the fragment in JavaScript:
window.location.hash          // '#%C3%BCber-uns' or '#über-uns' (browser-dependent)
decodeURIComponent(window.location.hash.slice(1))  // 'über-uns'

Canonical URLs and Normalization

When the same resource can be reached at multiple URLs (through mixed percent-encoding or different normalization), set a canonical URL to avoid duplicate content issues:

def normalize_url(url: str) -> str:
    """
    Normalize a URL to its canonical form:
    - lowercase scheme and host
    - percent-encode non-ASCII characters
    - decode unnecessarily encoded unreserved characters
    - normalize path (remove . and ..)
    """
    from urllib.parse import urlparse, urlunparse, quote, unquote
    import posixpath

    parsed = urlparse(url)

    # Lowercase scheme and host
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()

    # Normalize path: decode unreserved chars, re-encode the rest
    # Unreserved: A-Z a-z 0-9 - _ . ~
    path = quote(unquote(parsed.path), safe='/-._~')

    # Normalize path segments (remove . and ..)
    path = posixpath.normpath(path) if path else '/'

    # Note: decoding then re-encoding the whole query conflates %26/%3D in
    # values with the & and = separators; a strict normalizer would decode
    # each key and value separately with parse_qsl.
    query = quote(unquote(parsed.query), safe='=&+')
    fragment = quote(unquote(parsed.fragment), safe='')

    return urlunparse((scheme, netloc, path, '', query, fragment))

# Examples:
normalize_url("HTTPS://EXAMPLE.COM/path/../other/%61bc")
# 'https://example.com/other/abc'  — lowercased, '..' resolved, %61 decoded to 'a'

normalize_url("https://example.com/über%20alles")
# 'https://example.com/%C3%BCber%20alles'  — ü properly encoded

Use the SymbolFYI Encoding Converter to quickly find the percent-encoded form of any Unicode character, or verify a full encoding chain from Unicode code point to UTF-8 bytes to percent-encoded form.


Next in Series: IDN Homograph Attacks: When Unicode Becomes a Security Threat — how attackers exploit Unicode lookalike characters to create convincing phishing domains, and the browser and application-level defenses against them.
