SymbolFYI

Python and Unicode: The Complete Developer's Guide

Python 3 made the right call: strings are Unicode by default. The str type is a sequence of Unicode code points, not bytes. There is no ambiguity between text and binary data — they are different types with incompatible operations. This design eliminates the most common class of encoding bugs from Python 2, but Unicode still requires deliberate handling at the boundaries where text enters and leaves your program.

str vs. bytes: The Fundamental Distinction

# str: sequence of Unicode code points
text = "Hello, 世界 🌍"
type(text)      # <class 'str'>
len(text)       # 11 — code points, not bytes

# bytes: sequence of raw bytes
data = b"Hello"
type(data)      # <class 'bytes'>
len(data)       # 5

# You cannot mix them:
"text" + b"bytes"  # TypeError: can only concatenate str (not "bytes") to str
"text".encode()    # b'text' — str → bytes
b"bytes".decode()  # 'bytes' — bytes → str

In Python 3, every string operation works on Unicode code points. You never need to think about UTF-8 when manipulating strings — only when reading from or writing to external sources (files, network, databases).
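For example, indexing, slicing, and case operations all count code points, so they behave identically for ASCII and non-ASCII text:

```python
s = "naïve café"

s[2]                      # 'ï' — one code point, though two bytes in UTF-8
s[:5]                     # 'naïve'
s.upper()                 # 'NAÏVE CAFÉ'
len(s)                    # 10 code points
len(s.encode('utf-8'))    # 12 bytes — byte counts only matter at the edges
```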

The Unicode Sandwich Pattern

The most important principle for text processing in Python: decode at the input boundary, process as str, encode at the output boundary. Your core logic should only ever see str, never bytes.

bytes (input) → decode() → [str processing] → encode() → bytes (output)
      ↑                                              ↑
  "Decode here"                                "Encode here"
  (file read, HTTP response, DB result)        (file write, HTTP request, DB insert)

# Wrong: passing bytes through business logic
def count_words(data: bytes) -> int:
    return len(data.split(b' '))  # fails on non-ASCII

# Correct: decode at the boundary
def count_words(text: str) -> int:
    return len(text.split())  # works for any language

# At the input boundary:
with open('document.txt', 'rb') as f:
    raw = f.read()
text = raw.decode('utf-8')         # decode once, at the edge
word_count = count_words(text)     # process as str
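The same principle applies on the way out. A minimal sketch of the output boundary (the filename and the "word count: 42" string are illustrative):

```python
summary = "word count: 42\n"                 # str produced by core logic
with open('summary.txt', 'wb') as f:
    f.write(summary.encode('utf-8'))         # encode once, as bytes leave

# Equivalently, let a text-mode file object do the encoding:
with open('summary.txt', 'w', encoding='utf-8') as f:
    f.write(summary)
```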

Encoding and Decoding

encode() and decode()

text = "Héllo wörld"

# str.encode() → bytes
utf8_bytes  = text.encode('utf-8')   # b'H\xc3\xa9llo w\xc3\xb6rld'
utf16_bytes = text.encode('utf-16')  # includes BOM by default
latin1_bytes = text.encode('latin-1') # b'H\xe9llo w\xf6rld'

# bytes.decode() → str
text_back = utf8_bytes.decode('utf-8')  # "Héllo wörld"

# Error handling:
# 'strict' (default): raise UnicodeDecodeError
# 'ignore': silently drop invalid bytes
# 'replace': replace with U+FFFD REPLACEMENT CHARACTER
# 'backslashreplace': replace with backslash escapes (\xNN on decode)

broken_bytes = b'\xff\xfe Hello'
broken_bytes.decode('utf-8', errors='replace')   # '�� Hello'
broken_bytes.decode('utf-8', errors='ignore')    # ' Hello'
broken_bytes.decode('utf-8', errors='backslashreplace')  # '\\xff\\xfe Hello'

File I/O

Always specify encoding explicitly. The default (locale.getpreferredencoding()) varies by platform and can produce code that works on macOS but fails on Windows:

# Wrong: relies on system default encoding
with open('file.txt', 'r') as f:
    content = f.read()

# Correct: explicit encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# For files with unknown encoding, use errors='replace' or detect first:
with open('unknown.txt', 'r', encoding='utf-8', errors='replace') as f:
    content = f.read()

# Writing: explicit encoding prevents platform-specific issues
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write("Hello, 世界\n")

# Binary mode + explicit encode/decode (most explicit):
with open('output.txt', 'wb') as f:
    f.write("Hello, 世界\n".encode('utf-8'))

The unicodedata Module

Python's standard library unicodedata module gives you access to the Unicode Character Database for any code point:

import unicodedata

# Character name
unicodedata.name("A")       # 'LATIN CAPITAL LETTER A'
unicodedata.name("€")       # 'EURO SIGN'
unicodedata.name("😀")      # 'GRINNING FACE'

# Unicode category
unicodedata.category("A")   # 'Lu' — Uppercase Letter
unicodedata.category("a")   # 'Ll' — Lowercase Letter
unicodedata.category("1")   # 'Nd' — Decimal Digit Number
unicodedata.category(" ")   # 'Zs' — Space Separator
unicodedata.category(".")   # 'Po' — Other Punctuation

# Numeric value
unicodedata.numeric("½")    # 0.5
unicodedata.numeric("⅓")    # 0.3333...
unicodedata.numeric("²")    # 2.0
unicodedata.decimal("5")    # 5

# Bidirectional category (for RTL text)
unicodedata.bidirectional("A")   # 'L' — Left-to-right
unicodedata.bidirectional("א")   # 'R' — Right-to-left

# Look up a character by name
unicodedata.lookup('SNOWMAN')          # '☃'
unicodedata.lookup('LATIN SMALL LETTER A WITH ACUTE')  # 'á'

Useful validation patterns using unicodedata

import unicodedata

def is_unicode_letter(char: str) -> bool:
    """Check if a single character is a letter in any script."""
    return unicodedata.category(char).startswith('L')

def is_unicode_digit(char: str) -> bool:
    """Check if a character is a decimal digit in any numeral system."""
    return unicodedata.category(char) == 'Nd'

def contains_only_letters_and_spaces(text: str) -> bool:
    return all(
        unicodedata.category(c).startswith('L') or c.isspace()
        for c in text
    )

def strip_accents(text: str) -> str:
    """Remove combining diacritical marks (accents)."""
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')

strip_accents("café")    # "cafe"
strip_accents("résumé")  # "resume"
strip_accents("Ångström") # "Angstrom"
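strip_accents pairs naturally with casefold() for accent- and case-insensitive matching. A sketch (caseless_key is an illustrative name; note that characters with no canonical decomposition, such as 'ø', keep their form — NFD cannot strip those):

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Accent- and case-insensitive comparison key (sketch)."""
    nfd = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
    return stripped.casefold()

caseless_key("Résumé") == caseless_key("resume")   # True
caseless_key("STRASSE") == caseless_key("straße")  # True — ß casefolds to 'ss'
```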

Normalization

Unicode normalization is essential before comparing, hashing, or storing text. The same visual character may be represented by more than one valid code point sequence:

import unicodedata

# "é" can be precomposed (NFC) or decomposed (NFD)
nfc = "\u00e9"           # é as single code point
nfd = "e\u0301"          # e + combining acute accent

nfc == nfd               # False — different code point sequences
len(nfc)                 # 1
len(nfd)                 # 2

# Normalize to NFC before comparison:
unicodedata.normalize('NFC', nfd) == nfc   # True

# All four normalization forms:
# NFC  — Canonical Decomposition, then Canonical Composition (web standard)
# NFD  — Canonical Decomposition
# NFKC — Compatibility Decomposition, then Composition (search/comparison)
# NFKD — Compatibility Decomposition

# NFKC collapses compatibility variants:
unicodedata.normalize('NFKC', 'ﬁ')   # "fi"  — fi ligature (U+FB01)
unicodedata.normalize('NFKC', '２')  # "2"   — fullwidth digit (U+FF12)
unicodedata.normalize('NFKC', "①")   # "1"   — circled digit
unicodedata.normalize('NFKC', "ℌ")   # "H"   — script capital H

Best practice for web applications:

- Store text as NFC in databases
- Normalize to NFC immediately on user input
- Use NFKC for search indexing and comparison where variant forms should match

def sanitize_text_input(text: str) -> str:
    """Normalize and clean user text input."""
    # NFC normalization
    text = unicodedata.normalize('NFC', text)
    # Strip leading/trailing whitespace
    text = text.strip()
    return text
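For the search-indexing side, a sketch (search_key is an illustrative name): NFKC folds compatibility variants and casefold() handles case:

```python
import unicodedata

def search_key(text: str) -> str:
    """Fold compatibility variants and case for search matching (sketch)."""
    return unicodedata.normalize('NFKC', text).casefold()

search_key("Ｈｅｌｌｏ") == search_key("hello")   # True — fullwidth forms match
search_key("ﬁle") == search_key("FILE")          # True — ligature folded
```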

Handling Encoding Errors in Production

import logging

logger = logging.getLogger(__name__)

def safe_decode(data: bytes, encoding: str = 'utf-8') -> str:
    """Decode bytes, logging if replacement was needed."""
    try:
        return data.decode(encoding, errors='strict')
    except UnicodeDecodeError as e:
        logger.warning(
            "Unicode decode error with %s — falling back to replacement: %s "
            "(data preview: %s)",
            encoding, e, data[:50].hex()
        )
        return data.decode(encoding, errors='replace')

def safe_encode(text: str, encoding: str = 'utf-8') -> bytes:
    """Encode text, replacing unencodable characters."""
    try:
        return text.encode(encoding, errors='strict')
    except UnicodeEncodeError as e:
        logger.warning(
            "Unicode encode error with %s — using replacement: %s",
            encoding, e
        )
        return text.encode(encoding, errors='replace')

Working with Surrogate Characters

Python can encounter surrogate code points when handling filenames (undecodable bytes on POSIX, unpaired UTF-16 surrogates on Windows) or when processing legacy data. The surrogateescape and surrogatepass error handlers deal with these:

# surrogateescape: maps invalid bytes to surrogate code points (U+DC80–U+DCFF)
# Useful for round-tripping bytes through str operations:
import os

# A filename whose bytes aren't valid UTF-8 (e.g. latin-1 on Linux)
filename_bytes = b'caf\xe9.txt'
filename_str = filename_bytes.decode('utf-8', errors='surrogateescape')
# '\xe9' is mapped to the surrogate U+DCE9; manipulate filename_str as a str
back_to_bytes = filename_str.encode('utf-8', errors='surrogateescape')
# back_to_bytes == b'caf\xe9.txt' — the original bytes round-trip exactly
# (on POSIX, os.fsdecode()/os.fsencode() apply this handler automatically)

# surrogatepass: allows encoding/decoding actual surrogate code points
# Used for Windows UCS-2 compatibility
data = "\uD800\uDC00"  # lone surrogates
data.encode('utf-16-le', errors='surrogatepass')  # works
data.encode('utf-8', errors='surrogatepass')       # works

Emoji and Supplementary Characters

Python 3 handles supplementary plane characters transparently:

# len() counts code points in Python 3
len("😀")      # 1 — correct, it's one code point
len("😀🎉")    # 2

# Iterating gives code points
for char in "Hi 😀":
    print(repr(char))
# 'H', 'i', ' ', '😀'

# But grapheme clusters still require external tools
# for combining marks and ZWJ sequences:
text = "👨‍💻"   # man + ZWJ + laptop computer
len(text)         # 3 — code points (man, ZWJ, laptop)
# Not 1 — Python doesn't do grapheme segmentation natively

# For grapheme-aware operations, use the 'grapheme' package:
# pip install grapheme
import grapheme

grapheme.length("👨‍💻")                  # 1
grapheme.length("café")                  # 4
list(grapheme.graphemes("Hi 😀"))        # ['H', 'i', ' ', '😀']

Regular Expressions with Unicode

Python's re module is Unicode-aware for str patterns by default:

import re

# \w matches Unicode letters and digits
re.match(r'^\w+$', 'café')    # matches
re.match(r'^\w+$', '名前')     # matches
re.match(r'^\d+$', '١٢٣')    # matches — Arabic-Indic digits

# For Unicode property escapes (\p{}), use the 'regex' package:
import regex

regex.match(r'^\p{L}+$', 'café')               # matches — any letters
regex.match(r'^\p{Script=Latin}+$', 'hello')   # matches
regex.findall(r'\X', 'café')  # ['c', 'a', 'f', 'é'] — grapheme clusters
regex.findall(r'\X', '👨‍💻')  # ['👨‍💻'] — full ZWJ sequence

# re.UNICODE flag is default for str; use re.ASCII to opt out:
re.match(r'^\w+$', 'café', re.ASCII)   # no match — ASCII only

Practical: Robust Text Processing Pipeline

import unicodedata
import re
from typing import Optional

def process_user_text(
    raw_input: str,
    max_length: Optional[int] = None,
    allow_emoji: bool = True,
) -> str:
    """
    Sanitize and normalize user-provided text.
    """
    # 1. NFC normalization
    text = unicodedata.normalize('NFC', raw_input)

    # 2. Strip control characters (except newlines/tabs)
    # Note: Cf includes ZWJ (U+200D), so keep it when emoji are allowed,
    # or ZWJ sequences like 👨‍💻 would be split apart
    text = ''.join(
        c for c in text
        if unicodedata.category(c) not in ('Cc', 'Cf', 'Cs')
        or c in ('\n', '\r', '\t')
        or (allow_emoji and c == '\u200d')
    )

    # 3. Optionally strip emoji (crude: drops every Symbol category —
    # So, Sm, Sk, Sc — which covers emoji but also math and currency signs)
    if not allow_emoji:
        text = ''.join(
            c for c in text
            if not unicodedata.category(c).startswith('S')
        )

    # 4. Collapse multiple whitespace
    text = re.sub(r'[ \t]+', ' ', text)
    text = text.strip()

    # 5. Truncate by code point count
    if max_length and len(text) > max_length:
        text = text[:max_length]

    return text
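One caveat for step 5: truncating by code point count can cut through a combining sequence or a ZWJ emoji, leaving a dangling mark or joiner. A minimal sketch of a cleanup pass (truncate_clean is an illustrative name):

```python
import unicodedata

def truncate_clean(text: str, n: int) -> str:
    """Truncate to n code points, then drop any trailing combining
    marks (Mn) or format characters such as ZWJ (Cf) left by the cut."""
    if len(text) <= n:
        return text
    cut = text[:n]
    while cut and unicodedata.category(cut[-1]) in ('Mn', 'Cf'):
        cut = cut[:-1]
    return cut

truncate_clean("hi 👨\u200d💻", 5)   # 'hi 👨' — no dangling ZWJ
```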

Use the SymbolFYI Encoding Converter to explore the UTF-8 byte sequences for any character, and the Character Counter to analyze the Unicode properties of input text.


Next in Series: Unicode in URLs: Percent-Encoding, Punycode, and IRIs — how Unicode characters survive (or don't) in URLs, how internationalized domain names work, and the Python/JavaScript functions to handle URL encoding correctly.
