SymbolFYI

Unicode Sandwich Pattern

Programming & Dev
Definition

A programming best practice: decode bytes to Unicode on input, process text as Unicode, and encode back to bytes on output. Unicode stays in the middle; bytes stay at the edges.

Unicode Sandwich

The Unicode sandwich is a software design pattern for correctly handling text encoding in programs that interact with the outside world. The name describes the structure: raw bytes on the outside (the bread), with pure Unicode strings in the middle (the filling). The principle was popularized in the Python community, particularly in Ned Batchelder's talk "Pragmatic Unicode" and the Python 3 documentation.

The Core Principle

Text encoding errors cluster at input and output boundaries. The pattern has three rules:

  1. Decode bytes to Unicode as early as possible, at the point where data enters the program
  2. Process everything as Unicode internally, and never manipulate raw bytes as if they were text
  3. Encode Unicode back to bytes as late as possible, only when writing to a file, network socket, or other byte-oriented destination

[External World: bytes]  →  decode  →  [Program: Unicode str]  →  encode  →  [External World: bytes]
     files, network,                     all internal logic                      files, network,
     databases, APIs                                                              databases, APIs
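
The three rules can be sketched on in-memory data before any files or sockets are involved. This is a minimal illustration, not a prescribed API: the names `normalize_name` and `handle` are hypothetical, and UTF-8 is assumed at both boundaries.

```python
def normalize_name(text: str) -> str:
    """Filling: works on str only and knows nothing about encodings."""
    # Collapse runs of whitespace, then capitalize each word.
    return " ".join(text.split()).title()


def handle(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Bread on both sides: bytes in, bytes out."""
    text = raw.decode(encoding)      # rule 1: decode as early as possible
    result = normalize_name(text)    # rule 2: process as Unicode
    return result.encode(encoding)   # rule 3: encode as late as possible


output = handle("  zoë   müller ".encode("utf-8"))
print(output.decode("utf-8"))
```

Only `handle` mentions an encoding, so the processing function can be reused unchanged with input that arrives in a different encoding.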

Why It Matters

Without this discipline, encoding and decoding operations get scattered throughout a codebase. Bytes and strings get mixed, leading to:

  • UnicodeDecodeError and UnicodeEncodeError in unexpected places
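  • Mojibake (garbled text produced by double-encoding or by decoding with the wrong encoding)
  • Security vulnerabilities from encoding confusion, for example when a check inspects the data under one encoding and a later component interprets it under another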
  • Brittle code that works in one locale but fails in another
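
The first two failure modes are easy to reproduce. The following sketch uses the sample string "café" and the UTF-8/Latin-1 pair purely as an illustration; the variable names are made up for this example.

```python
original = "café"
utf8_bytes = original.encode("utf-8")

# Wrong decoding: UTF-8 bytes read as Latin-1 silently produce mojibake.
mojibake = utf8_bytes.decode("latin-1")

# Double-encoding: the already garbled str is encoded as UTF-8 again.
double_encoded = mojibake.encode("utf-8")

# The opposite mistake fails loudly, but only once non-ASCII data shows up:
# Latin-1 bytes are not valid UTF-8 here.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

print(repr(mojibake), double_encoded, failed)
```

Pure-ASCII input passes through all of these steps unchanged, which is why such bugs often surface only in production or in another locale.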

Python Example

import csv
from pathlib import Path

# BREAD (input): decode bytes to str immediately
def read_csv(path: Path) -> list[dict]:
    # Open with an explicit encoding: decode at the boundary.
    # newline='' lets the csv module handle line endings itself.
    with open(path, encoding='utf-8', newline='') as f:
        return list(csv.DictReader(f))

# FILLING (processing): all str, no bytes
def process_records(records: list[dict]) -> list[dict]:
    result = []
    for record in records:
        # Pure Unicode operations — string methods, comparisons, formatting
        name = record['name'].strip().title()
        email = record['email'].lower()
        result.append({'name': name, 'email': email})
    return result

# BREAD (output): encode str back to bytes at the boundary
def write_csv(records: list[dict], path: Path) -> None:
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'email'])
        writer.writeheader()
        writer.writerows(records)

# Main flow
records = read_csv(Path('input.csv'))    # bytes → str at input
processed = process_records(records)      # pure Unicode
write_csv(processed, Path('output.csv')) # str → bytes at output

Applying the Pattern to Network I/O

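import socket

# BREAD (output): encode the Unicode request to bytes at the boundary
request = 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'

with socket.create_connection(('example.com', 80)) as sock:
    sock.sendall(request.encode('utf-8'))  # sendall() keeps sending until every byte is written

    # Collect the whole response before decoding: a single recv() may return
    # only part of it and can end in the middle of a multi-byte character
    chunks = []
    while chunk := sock.recv(4096):
        chunks.append(chunk)

# BREAD (input): decode the response bytes to Unicode at the boundary
# (UTF-8 is assumed here; a real client would honor the charset the server declares)
response = b''.join(chunks).decode('utf-8', errors='replace')

# FILLING: process as Unicode (HTTP header names are case-insensitive)
if 'content-type:' in response.lower():
    print('Found Content-Type header')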
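
When data has to be decoded chunk by chunk as it arrives, a chunk boundary can fall inside a multi-byte character, and decoding each chunk on its own would fail or insert replacement characters. One way to keep the decode step at the boundary is the standard library's incremental decoder. In this sketch, `decode_chunks` is a hypothetical helper and the hand-made `chunks` list stands in for successive `recv()` results.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield str pieces from byte chunks, buffering incomplete characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush at end of stream; raises if the input stopped mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "naïve €5".encode("utf-8")
# Split inside the two-byte "ï" and inside the three-byte "€".
chunks = [data[:3], data[3:9], data[9:]]
text = "".join(decode_chunks(chunks))
print(text)
```

The rest of the program still sees only `str`; the buffering of partial characters stays inside the boundary code.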

Python 3 Enforcement

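Python 3's strict separation of str (Unicode text) and bytes makes the sandwich pattern the natural way to write code: operations that mix the two, such as concatenation, raise a TypeError instead of converting implicitly. When no encoding is given, open() in text mode falls back to a locale-dependent default that can differ between machines, so explicitly passing encoding='utf-8' is a best practice that pins down the boundary for both reading and writing.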
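
A short sketch of what this separation looks like in practice. The helper name `mixing_raises_type_error` is invented for this example. Note that an equality comparison between str and bytes does not raise by default; it simply evaluates to False.

```python
def mixing_raises_type_error() -> bool:
    """Return True if concatenating str and bytes raises TypeError."""
    try:
        "id: " + b"42"  # str + bytes is rejected, not silently converted
    except TypeError:
        return True
    return False


raised = mixing_raises_type_error()

# Comparing str with bytes does not raise by default; it is just never equal.
equal = "abc" == b"abc"

# The fix is to decode at the boundary and then work with str only.
fixed = "id: " + b"42".decode("ascii")

print(raised, equal, fixed)
```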

Benefits

  • Encoding errors are caught immediately at the boundary, where they are easiest to diagnose
  • Internal code is simpler — no need to track which encoding a variable uses
  • Encoding strategy can be changed by modifying only the boundary code
  • Tests for business logic do not need to deal with bytes at all
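
The last two points can be illustrated with a small sketch. The names `normalize_email` and `load_text` are hypothetical, and cp1252 is used only as an example of an alternative boundary encoding.

```python
def normalize_email(email: str) -> str:
    """Business logic: str in, str out."""
    return email.strip().lower()


def load_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Boundary: the only place that knows about the encoding."""
    return raw.decode(encoding)


def test_normalize_email() -> None:
    # No bytes and no encodings appear in a test of the filling.
    assert normalize_email("  Zoë@Example.COM ") == "zoë@example.com"


def test_boundary_encoding_is_configurable() -> None:
    # Switching the encoding touches only the boundary call.
    raw = "Zoë".encode("cp1252")
    assert load_text(raw, encoding="cp1252") == "Zoë"


test_normalize_email()
test_boundary_encoding_is_configurable()
```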
