SymbolFYI

Replacement Character

Encoding
Định nghĩa

The diamond-question mark character (U+FFFD, �) displayed when a decoder encounters an invalid or unrecognizable byte sequence.

The replacement character U+FFFD (displayed as a diamond with a question mark) is a special Unicode character used as a substitute when a decoder encounters a byte sequence it cannot interpret as a valid character. It acts as a visible marker of a decoding error, preventing malformed input from causing crashes or security vulnerabilities while still allowing text processing to continue.

When It Appears

The replacement character is inserted by decoders in several situations:

  • Invalid byte sequences: bytes that do not conform to the encoding's rules (e.g., a lone 0xFF in UTF-8)
  • Truncated sequences: a multi-byte sequence that ends before the expected number of bytes
  • Unexpected continuation bytes: a UTF-8 continuation byte (10xxxxxx) without a valid leading byte
  • Lone surrogates: a UTF-16 high or low surrogate code unit without its partner, when converting to UTF-8
  • Code points out of range: values above U+10FFFF
  • Encoding mismatch: bytes decoded with the wrong encoding (though this usually produces mojibake rather than replacement characters)

UTF-8 Replacement Rules

The Unicode standard specifies that each maximal subpart of an ill-formed byte sequence should be replaced by exactly one U+FFFD. This 'maximal subpart' rule is important: it means one bad byte becomes one replacement character, preventing an attacker from using overlong sequences to obscure malicious input.

Input bytes (hex): 61 80 62
                   'a' (invalid) 'b'
Output:            a [U+FFFD] b
Input bytes (hex): f0 28 8c bc
                   (start 4-byte seq) (invalid 2nd byte) ...
Output:            [U+FFFD] ( [U+FFFD] [U+FFFD]

Decoding Behavior in Code

# Python: 'replace' error handler substitutes U+FFFD
bad_bytes = b'Hello \x80 World'
print(bad_bytes.decode('utf-8', errors='replace'))
# 'Hello [replacement char] World'

# Default 'strict' mode raises an exception
try:
    bad_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x80 ...

# 'ignore' silently drops invalid bytes
print(bad_bytes.decode('utf-8', errors='ignore'))
# 'Hello  World'
// JavaScript TextDecoder: fatal mode (throws) vs lenient mode (replaces)
const badBytes = new Uint8Array([72, 101, 108, 108, 111, 0x80]);

const lenient = new TextDecoder('utf-8');
console.log(lenient.decode(badBytes));  // 'Hello' + replacement char

const strict = new TextDecoder('utf-8', { fatal: true });
try {
  strict.decode(badBytes);
} catch (e) {
  console.log(e.name);  // TypeError
}

Security Considerations

The replacement character strategy has important security implications. Naive decoders that skip or collapse invalid sequences could allow an attacker to bypass security checks by encoding a dangerous string (like ../) in a way that appears safe but decodes to the dangerous form for a second, lenient decoder. The maximal subpart replacement rule ensures consistent one-to-one substitution across conformant implementations.

Always choose a strategy explicitly: use errors='replace' when you must process all input regardless of quality, errors='strict' when invalid input should be rejected, and errors='ignore' only when you have verified that silently dropping bytes is acceptable.

Ký hiệu liên quan

Thuật ngữ liên quan

Công cụ liên quan