SymbolFYI

Character Set (Charset)

Encoding
Definition

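A defined set of characters recognized by a computing system. The term is often used interchangeably with 'encoding', though the two differ: a character set names which characters exist, while an encoding specifies how those characters are represented as bytes.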

The charset parameter is a declaration in HTTP headers and HTML markup that tells a browser or parser which character encoding to use when converting raw bytes into text. Without this declaration, software must guess the encoding, which often leads to garbled text (mojibake). Explicit charset declarations are a foundational part of correct web content delivery.
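The effect of a wrong guess can be reproduced in a few lines. This is a minimal sketch: the sample word and the choice of windows-1252 as the wrongly assumed encoding are illustrative assumptions, not taken from any particular browser.

```python
# Text correctly encoded as UTF-8 by the sender
original = 'café'
raw = original.encode('utf-8')          # b'caf\xc3\xa9'

# A receiver that wrongly assumes windows-1252 produces mojibake:
# the two bytes of 'é' are read as two separate characters
garbled = raw.decode('windows-1252')
print(garbled)

# The damage is reversible only if the wrong guess is known
repaired = garbled.encode('windows-1252').decode('utf-8')
print(repaired)
```

The bytes never change here. Only the declared (or guessed) encoding decides what text they become.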

Where charset Appears

HTTP Content-Type Header

The most authoritative place to declare encoding is the HTTP response header:

Content-Type: text/html; charset=utf-8
Content-Type: text/plain; charset=windows-1252
Content-Type: application/json; charset=utf-8

When present, the HTTP header charset takes precedence over a <meta> declaration inside the HTML document. Only a byte order mark (BOM) at the start of the document overrides it.
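Header values like the ones above can be parsed with the Python standard library instead of splitting strings by hand. The sketch below is one possible approach; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the lowercased charset parameter, or None if absent."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() handles quoting and letter case
    return msg.get_content_charset()

for value in ('text/html; charset=utf-8',
              'text/plain; charset=windows-1252',
              'application/json'):
    print(value, '->', charset_from_content_type(value))
```

A header without a charset parameter yields None, which leaves the caller to apply whatever fallback is appropriate.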

HTML <meta> Tag

In HTML5, the <meta charset> tag provides an in-document fallback:

<!DOCTYPE html>
<html lang='en'>
<head>
  <meta charset='utf-8'>
  <title>Page Title</title>
</head>

The <meta charset> element must appear within the first 1,024 bytes of the document so that the browser can determine the encoding before parsing any further content. The current HTML standard requires documents to be encoded as UTF-8, so charset='utf-8' is the only valid value.

The older HTML4 syntax is still valid but verbose:

<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>
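The 1,024-byte rule can be illustrated with a rough scan for either <meta> form. This is a simplified sketch, not the prescan algorithm browsers implement (it ignores comments, scripts, and malformed attributes); the function name and regular expression are assumptions made for illustration.

```python
import re

# Matches both <meta charset=...> and the http-equiv form (simplified)
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)

def sniff_meta_charset(raw, limit=1024):
    """Look for a charset declaration in the first `limit` bytes only."""
    match = META_CHARSET_RE.search(raw[:limit])
    return match.group(1).decode('ascii').lower() if match else None

early = b"<!DOCTYPE html><html><head><meta charset='utf-8'>"
late = b'<!-- ' + b'x' * 2000 + b" --><meta charset='utf-8'>"
print(sniff_meta_charset(early))
print(sniff_meta_charset(late))
```

The second document declares the same encoding, but the declaration sits beyond the scanned window and is therefore not found.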

XML Declaration

XML documents can declare their encoding in the XML declaration on the first line (it resembles a processing instruction but is formally a separate construct):

<?xml version='1.0' encoding='utf-8'?>

For UTF-8 and UTF-16 (with BOM), the XML declaration is optional but recommended. Any other encoding must be declared, either here or by an external source such as the HTTP header.
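The following sketch shows a parser honoring the declaration, using Python's xml.etree.ElementTree. The sample document and the helper name parse_text are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def parse_text(doc):
    """Return the root element's text, or None if the bytes do not parse."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None

# The byte 0xE9 is 'é' in ISO-8859-1 but is not valid UTF-8 by itself
declared = b"<?xml version='1.0' encoding='iso-8859-1'?><word>caf\xe9</word>"
undeclared = b"<word>caf\xe9</word>"

print(parse_text(declared))    # decoded using the declared encoding
print(parse_text(undeclared))  # parser assumes UTF-8 and rejects the byte
```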

MIME Charset Names

Charset names used in HTTP are drawn from the IANA Character Sets registry; browsers resolve the labels they accept according to the WHATWG Encoding Standard. Common names:

MIME Name       Encoding
utf-8           UTF-8
utf-16          UTF-16 with BOM
iso-8859-1      Latin-1 (treated as Windows-1252 by browsers)
windows-1252    Windows-1252
euc-kr          Korean (EUC-KR)
shift_jis       Japanese (Shift-JIS)

Names are case-insensitive: UTF-8, utf-8, and Utf-8 are equivalent.
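Python's codecs module shows the same case-insensitive lookup. Note the assumption: Python keeps its own codec registry, which is similar to but not identical with the IANA names or the labels browsers accept. In particular, Python keeps iso-8859-1 and windows-1252 separate, which makes the browser substitution visible. The helper name same_encoding is hypothetical.

```python
import codecs

def same_encoding(label_a, label_b):
    """True if both labels resolve to the same Python codec."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name

print(same_encoding('UTF-8', 'utf-8'))
print(same_encoding('Utf-8', 'utf-8'))
print(same_encoding('iso-8859-1', 'windows-1252'))

# Byte 0x80 is where the two differ: a control character in Latin-1,
# the euro sign in Windows-1252
print(repr(b'\x80'.decode('iso-8859-1')))
print(repr(b'\x80'.decode('windows-1252')))
```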

Priority Order for HTML Encoding Detection

Browsers follow a defined priority order when determining encoding:

  1. Byte Order Mark (BOM) at start of document (highest priority)
  2. HTTP Content-Type: charset
  3. <meta charset> or <meta http-equiv='Content-Type'> pragma
  4. Browser sniffing and locale default (lowest priority)

A manual user override, where the browser offers one, outranks everything except the BOM.
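A priority order of this kind can be sketched as a small function. The version below follows the order in the WHATWG HTML encoding sniffing algorithm, where a BOM outranks the HTTP header. It is a simplification: the function name, the return shape, and the windows-1252 default are assumptions, and real browsers add further steps such as user overrides and content heuristics.

```python
import codecs

# BOM byte sequences and the encodings they signal
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16le'),
    (codecs.BOM_UTF16_BE, 'utf-16be'),
]

def choose_encoding(raw, http_charset=None, meta_charset=None,
                    default='windows-1252'):
    """Return (encoding, source) using BOM > HTTP > meta > default."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, 'bom'
    if http_charset:
        return http_charset.lower(), 'http'
    if meta_charset:
        return meta_charset.lower(), 'meta'
    return default, 'default'

body = b"<meta charset='windows-1252'>"
print(choose_encoding(body, http_charset='UTF-8', meta_charset='windows-1252'))
print(choose_encoding(codecs.BOM_UTF8 + body, http_charset='iso-8859-1'))
```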

Accessing charset in Code

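import urllib.request

with urllib.request.urlopen('https://example.com') as response:
    # charset parameter of the Content-Type header, lowercased
    charset = response.headers.get_content_charset()
    print(charset)  # 'utf-8' (or None if not declared)
    # Fall back to UTF-8 when the server declares no charset
    html = response.read().decode(charset or 'utf-8', errors='replace')

// Fetch API: the charset is embedded in the Content-Type header value
const response = await fetch('https://example.com');
const contentType = response.headers.get('content-type') ?? '';
console.log(contentType);  // e.g. 'text/html; charset=utf-8'
// Extract the charset parameter; null if the header does not declare one
const match = contentType.match(/charset\s*=\s*"?([^";\s]+)/i);
const charset = match ? match[1].toLowerCase() : null;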

Best Practice

Always declare charset=utf-8 explicitly in both the HTTP header and the <meta charset> tag. Browser sniffing is unreliable and can introduce security vulnerabilities: specially crafted content can trick some sniffing heuristics into misinterpreting the encoding, which enables cross-site scripting attacks (the UTF-7 XSS vector is a historical example).
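As one way to apply this on the server side, the sketch below is a minimal WSGI application that declares UTF-8 in both places. The page content and the names PAGE and app are illustrative assumptions; any web framework can achieve the same thing.

```python
PAGE = """<!DOCTYPE html>
<html lang='en'>
<head>
  <meta charset='utf-8'>
  <title>Page Title</title>
</head>
<body>café</body>
</html>
"""

def app(environ, start_response):
    """Serve PAGE with matching HTTP and in-document declarations."""
    body = PAGE.encode('utf-8')  # bytes must match the declared encoding
    start_response('200 OK', [
        ('Content-Type', 'text/html; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

For a local trial, the standard library's wsgiref.simple_server.make_server can serve this app. The key point is that the bytes sent, the header, and the <meta> tag all agree.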
