Unicode and Typography Glossary

ARIA Label

An HTML attribute providing accessible names for elements, critical for making symbol-heavy UIs usable with screen readers.

Web & HTML

ASCII

American Standard Code for Information Interchange — a 7-bit encoding for 128 characters including English letters, digits, and control characters.

Encoding

Alt Code

A method on Windows to type characters by holding Alt and entering a numeric code on the numpad (e.g., Alt+0169 for ©).

Input Methods

Alt Text for Symbols

Best practices for providing accessible text alternatives for decorative and meaningful symbol characters.

Accessibility

Basic Multilingual Plane (BMP)

The first 65,536 code points of Unicode (U+0000 to U+FFFF), containing the most commonly used characters.

Unicode Standard

Bidirectional Text (Bidi)

Text that mixes left-to-right and right-to-left writing directions, requiring the Unicode Bidirectional Algorithm for proper display.

Unicode Standard

Box Drawing Characters

Unicode characters (U+2500–U+257F) designed to draw boxes and tables in text-based interfaces and terminal emulators.

Typography

Braille Patterns

256 Unicode characters (U+2800–U+28FF) representing all possible combinations of 8-dot Braille cells.

Accessibility

Bullet Character

A typographic symbol (•, U+2022) used for list items and decorative purposes in text.

Typography

Byte Order Mark (BOM)

A special Unicode character (U+FEFF) at the start of a file indicating its byte order and encoding format.

Encoding
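
A minimal sketch of how a UTF-8 BOM behaves when decoding, using only Python's standard codecs. The sample bytes are made up for illustration.

```python
import codecs

# Sample data: a UTF-8 BOM followed by ASCII text (illustrative only).
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM if one is present.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

Reading files with "utf-8-sig" is a common way to tolerate input both with and without a BOM.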

CJK

Abbreviation for Chinese, Japanese, and Korean — refers to the unified set of ideographic characters shared across these writing systems.

Unicode Standard

CSS content Property

A CSS property that inserts content before or after an element, commonly used with Unicode escape sequences (e.g., content: '\2713').

Web & HTML

Character Reference

HTML markup for inserting a character by number (&#9731; or &#x2603;) or by name (&copy;), used for special or reserved characters.

Web & HTML
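
As a rough illustration, Python's standard html module can resolve decimal, hexadecimal, and named character references, and can escape reserved characters in the other direction. The variable names are illustrative.

```python
import html

# Decimal and hexadecimal references to the same code point (U+2603).
decimal_ref = html.unescape("&#9731;")
hex_ref = html.unescape("&#x2603;")

# A named reference (U+00A9, copyright sign).
named_ref = html.unescape("&copy;")

# Escaping reserved characters for safe inclusion in HTML text.
escaped = html.escape("a < b & c")

print(decimal_ref == hex_ref)
print(escaped)  # a &lt; b &amp; c
```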

Character Set (Charset)

A defined set of characters recognized by a computing system. Often used interchangeably with 'encoding' though technically different.

Encoding

Code Point

A numerical value in the Unicode standard that maps to a specific character, written as U+ followed by hexadecimal digits (e.g., U+0041 for 'A').

Unicode Standard
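
A short sketch of converting between characters and code points in Python. The helper name format_code_point is hypothetical.

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


print(format_code_point("A"))  # U+0041
print(chr(0x41))               # A
```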

Code Point vs Character vs Glyph

Understanding the three abstraction levels: a code point (number), a character (abstract), and a glyph (visual rendering).

Programming & Dev

Combining Character

A Unicode character that modifies the preceding base character, such as accents and diacritical marks.

Unicode Standard

Common Locale Data Repository (CLDR)

A project providing locale-specific formatting rules for dates, currencies, and language names used worldwide.

Unicode Standard

Compose Key

A key on Linux/Unix systems that starts a multi-key sequence to produce special characters (e.g., Compose + c + o → ©).

Input Methods

Confusables (Homoglyphs)

Characters that look similar or identical but have different code points (e.g., Latin 'A' U+0041 vs Cyrillic 'А' U+0410).

Unicode Standard
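
One way to tell confusable characters apart is to print their code points and official names. A small sketch using the standard unicodedata module:

```python
import unicodedata

latin_a = "\u0041"     # Latin capital A
cyrillic_a = "\u0410"  # Cyrillic capital A, visually near-identical

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two strings look alike but do not compare equal.
print(latin_a == cyrillic_a)  # False
```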

Curly Quotes (Smart Quotes)

Typographically correct quotation marks (“ ” ‘ ’) as opposed to straight/typewriter quotes (" ').

Typography

Dead Keys

Keyboard keys that don't produce a character immediately but modify the next keystroke (e.g., ´ then e → é).

Input Methods

Diacritical Mark

A mark added to a letter to change its pronunciation or meaning (e.g., acute accent é, umlaut ü, tilde ñ).

Typography

Ellipsis

A single character (…, U+2026) representing three dots, preferred over typing three periods.

Typography

Em Dash

A typographic dash (—, U+2014) the width of a letter M, used for parenthetical statements and breaks in thought.

Typography

Emoji

Pictographic symbols defined in Unicode, originally from Japanese mobile phones, now a universal visual communication system.

Unicode Standard

En Dash

A typographic dash (–, U+2013) the width of a letter N, used for ranges (e.g., 1–10) and relationships.

Typography

Encoding Detection

Techniques for detecting the character encoding of text files, including BOM sniffing, heuristics, and chardet libraries.

Programming & Dev
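
A simplified sketch of BOM sniffing with a heuristic fallback. The function name guess_encoding and the choice of cp1252 as the last resort are assumptions for illustration; real detectors weigh far more evidence.

```python
import codecs

# BOM signatures, UTF-32 first so that FF FE 00 00 is not read as UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Guess a codec name: BOM first, then a strict UTF-8 trial decode."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: a common legacy single-byte encoding.
        return "cp1252"


print(guess_encoding(b"\xef\xbb\xbfhi"))
print(guess_encoding(b"\xe9"))
```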

Font Fallback

A mechanism where the browser uses alternative fonts when the primary font lacks a glyph for a character.

Typography

Fullwidth & Halfwidth

Character variants occupying different widths in CJK typography. Fullwidth characters occupy the same space as CJK ideographs.

Typography
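
A brief sketch of inspecting width classes and folding fullwidth forms to their ordinary equivalents with the standard unicodedata module. NFKC folding is one common approach, not the only one.

```python
import unicodedata

fullwidth_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A
halfwidth_ka = "\uff76"   # HALFWIDTH KATAKANA LETTER KA

# East Asian Width property: 'F' = fullwidth, 'H' = halfwidth.
width_class = unicodedata.east_asian_width(fullwidth_a)
half_class = unicodedata.east_asian_width(halfwidth_ka)

# NFKC maps compatibility variants to their standard forms.
folded = unicodedata.normalize("NFKC", fullwidth_a)

print(width_class, half_class, folded)
```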

General Category

A Unicode property that classifies each character (e.g., Lu = Uppercase Letter, Sm = Math Symbol, So = Other Symbol).

Unicode Standard

Glyph

The visual representation of a character as rendered by a specific font. One character can have multiple glyphs across different fonts.

Typography

Grapheme Cluster

A user-perceived character that may consist of multiple code points (e.g., a base character + combining marks, or a flag emoji).

Unicode Standard

Grapheme Segmentation (UAX #29)

The Unicode algorithm for splitting text into user-perceived characters, handling emoji sequences, combining marks, etc.

Programming & Dev

HTML Entity

A string that begins with & and ends with ; used to display reserved or special characters in HTML (e.g., &amp; for &, &copy; for ©).

Web & HTML

IDN Homograph Attack

A phishing technique using visually similar Unicode characters in domain names to impersonate legitimate sites.

Programming & Dev

Input Method Editor (IME)

Software that enables typing characters not directly available on a keyboard, essential for CJK, Arabic, and other complex scripts.

Input Methods

Internationalized Domain Name (IDN)

A domain name containing non-ASCII characters, encoded via Punycode for DNS compatibility (e.g., münchen.de → xn--mnchen-3ya.de).

Web & HTML
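
The conversion in the example above can be reproduced with Python's built-in codecs. Note that the standard "idna" codec implements the older IDNA 2003 rules, so treat this as a sketch rather than a complete IDN implementation.

```python
# Encode a whole domain name to its ASCII-compatible form.
ascii_form = "münchen.de".encode("idna")

# Decode it back to Unicode.
unicode_form = ascii_form.decode("idna")

# The raw Punycode of a single label, without the "xn--" prefix.
label = "münchen".encode("punycode")

print(ascii_form)  # b'xn--mnchen-3ya.de'
print(label)       # b'mnchen-3ya'
```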

JavaScript String & Code Points

JavaScript string methods for Unicode: codePointAt(), String.fromCodePoint(), and the spread operator, which iterates by code point rather than by UTF-16 code unit (it does not split text into grapheme clusters).

Programming & Dev

Kerning

The adjustment of spacing between specific character pairs for improved visual appearance (e.g., AV, To).

Typography

Latin-1 (ISO 8859-1)

A single-byte encoding for Western European languages covering 256 characters (U+0000–U+00FF).

Encoding

Letter Spacing (Tracking)

Uniform adjustment of space between all characters in a block of text, distinct from kerning.

Typography

Ligature

A single glyph combining two or more characters (e.g., fi, fl). Can be typographic (font feature) or Unicode characters.

Typography

Mathematical Alphanumeric Symbols

Unicode block (U+1D400–U+1D7FF) containing styled letters and digits used in mathematical notation (bold, italic, script, etc.).

Unicode Standard

Mojibake

Garbled text that results from decoding data with the wrong character encoding. Common when mixing Latin-1 and UTF-8.

Encoding
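
A small sketch that reproduces the Latin-1/UTF-8 mix-up and then reverses it. The round-trip repair only works when no bytes were lost along the way.

```python
original = "café"

# UTF-8 bytes misread as Latin-1: the two bytes of "é" become two characters.
garbled = original.encode("utf-8").decode("latin-1")

# Undo the damage by reversing the wrong step, then decoding correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # cafÃ©
print(repaired)  # café
```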

Non-Breaking Space

A space character (U+00A0) that prevents automatic line breaking at its position, keeping adjacent words together.

Typography

Private Use Area

Ranges of Unicode code points (U+E000–U+F8FF, etc.) reserved for custom characters defined by font vendors or applications.

Unicode Standard

Punycode

An encoding syntax for representing Unicode strings with ASCII characters, used in Internationalized Domain Names.

Web & HTML

Python unicodedata Module

Python standard library module for looking up Unicode character names, categories, and properties.

Programming & Dev
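
A few representative lookups with the module, as a quick sketch:

```python
import unicodedata

# Look up a character by its official name.
snowman = unicodedata.lookup("SNOWMAN")

# Get the name and General Category of a character.
name = unicodedata.name("\u00e9")
category = unicodedata.category("\u00e9")  # 'Ll' = lowercase letter

# Numeric value of a character that represents a number.
value = unicodedata.numeric("\u00bd")  # vulgar fraction one half

print(snowman, name, category, value)
```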

Regex Unicode Support

Using Unicode-aware regular expressions: the /u flag in JavaScript, and in Python 3 the default behavior for str patterns (re.UNICODE is redundant there; re.ASCII opts out).

Programming & Dev
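
A sketch of the difference between Unicode-aware and ASCII-only matching in Python's built-in re module. The sample text is illustrative.

```python
import re

text = "naïve café 123"

# For str patterns in Python 3, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café', '123']
print(ascii_words)    # ['na', 've', 'caf', '123']
```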

Replacement Character

The diamond-question mark character (U+FFFD, �) displayed when a decoder encounters an invalid or unrecognizable byte sequence.

Encoding
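
A minimal sketch of how a decoder substitutes U+FFFD for bytes it cannot interpret. The input bytes are an illustrative Latin-1 string fed to a UTF-8 decoder.

```python
# "café" encoded as Latin-1; the final byte 0xE9 is not valid UTF-8 here.
bad = b"caf\xe9"

# errors="replace" substitutes U+FFFD instead of raising UnicodeDecodeError.
decoded = bad.decode("utf-8", errors="replace")

print(repr(decoded))
```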

Screen Reader

Assistive technology that reads text and UI elements aloud. Unicode character names are used for symbol pronunciation.

Accessibility

Script

A Unicode property indicating which writing system a character belongs to (e.g., Latin, Greek, Common, Inherited).

Unicode Standard

Soft Hyphen

A character (U+00AD) that marks where a word may be hyphenated at a line break. It is rendered as a hyphen only when the break occurs there and is invisible otherwise.

Typography

String Length vs Character Count

Why str.length in JavaScript returns UTF-16 code units, not visual characters — and how to count graphemes correctly.

Programming & Dev
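
To make the distinction concrete, here is a sketch in Python that counts the same string three ways. The helper utf16_length is hypothetical; it computes the number JavaScript's length property would report. Counting grapheme clusters needs a UAX #29 implementation, which Python's standard library does not include.

```python
def utf16_length(s: str) -> int:
    """Number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2


# Thumbs-up emoji followed by a skin tone modifier: one visible symbol.
s = "\U0001F44D\U0001F3FD"

code_points = len(s)               # Python counts code points
code_units = utf16_length(s)       # UTF-16 code units
utf8_bytes = len(s.encode("utf-8"))

print(code_points, code_units, utf8_bytes)  # 2 4 8
```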

Surrogate Pair

A pair of 16-bit code units in UTF-16 that together represent a single character outside the Basic Multilingual Plane (above U+FFFF).

Encoding
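
The arithmetic behind a surrogate pair can be sketched as follows. The function name to_surrogates is hypothetical.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```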

Tofu (Missing Glyph)

The empty rectangle (□) displayed when a font cannot render a character, named for its tofu-like appearance.

Typography

URL Encoding (Percent-Encoding)

A method of encoding special characters in URLs by replacing them with % followed by two hex digits of their UTF-8 byte values.

Web & HTML
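
A short sketch using the standard urllib.parse module, which percent-encodes the UTF-8 bytes of non-ASCII characters by default. The sample string is illustrative.

```python
from urllib.parse import quote, unquote

# "é" is C3 A9 in UTF-8; the space becomes %20.
encoded = quote("café au lait")
decoded = unquote(encoded)

print(encoded)  # caf%C3%A9%20au%20lait
print(decoded)  # café au lait
```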

UTF-16

A character encoding that uses 2 or 4 bytes per character. Used internally by JavaScript and Java.

Encoding

UTF-32

A fixed-width encoding using 4 bytes per character, simple but memory-intensive.

Encoding

UTF-8

A variable-width character encoding that uses 1 to 4 bytes to represent Unicode code points. The dominant encoding on the web.

Encoding
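
The variable width is easy to observe by encoding one character from each length class. A minimal sketch:

```python
# One sample per UTF-8 length: ASCII, Latin-1 range, BMP symbol, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```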

Unicode

A universal character encoding standard that assigns a unique number (code point) to every character across all writing systems.

Unicode Standard

Unicode Block

A contiguous range of code points defined by the Unicode standard, grouping related characters (e.g., 'Arrows' block: U+2190–U+21FF).

Unicode Standard

Unicode Collation

Sorting text according to language-specific rules using the Unicode Collation Algorithm (UCA, UTS #10).

Programming & Dev

Unicode Consortium

The non-profit organization that develops and maintains the Unicode Standard, adding new characters in annual releases.

Unicode Standard

Unicode Escape Sequence

A way to represent characters by their code point in programming languages (\u2603 in JS/Java, \u{2603} in ES6+, \U00002603 in Python).

Encoding

Unicode Hex Input

A macOS keyboard layout that allows typing characters by their hex code point (hold Option + type hex code).

Input Methods

Unicode Normalization

The process of converting Unicode text to a standard form (NFC, NFD, NFKC, NFKD) to ensure consistent comparison and storage.

Encoding
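
A sketch of why normalization matters for comparison, using the standard unicodedata module:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The two render identically but are different code point sequences.
print(composed == decomposed)  # False

nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", composed)     # decompose

# Compatibility forms also fold variants such as the fi ligature (U+FB01).
nfkc = unicodedata.normalize("NFKC", "\ufb01")

print(nfc == composed, nfd == decomposed, nfkc)
```

Normalizing both sides to the same form before comparing or storing avoids mismatches between visually identical strings.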

Unicode Plane

A group of 65,536 consecutive code points. Unicode has 17 planes (0–16), with Plane 0 being the BMP.

Unicode Standard
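
Because each plane holds 65,536 (2^16) code points, the plane number is just the code point shifted right by 16 bits. A sketch, with a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane (0-16) that a character belongs to."""
    return ord(ch) >> 16


print(plane_of("A"))           # 0 (BMP)
print(plane_of("\U0001F600"))  # 1
```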

Unicode Property Escapes (\p{})

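Regex syntax (\p{Script=Greek}, \p{Letter}) that matches characters by Unicode properties. Supported in JavaScript (with the /u flag) and Java; Python's built-in re module does not support it, but the third-party regex module does.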

Programming & Dev

Unicode Sandwich Pattern

A programming best practice: decode bytes → process text as Unicode → encode bytes. Keeps Unicode in the middle.

Programming & Dev
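
The pattern can be sketched as a function that touches bytes only at its edges. The function name shout and the uppercase step are placeholders for whatever processing an application does.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode at the input boundary, work on str, encode at the output."""
    text = raw.decode(encoding)    # bytes -> str (bottom slice of bread)
    text = text.upper()            # all processing happens on Unicode text
    return text.encode(encoding)   # str -> bytes (top slice of bread)


result = shout("café".encode("utf-8"))
print(result.decode("utf-8"))  # CAFÉ
```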

Unicode Version

Numbered releases of the Unicode Standard (e.g., 16.0), each adding new characters, scripts, and emoji.

Unicode Standard

Unicode in URLs & IRIs

How Unicode characters in URLs are handled: IRI (RFC 3987), percent-encoding of UTF-8 bytes, and browser display.

Programming & Dev

Unihan Database

A comprehensive database of CJK ideographs with readings, meanings, and variant information maintained by the Unicode Consortium.

Unicode Standard

Variation Selector

Unicode characters (U+FE00–U+FE0F) that modify the appearance of the preceding character, including text vs emoji presentation.

Unicode Standard

WCAG Text Alternatives

WCAG Success Criterion 1.1.1 (Non-text Content), which requires text alternatives for non-text content, including symbols and icons.

Accessibility

Web Fonts (@font-face)

Custom fonts loaded via CSS @font-face rules, enabling rich typography beyond system-installed fonts.

Web & HTML

Whitespace Characters

Characters that represent horizontal or vertical space (space, tab, newline, etc.) but have no visible glyph.

Typography

Windows Emoji Panel

A Windows utility (Win+. or Win+;) for browsing and inserting emoji and special characters.

Input Methods

Windows-1252

An extension of Latin-1 used by default in legacy Windows applications. It assigns printable characters (such as € and curly quotes) to most of the 0x80–0x9F range, where Latin-1 has control codes.

Encoding
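
The difference in the 0x80–0x9F range shows up when the same bytes are decoded with each codec. A minimal sketch:

```python
# Byte 0x80: euro sign in Windows-1252, a C1 control code in Latin-1.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin1 = b"\x80".decode("latin-1")

# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252.
quoted = b"\x93hi\x94".decode("cp1252")

print(repr(euro_cp1252), repr(euro_latin1), quoted)
```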

Zero-Width Joiner (ZWJ)

An invisible character (U+200D) that joins adjacent characters, commonly used in emoji sequences to create combined emoji.

Unicode Standard
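
A sketch of how a ZWJ sequence is assembled from individual emoji. Whether it renders as a single glyph depends on font and platform support.

```python
ZWJ = "\u200d"

# Man + ZWJ + woman + ZWJ + girl: a family emoji sequence.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

# Splitting on the joiner recovers the component emoji.
components = family.split(ZWJ)

print(len(family))      # 5 code points
print(len(components))  # 3 emoji
```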

Zero-Width Space

An invisible Unicode character (U+200B) that indicates a possible line break point without displaying any visible space.

Typography

macOS Character Viewer

A built-in macOS utility (Ctrl+Cmd+Space) for browsing and inserting Unicode characters and emoji.

Input Methods

unicode-range (CSS)

A CSS descriptor that specifies the range of Unicode code points a web font covers, enabling font subsetting.

Web & HTML