Code Point vs Character vs Glyph: The Three Levels of Text

Reference May 2, 2023

Text is not as simple as "letters on a screen." Between the moment you press a key and the moment a shape appears in front of you, at least three conceptually distinct things happen. Unicode organizes them into three levels: the abstract character, the numeric code point, and the rendered glyph. Conflating these is one of the most common sources of confusion in text handling, internationalization, and typography.

The Three Layers at a Glance

Level	What it is	Example
Character	Abstract identity — the concept	The letter "A", regardless of language, font, or size
Code point	A number assigned to a character	U+0041
Glyph	A specific visual shape drawn by a font	The way Arial draws U+0041

These three layers are related but not equivalent. The mapping between them is not always one-to-one, and understanding where the mismatches occur is where the real insight lies.

Characters: Abstract Identity

A character in the Unicode sense is an abstract unit of information. It is not a shape, not a number — it is a concept. The character "A" is the idea of the letter A, independent of:

What font it is drawn in
What size it appears at
What language document it appears in
Whether it is bold, italic, or upright

Unicode calls these "abstract characters" to distinguish them from their encoded representations. The Unicode standard defines what characters exist and assigns properties to them: name, category (letter, digit, punctuation...), bidirectionality, case mapping, and hundreds of other attributes.
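As a minimal sketch of what "properties" means in practice, Python's standard unicodedata module exposes a few of them. The helper name describe is an assumption made for this illustration, not part of any library.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a handful of Unicode properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),    # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),   # e.g. L = left-to-right
        "lowercase": ch.lower(),                 # simple case mapping
    }

print(describe("A"))
print(describe("5"))
```

None of these properties says anything about shape: they describe what the character is, not how a font draws it.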

This abstraction is important because it means the same character can have many different visual representations. The character LATIN CAPITAL LETTER A (the abstract thing) looks different in Times New Roman, Helvetica, and a handwritten font — but it is still the same character.

Characters with complex identity

Some characters encode more than pure graphical identity. Consider:

U+0041 LATIN CAPITAL LETTER A — the letter A, uppercase
U+FF21 FULLWIDTH LATIN CAPITAL LETTER A — A in a wide format for CJK contexts
U+1D400 MATHEMATICAL BOLD CAPITAL A — A specifically for use in mathematical notation

These are three different characters — they have different code points, different properties, and different semantic contexts. Their glyphs may look similar, but they carry different information.
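The difference between these three can be checked directly. The sketch below uses Python's unicodedata module; the variable names are chosen for this illustration only. It shows that the characters have distinct names and properties, and that only compatibility normalization (NFKC) folds them into the plain letter.

```python
import unicodedata

variants = ["\u0041", "\uFF21", "\U0001D400"]

for ch in variants:
    # Each variant has its own code point, name, and East Asian width.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.east_asian_width(ch))

# NFC keeps all three distinct; NFKC maps the variants to U+0041.
nfc = [unicodedata.normalize("NFC", ch) for ch in variants]
nfkc = [unicodedata.normalize("NFKC", ch) for ch in variants]
print(nfc == variants)   # True
print(nfkc)              # ['A', 'A', 'A']
```

NFKC discards the distinction on purpose, so it suits search and identifier matching better than storing text that must round-trip exactly.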

Code Points: Assigned Numbers

A code point is simply the integer that Unicode assigns to a character. It is a 21-bit number in the range 0 to 1,114,111 (0x10FFFF), written in U+ notation with at least four hex digits.
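A short sketch of moving between a character, its integer code point, and the U+ notation, using only Python built-ins. The helper names to_u_notation and from_u_notation are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the top of the Unicode code space

def to_u_notation(ch: str) -> str:
    """Format a one-character string as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+20AC' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {label!r}")
    value = int(label[2:], 16)
    if value > MAX_CODE_POINT:
        raise ValueError(f"beyond U+10FFFF: {label!r}")
    return chr(value)

print(ord("A"), to_u_notation("A"))   # 65 U+0041
print(to_u_notation("\U0001D400"))    # U+1D400
print(from_u_notation("U+2603"))      # ☃
```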

The code point is the bridge between the abstract world of characters and the concrete world of encodings. Every Unicode encoding scheme (UTF-8, UTF-16, UTF-32) is a function that maps code points to sequences of bytes; a legacy encoding such as Latin-1 does the same for only a small subset of code points (U+0000 to U+00FF). The character itself is agnostic to encoding: the code point is the stable identifier that every encoding references.

Character (abstract) → Code Point (integer) → Encoding (bytes)
      "A"                   65 / U+0041             0x41 (UTF-8)
                                                 0x41 0x00 (UTF-16LE)
                                             0x41 0x00 0x00 0x00 (UTF-32LE)
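The byte sequences in the diagram can be reproduced with Python's built-in codecs. This is a minimal sketch; the helper name encode_hex is an assumption made for the example.

```python
def encode_hex(ch: str) -> dict:
    """Return the bytes of one character under three encodings, as hex."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: ch.encode(codec).hex(" ") for codec in codecs}

print(encode_hex("A"))
# {'utf-8': '41', 'utf-16-le': '41 00', 'utf-32-le': '41 00 00 00'}

print(encode_hex("\u20ac"))      # EURO SIGN, U+20AC
# {'utf-8': 'e2 82 ac', 'utf-16-le': 'ac 20', 'utf-32-le': 'ac 20 00 00'}

print(encode_hex("\U0001D400"))  # outside the BMP: surrogate pair in UTF-16
# {'utf-8': 'f0 9d 90 80', 'utf-16-le': '35 d8 00 dc', 'utf-32-le': '00 d4 01 00'}
```

The code point stays the same in every row; only the byte representation changes.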

An encoded character has exactly one code point. But one code point does not always correspond to one user-visible character: combining character sequences and emoji sequences build a single grapheme cluster out of several code points.
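To make the gap between code points and user-visible characters concrete, the sketch below counts code points in two strings that each display as a single unit. Python's len counts code points, and the standard library has no grapheme segmentation, so the cluster count itself is stated in comments rather than computed.

```python
import unicodedata

accented = "e\u0301"  # displays as one "é": base letter + combining accent
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # one family emoji

for text in (accented, family):
    # One grapheme cluster on screen, several code points in the string.
    print(len(text), [unicodedata.name(ch) for ch in text])
```

Here len reports 2 and 5 respectively, even though a reader sees one character in each case.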

One character, one code point... usually

Most common characters map cleanly, one character to one code point:

LATIN SMALL LETTER E → U+0065
EURO SIGN → U+20AC
SNOWMAN → U+2603

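Exceptions arise with compatibility equivalences: cases where Unicode keeps multiple code points, usually inherited from historically separate encodings, for what is semantically the same character. For example, the micro sign U+00B5 (µ), inherited from Latin-1, and GREEK SMALL LETTER MU U+03BC (μ) are different code points, and compatibility normalization (NFKC) maps the former to the latter. Not every look-alike pair is an equivalence, though: the multiplication sign U+00D7 (×) from Latin-1 and MULTIPLICATION X U+2715 (✕) from Dingbats are simply distinct characters with similar appearances, and no normalization form unifies them.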
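A small sketch of a compatibility equivalence, using Python's unicodedata module. The micro sign and Greek mu pair is chosen here for illustration: the two are different code points that NFKC treats as the same character.

```python
import unicodedata

micro = "\u00B5"  # MICRO SIGN, inherited from Latin-1
mu = "\u03BC"     # GREEK SMALL LETTER MU

print(micro == mu)                                 # False: different code points
print(unicodedata.decomposition(micro))            # <compat> 03BC
print(unicodedata.normalize("NFC", micro) == mu)   # False: NFC leaves it alone
print(unicodedata.normalize("NFKC", micro) == mu)  # True: NFKC folds it
```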

Glyphs: Visual Shapes in Fonts

A glyph is the actual visual representation — a drawing, a set of curves and fills, a bitmap — that a font contains for a given code point. Glyphs live inside font files (.ttf, .otf, .woff2).

The relationship between characters/code points and glyphs is handled by the font and the text rendering engine:

The text stack has a string of code points to render
It looks up each code point in the active font's character map (cmap table)
The cmap maps code points to glyph IDs within the font
The rendering engine (FreeType, DirectWrite, Core Text, etc.) draws the glyph

This lookup step is where the complexity lives.
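The lookup can be pictured with a toy model. The sketch below is not a font parser: TOY_CMAP and its glyph IDs are invented for illustration, and a real text stack also runs shaping after this step. The one detail taken from OpenType is that glyph ID 0 is reserved for the "missing glyph" (.notdef).

```python
# Hypothetical cmap of a tiny font: code point -> glyph ID.
TOY_CMAP = {0x0041: 36, 0x0042: 37, 0x0066: 73, 0x0069: 76}

NOTDEF_GLYPH_ID = 0  # OpenType reserves glyph 0 for .notdef

def glyph_ids(text: str, cmap: dict = TOY_CMAP) -> list:
    """Map each code point to a glyph ID, or to .notdef if the font lacks it."""
    return [cmap.get(ord(ch), NOTDEF_GLYPH_ID) for ch in text]

print(glyph_ids("AB"))       # [36, 37]
print(glyph_ids("A\u2603"))  # [36, 0]  (this toy font has no snowman)
```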

One character, many glyphs

The same Unicode character can map to different glyphs depending on context:

Font variation: LATIN CAPITAL LETTER A appears as an upright serif glyph in Times New Roman, a geometric sans-serif glyph in Futura, and a rounded glyph in Comic Sans.

Contextual alternates: OpenType fonts can substitute different glyph variants based on surrounding characters. In a calligraphic Arabic font, the same character may have four different glyphs depending on whether it appears at the start, middle, end, or alone in a word.

Ligatures: The sequence f + i (two code points, two characters) is often rendered as a single ligature glyph ﬁ in professional typography fonts. Same characters, different glyph.

Mathematical variants: These are the opposite case. Bold, italic, script, Fraktur, and double-struck variants of letters have separate code points in Unicode's Mathematical Alphanumeric Symbols block (U+1D400–U+1D7FF), so they are distinct characters, even though their glyphs differ from the base letters only in weight and style.

One glyph, multiple code points

This is where confusables and homoglyphs emerge. Multiple different characters can have glyphs that look visually identical or nearly identical:

Characters	Code Points	Visual
Latin A	U+0041	A
Cyrillic А	U+0410	А
Greek Α (Alpha)	U+0391	Α
Mathematical Bold A	U+1D400	𝐀

The Latin, Cyrillic, and Greek versions of "A" look identical in many fonts but are completely different characters with different code points. This is the basis of homograph attacks (also called IDN homograph attacks), where a domain name like pаypal.com uses a Cyrillic а (U+0430) in place of a Latin a (U+0061) to create a deceptive but technically different URL.
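The sketch below shows how such a spoofed string can be told apart programmatically even when it looks the same on screen. The helper name differing_positions is hypothetical, and the strings use escapes so that the difference stays visible in source code.

```python
import unicodedata

genuine = "paypal.com"
spoofed = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A

def differing_positions(a: str, b: str) -> list:
    """List (index, name in a, name in b) wherever two equal-length strings differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]

print(genuine == spoofed)  # False
print(differing_positions(genuine, spoofed))
# [(1, 'LATIN SMALL LETTER A', 'CYRILLIC SMALL LETTER A')]

# Normalization does not help here: these are not equivalent characters.
print(unicodedata.normalize("NFKC", spoofed) == genuine)  # False
```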

Practical Implications for Developers

String comparison

Because the same visual text can be encoded multiple ways, naive byte comparison is unreliable for text equality. The word "café" can be encoded with é as a single precomposed code point (NFC form) or as e followed by a combining accent (NFD form). They look identical, but 'café' === 'café' returns false if they use different normalizations.

Always normalize before comparing:

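import unicodedata

s1 = 'caf\u00e9'     # precomposed é (U+00E9), NFC form
s2 = 'cafe\u0301'    # e + combining acute accent (U+0301), NFD form

s1 == s2                                                              # False
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)  # True

The JavaScript equivalent uses String.prototype.normalize:

const s1 = 'caf\u00e9';
const s2 = 'cafe\u0301';
s1.normalize('NFC') === s2.normalize('NFC')  // true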

Font fallback

When a font does not contain a glyph for a given code point, the operating system's text rendering engine performs font fallback: it looks for another installed font that does contain the glyph. The result is that a single string may be rendered using glyphs from multiple fonts — usually visually inconsistent.

Developers controlling typography (in apps, PDFs, game engines) need to consider whether their chosen font covers all the code points their users might enter. Web developers should specify appropriate fallback fonts in their CSS font stacks.
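Font fallback can be modeled as "pick the first font in the stack that covers this code point". The sketch below is a toy: the font names and coverage sets are invented, and real engines also weigh script, language, and style when choosing. It still shows why one string ends up split into runs drawn from different fonts.

```python
from itertools import groupby

# Hypothetical font stack: (font name, set of covered code points).
FONT_STACK = [
    ("PrimarySans", set(range(0x20, 0x7F))),          # printable ASCII
    ("FallbackCJK", set(range(0x4E00, 0x9FFF + 1))),  # CJK Unified Ideographs
]

def pick_font(ch: str, stack=FONT_STACK):
    """Return the first font covering ch, or None if no font does."""
    for name, coverage in stack:
        if ord(ch) in coverage:
            return name
    return None

def fallback_runs(text: str, stack=FONT_STACK) -> list:
    """Split text into consecutive runs that share the same chosen font."""
    keyed = groupby(text, key=lambda ch: pick_font(ch, stack))
    return [(font, "".join(chars)) for font, chars in keyed]

print(fallback_runs("Hi \u4e2d\u6587!"))
# [('PrimarySans', 'Hi '), ('FallbackCJK', '中文'), ('PrimarySans', '!')]
print(fallback_runs("A\u2603"))
# [('PrimarySans', 'A'), (None, '☃')]
```

A run whose font is None is the case that shows up on screen as a missing-glyph box.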

Security: confusable characters

Because different characters can have identical-looking glyphs, applications that accept user-supplied identifiers (usernames, domain names, file names) need to guard against confusable characters. Unicode provides the Confusables data set (confusables.txt, published with Unicode Technical Standard #39, Unicode Security Mechanisms), which lists characters and sequences with similar-looking glyphs.

For security-sensitive applications:

Apply Unicode normalization
Restrict to a known script (e.g., reject mixed-script identifiers)
Use the Confusables data to detect attempted spoofing
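A rough sketch of the first two checks follows. Python's standard library does not expose the Unicode Script property, so this example approximates a letter's script by the first word of its character name. That is a heuristic assumed for illustration only; a production system would use the real Script property and the Confusables data (for example through ICU). The function names are hypothetical.

```python
import unicodedata

def rough_scripts(identifier: str) -> set:
    """Approximate the set of scripts used by the letters of an identifier."""
    normalized = unicodedata.normalize("NFKC", identifier)
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]  # e.g. LATIN, CYRILLIC
        for ch in normalized
        if ch.isalpha()
    }

def looks_mixed_script(identifier: str) -> bool:
    """Flag identifiers whose letters appear to come from more than one script."""
    return len(rough_scripts(identifier)) > 1

print(looks_mixed_script("paypal"))        # False
print(looks_mixed_script("p\u0430ypal"))   # True: Latin plus Cyrillic
```

Rejecting mixed-script identifiers does not catch whole-script spoofs (an identifier written entirely in look-alike Cyrillic letters), which is where the Confusables data comes in.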

Rendering vs storage

When you store text in a database, you store code points (via an encoding like UTF-8). When you display it, the font system maps those code points to glyphs. These are separate concerns:

A code point may be valid and stored correctly in the database, yet render as a replacement box (□), an empty rectangle, or a question mark because no available font has a glyph for it
Two identical code point sequences may render differently depending on the user's operating system, installed fonts, and font rendering engine

The separation between character/code point and glyph is why "it works on my machine" is a common issue in internationalization — your machine has fonts that the user's machine lacks.

Summary: The Clean Mental Model

Think of it as three concentric layers:

┌─────────────────────────────────────┐
│  GLYPH (what is drawn)              │
│  Font-specific, OS/renderer-specific│
│  ┌───────────────────────────────┐  │
│  │  CODE POINT (what is encoded) │  │
│  │  A number: U+0041             │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │  CHARACTER (what it is) │  │  │
│  │  │  Abstract identity      │  │  │
│  │  └─────────────────────────┘  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Character: the meaning — stable, abstract, defined by Unicode
Code point: the number — stable, assigned by Unicode, used by all encodings
Glyph: the appearance — varies by font, renderer, and context

When a character appears wrong, it is usually a glyph problem (font missing or fallback mismatch). When two visually identical strings behave differently, it is usually a code point problem (normalization or confusable characters). When text is garbled or unreadable, it is usually an encoding problem (code points interpreted under the wrong encoding).
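The third case, an encoding problem, is easy to reproduce. The sketch below encodes a string as UTF-8 and then decodes the bytes as Latin-1, which is an assumed example of a wrong-encoding mix-up rather than the only way text gets garbled.

```python
original = "caf\u00e9"               # café, with precomposed é

stored = original.encode("utf-8")    # b'caf\xc3\xa9'
garbled = stored.decode("latin-1")   # wrong decoder: each byte becomes a character
print(garbled)                       # cafÃ©

# The code points were never damaged; undoing the wrong step recovers them.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)          # True
```

The stored bytes are correct throughout; only the interpretation of them is wrong, which is why this class of bug is fixed by declaring the right encoding, not by changing fonts or normalizing.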

Use the SymbolFYI Unicode Lookup tool to see a character's code point and properties, and the Character Counter tool to inspect exactly how many code points a string contains.