HTML Entities: The Complete Guide to Character References
HTML entities are the mechanism HTML uses to represent characters that either cannot appear directly in markup or would be interpreted as markup syntax. Understanding them fully, not just &amp; and &lt;, makes you a more precise developer and helps you avoid subtle bugs in templates, APIs, and content pipelines.
What Is an HTML Entity?
An HTML entity is a text string that begins with & and ends with ;. The browser parser replaces it with the corresponding Unicode character before rendering. There are three forms:
Named entity references — human-readable names defined in the HTML specification:
&copy; <!-- © -->
&mdash; <!-- — -->
&hellip; <!-- … -->
&nbsp; <!-- non-breaking space (U+00A0) -->
Decimal numeric character references — the Unicode code point in base 10:
&#169; <!-- © (U+00A9) -->
&#8212; <!-- — (U+2014) -->
&#8230; <!-- … (U+2026) -->
Hexadecimal numeric character references — the code point in base 16, prefixed with x:
&#xA9; <!-- © -->
&#x2014; <!-- — -->
&#x2026; <!-- … -->
All three forms for © are equivalent. The named form is the most readable; the hex form is the most common in generated output because it maps directly to Unicode code point notation (U+00A9 → &#xA9;).
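The equivalence is easy to check from code. The sketch below uses Python's standard html module to decode all three forms and to derive the numeric forms from a character. The helper name numeric_refs is hypothetical, chosen for this illustration only.

```python
import html

# All three reference forms for the copyright sign decode to the same character.
forms = ["&copy;", "&#169;", "&#xA9;"]
decoded = [html.unescape(form) for form in forms]


def numeric_refs(char):
    """Return the (decimal, hexadecimal) character references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"


print(decoded)                  # ['©', '©', '©']
print(numeric_refs("\u2014"))   # ('&#8212;', '&#x2014;')
```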
When Escaping Is Required
The HTML specification only requires escaping in specific contexts. Knowing exactly where is important so you do not over-escape (breaking readability) or under-escape (introducing bugs or vulnerabilities).
In text content
The characters < and & must be escaped in text nodes because they start tag and entity syntax respectively:
<!-- Wrong: breaks parsing -->
<p>Use if (a < b) && (c > d) to compare.</p>
<!-- Correct -->
<p>Use if (a &lt; b) &amp;&amp; (c &gt; d) to compare.</p>
> does not technically need escaping in text content, but escaping it is harmless and many sanitizers do it anyway.
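When the paragraph is generated from code, the same escaping can be applied with a library call. This is a minimal sketch using Python's html.escape; passing quote=False leaves quote characters alone, which is sufficient for text nodes but not for attribute values.

```python
import html

source = "Use if (a < b) && (c > d) to compare."

# quote=False escapes only &, < and >, which is all a text node needs.
escaped = html.escape(source, quote=False)
paragraph = f"<p>{escaped}</p>"

print(paragraph)
# <p>Use if (a &lt; b) &amp;&amp; (c &gt; d) to compare.</p>
```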
In attribute values
Inside quoted attributes, you must escape the quote character being used and &:
<!-- Double-quoted: escape " and & -->
<a href="search?q=rock+&amp;+roll&amp;lang=en" title="Rock &amp; Roll">
<!-- Single-quoted: escape ' and & -->
<a href='search?q=it&apos;s-fine'>
The &apos; entity is valid in HTML5 but was not in HTML 4. For maximum compatibility in HTML attributes, use &#39; or switch to double quotes.
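Attribute values built in code usually need two layers of encoding: URL encoding for the query string, then HTML escaping for the attribute. The sketch below assumes a hypothetical build_link helper and the search path used in the example above. A literal & inside a query value is percent-encoded by urlencode, while the & that separates parameters is HTML-escaped.

```python
import html
from urllib.parse import urlencode


def build_link(query, title):
    """Return an opening <a> tag with a double-quoted href and title."""
    # Layer 1: URL-encode the parameters (a literal & in a value becomes %26).
    href = "search?" + urlencode({"q": query, "lang": "en"})
    # Layer 2: HTML-escape both values; the default quote=True also
    # escapes " and ', so the values cannot close the attribute early.
    return f'<a href="{html.escape(href)}" title="{html.escape(title)}">'


print(build_link("rock & roll", "Rock & Roll"))
# <a href="search?q=rock+%26+roll&amp;lang=en" title="Rock &amp; Roll">
```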
In raw text elements
<script> and <style> are raw text elements — the parser does not process entities inside them. Do not use HTML entities inside JavaScript string literals embedded in <script> tags:
<!-- Wrong: the &amp; is NOT decoded inside <script> -->
<script>
const name = "Rock &amp; Roll"; // literal string contains "&amp;"
</script>
<!-- Correct -->
<script>
const name = "Rock & Roll";
</script>
If you need to embed user-controlled data into a <script> block, use JSON serialization, not HTML entity encoding.
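One way to do this on the server is sketched below in Python. The helper name to_script_json is an assumption for this example, not a library function. It serializes with json.dumps and then rewrites the characters the HTML parser reacts to as JSON \uXXXX escapes, so a value containing </script> cannot end the block early.

```python
import json


def to_script_json(value):
    """Serialize value as JSON that is safe to place inside a <script> block."""
    text = json.dumps(value)
    # In json.dumps output, <, > and & can only occur inside string literals,
    # where \uXXXX escapes are valid JSON and decode to the same characters.
    return (
        text.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


payload = to_script_json({"name": "Rock & Roll </script>"})
script = f"<script>const data = {payload};</script>"
print(script)
```

The data round-trips unchanged: parsing the payload as JSON yields the original string, with no HTML entity decoding involved.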
The &nbsp; Trap
&nbsp; (U+00A0, NO-BREAK SPACE) is one of the most misused entities. It looks identical to a regular space but behaves differently:
- It prevents line breaking between adjacent words
- It is not collapsed by CSS white-space: normal
- Screen readers may announce it differently
- It is invisible in most text editors
<!-- Avoid using &nbsp; for layout spacing -->
<td>&nbsp;&nbsp;&nbsp;Padded text</td> <!-- use CSS padding instead -->
<!-- Legitimate use: prevent unwanted line breaks -->
<span>10&nbsp;kg</span> <!-- keeps "10" and "kg" together -->
<span>§&nbsp;42</span> <!-- section symbol stays with its number -->
<span>Dr.&nbsp;Smith</span> <!-- title stays with name -->
For layout spacing, always use CSS padding, margin, or gap. Reserve &nbsp; for semantic no-break situations.
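Because a literal U+00A0 is indistinguishable from a space in most editors, generated markup is easier to review when the character is written out as an entity. The following is a small sketch under that assumption; join_no_break and to_markup are hypothetical helper names.

```python
import html

NBSP = "\u00a0"  # NO-BREAK SPACE


def join_no_break(value, unit):
    """Join a value and its unit so they cannot be split across lines."""
    return f"{value}{NBSP}{unit}"


def to_markup(text):
    """Escape text for HTML and make every no-break space visible in source."""
    escaped = html.escape(text, quote=False)
    return escaped.replace(NBSP, "&nbsp;")


label = join_no_break(10, "kg")
print(to_markup(label))    # 10&nbsp;kg
print(label == "10 kg")    # False: U+00A0 is not U+0020
```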
Named Entity Reference Table
The HTML5 specification defines over 2,000 named character references. Here are the ones you will actually use:
Punctuation and typography
| Entity | Character | Unicode | Description |
|---|---|---|---|
| &amp; | & | U+0026 | Ampersand |
| &lt; | < | U+003C | Less-than sign |
| &gt; | > | U+003E | Greater-than sign |
| &quot; | " | U+0022 | Quotation mark |
| &apos; | ' | U+0027 | Apostrophe (HTML5) |
| &mdash; | — | U+2014 | Em dash |
| &ndash; | – | U+2013 | En dash |
| &hellip; | … | U+2026 | Horizontal ellipsis |
| &laquo; | « | U+00AB | Left double angle quote |
| &raquo; | » | U+00BB | Right double angle quote |
| &ldquo; | “ | U+201C | Left double quotation mark |
| &rdquo; | ” | U+201D | Right double quotation mark |
| &lsquo; | ‘ | U+2018 | Left single quotation mark |
| &rsquo; | ’ | U+2019 | Right single quotation mark |
Special spaces
| Entity | Character | Unicode | Description |
|---|---|---|---|
| &nbsp; | (NBSP) | U+00A0 | No-break space |
| &ensp; | (EN SP) | U+2002 | En space |
| &emsp; | (EM SP) | U+2003 | Em space |
| &thinsp; | (THIN SP) | U+2009 | Thin space |
| &zwnj; | (ZWNJ) | U+200C | Zero-width non-joiner |
| &zwj; | (ZWJ) | U+200D | Zero-width joiner |
Symbols and currency
| Entity | Character | Unicode | Description |
|---|---|---|---|
| &copy; | © | U+00A9 | Copyright sign |
| &reg; | ® | U+00AE | Registered sign |
| &trade; | ™ | U+2122 | Trade mark sign |
| &euro; | € | U+20AC | Euro sign |
| &pound; | £ | U+00A3 | Pound sign |
| &yen; | ¥ | U+00A5 | Yen sign |
| &deg; | ° | U+00B0 | Degree sign |
| &plusmn; | ± | U+00B1 | Plus-minus sign |
| &times; | × | U+00D7 | Multiplication sign |
| &divide; | ÷ | U+00F7 | Division sign |
| &infin; | ∞ | U+221E | Infinity |
| &ne; | ≠ | U+2260 | Not equal to |
Arrows
| Entity | Character | Unicode | Description |
|---|---|---|---|
| &larr; | ← | U+2190 | Leftwards arrow |
| &rarr; | → | U+2192 | Rightwards arrow |
| &uarr; | ↑ | U+2191 | Upwards arrow |
| &darr; | ↓ | U+2193 | Downwards arrow |
| &harr; | ↔ | U+2194 | Left right arrow |
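The tables above can also be queried from code. Python's standard library ships the HTML5 list as html.entities.html5 (keys include the trailing semicolon) and the older HTML 4 list as html.entities.codepoint2name. The sketch below uses a hypothetical named_reference helper that prefers an HTML 4 name and falls back to a hex reference.

```python
from html.entities import codepoint2name, html5

# html5 maps reference names (with the trailing semicolon) to characters.
print(html5["rarr;"])       # →
print(len(html5) > 2000)    # True


def named_reference(char):
    """Return an HTML 4 named reference if one exists, else a hex reference."""
    name = codepoint2name.get(ord(char))
    return f"&{name};" if name else f"&#x{ord(char):X};"


print(named_reference("©"))        # &copy;
print(named_reference("\u2603"))   # &#x2603;
```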
HTML Entities vs. Direct Unicode Characters
Modern HTML documents are almost always UTF-8. In UTF-8, you can write most Unicode characters directly without entities:
<!-- Both are valid in UTF-8 HTML -->
<p>Copyright © 2024 Acme Corp</p>
<p>Copyright &copy; 2024 Acme Corp</p>
<!-- Both produce identical DOM -->
<p>Price: €49.99</p>
<p>Price: &euro;49.99</p>
The direct form is more readable in source and equally safe when your document is properly declared as UTF-8:
<meta charset="UTF-8">
Use entities when:
- Your editor or build pipeline cannot reliably preserve certain Unicode characters
- You are generating HTML in a context where the output encoding is not guaranteed to be UTF-8
- The character is invisible or confusable (e.g., &nbsp; over a literal non-breaking space that looks identical to a regular space)
- You need to store HTML snippets in a system that strips non-ASCII characters
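For the cases where the output must survive an ASCII-only channel, Python's built-in xmlcharrefreplace error handler converts every non-ASCII character into a decimal character reference. This is a sketch with a hypothetical helper name, to_ascii_html; markup-significant characters still need to be escaped first.

```python
import html


def to_ascii_html(text):
    """Escape text for HTML and express non-ASCII characters as numeric references."""
    escaped = html.escape(text, quote=False)
    # Characters that ASCII cannot represent are replaced with &#NNNN; references.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Price: €49.99 & up"))
# Price: &#8364;49.99 &amp; up
```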
Escaping in Template Engines
Most major template engines auto-escape HTML by default, although some (standalone Jinja2, for example) only do so once autoescaping is switched on. Know what your engine escapes:
# Django templates — auto-escapes &, <, >, ", '
{{ user_input }} # safe — escaped automatically
{{ user_input|safe }} # unsafe — disables escaping
{% autoescape off %}...{% endautoescape %} # unsafe block
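// Jinja2 (Python): same escaping as Django once autoescape is enabled (Flask turns it on for .html templates; a bare Environment() leaves it off)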
{{ user_input }} // escaped
{{ user_input|safe }} // raw
// Handlebars (JS)
{{ user_input }} // escaped: & < > " ' ` =
{{{ user_input }}} // raw — triple curly means unescaped
// React JSX — auto-escapes text content
<p>{userInput}</p> // safe
<p dangerouslySetInnerHTML={{__html: raw}} /> // unsafe — name says it all
The common mistake is double-escaping: taking already-escaped HTML and running it through the escaper again, producing &amp;lt; instead of &lt;. If you see literal entity strings appearing in your rendered page, this is almost always the cause.
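The effect is easy to reproduce with Python's html module. In this small sketch, the second call to html.escape treats the & of the first pass as data and escapes it again, so decoding once (which is all a browser does) leaves a visible entity string.

```python
import html

raw = "a < b"
once = html.escape(raw)      # 'a &lt; b'
twice = html.escape(once)    # 'a &amp;lt; b'

# A browser decodes exactly one level, so the reader sees the entity text.
visible = html.unescape(twice)
print(visible)               # a &lt; b
```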
Generating Entities Programmatically
When building HTML in code, use your language's dedicated escaping function rather than implementing your own:
# Python — html module (stdlib)
import html
html.escape('<script>alert("xss")</script>')
# → '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'
html.escape("Rock & Roll", quote=False) # don't escape quotes
# → 'Rock &amp; Roll'
html.unescape('&lt;p&gt;Hello&lt;/p&gt;')
# → '<p>Hello</p>'
// JavaScript — no stdlib function, but this pattern is reliable
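function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;') // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
// Or use the DOM itself (browser only). Serializing a text node escapes
// &, < and > but not quotes, so use this for text content, not attribute values.
function escapeHtmlViaDom(str) {
  const div = document.createElement('div');
  div.textContent = str;
  return div.innerHTML;
}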
Never build your own escaper by replacing just < and >. Missing & means a second pass of escaping will corrupt already-escaped content, and missing " opens attribute injection vulnerabilities.
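The attribute injection risk can be shown in a few lines. The sketch below compares a deliberately incomplete escaper (naive_escape, a hypothetical name) with Python's html.escape on a value that contains a double quote.

```python
import html


def naive_escape(text):
    # Deliberately incomplete: handles only < and >.
    return text.replace("<", "&lt;").replace(">", "&gt;")


user_value = '" onfocus="alert(1)'

# The unescaped quote closes the attribute and injects a new one.
unsafe = f'<input value="{naive_escape(user_value)}">'
# html.escape turns the quote into &quot;, so the value stays inside the attribute.
safe = f'<input value="{html.escape(user_value)}">'

print(unsafe)   # <input value="" onfocus="alert(1)">
print(safe)     # <input value="&quot; onfocus=&quot;alert(1)">
```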
Common Pitfalls
Entities in JSON: JSON does not use HTML entities. If you are storing HTML-escaped content in JSON, the consumer must un-escape it. Prefer storing raw content in JSON and escaping at render time.
Entities in email: Many email clients have inconsistent HTML support. Numeric entities are safer than named entities in email HTML, as some older clients do not implement the full named entity list.
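If an email pipeline receives markup with named entities, they can be rewritten to numeric references before sending. This is a sketch only: named_to_numeric is a hypothetical helper, and it leaves the four universally supported markup entities and any unknown names untouched.

```python
import re
from html.entities import html5

# Names that every HTML-capable client understands; keep them readable.
KEEP = {"amp", "lt", "gt", "quot"}
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Rewrite named character references as decimal numeric references."""
    def replace(match):
        name = match.group(1)
        chars = html5.get(name + ";")
        if chars is None or name in KEEP:
            return match.group(0)
        # A few names map to two code points, so emit one reference per character.
        return "".join(f"&#{ord(char)};" for char in chars)

    return NAMED_REF.sub(replace, markup)


print(named_to_numeric("&copy; 2024 &mdash; A &amp; B"))
# &#169; 2024 &#8212; A &amp; B
```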
The semicolon is not always required: HTML parsers accept a fixed set of legacy entities without a closing semicolon in some contexts (&amp parses as &). Always include the semicolon. Omitting it causes subtle breakage when the text that follows happens to continue the name: in text content, &notit; is decoded as ¬it; because the parser matches the legacy name "not".
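Python's html.unescape applies the HTML5 text-content rules, so it can be used to observe this behavior outside a browser. The outputs below reflect the CPython standard library implementation.

```python
import html

# A legacy name without a semicolon is still decoded.
no_semicolon = html.unescape("&amp")

# The longest legacy prefix wins, which mangles text that merely starts with it.
mangled = html.unescape("I'm &notit; I tell you")

print(no_semicolon)   # &
print(mangled)        # I'm ¬it; I tell you
```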
Not all named entities are in HTML 4: &apos; and many mathematical entities (&nexist;, &mapsto;, etc.) were added in HTML5. If you need to support very old parsers, use numeric references instead.
Practical Checklist
Before shipping HTML content, verify:
- Text content is escaped for < and & at minimum
- Attribute values are escaped for the quote character in use and &
- No HTML entities appear inside <script> or <style> blocks
- Template auto-escaping is enabled and not accidentally disabled with |safe or {{{ }}}
- No double-escaping in content pipelines that pass data through multiple layers
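The last item can be partly automated. The sketch below is a heuristic, not a validator: find_double_escapes is a hypothetical helper that flags an escaped ampersand immediately followed by what looks like the rest of a character reference. Pages that legitimately display entity syntax (such as documentation about entities) will trigger false positives.

```python
import re

# "&amp;" followed by a name or number and a semicolon, e.g. "&amp;lt;".
DOUBLE_ESCAPED = re.compile(
    r"&amp;(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)


def find_double_escapes(markup):
    """Return every substring that looks like a double-escaped reference."""
    return DOUBLE_ESCAPED.findall(markup)


print(find_double_escapes("<p>a &amp;lt; b &amp; c &amp;#169;</p>"))
# ['&amp;lt;', '&amp;#169;']
```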
Use the SymbolFYI Encoding Converter to inspect code points and generate the correct entity form for any character.
Next in Series: CSS Content Property: Using Unicode Symbols in Stylesheets — how to inject Unicode characters via CSS ::before and ::after, write correct escape sequences, and handle accessibility when decorating with symbols.