HTML Entities: The Complete Guide to Character References

Web Development Symbols for Developers · Feb 6, 2024

HTML entities are the mechanism HTML uses to represent characters that either cannot appear directly in markup or would be interpreted as markup syntax. Understanding them fully, not just &amp; and &lt;, makes you a more precise developer and helps you avoid subtle bugs in templates, APIs, and content pipelines.

What Is an HTML Entity?

An HTML entity is a text string that begins with & and ends with ;. The browser parser replaces it with the corresponding Unicode character before rendering. There are three forms:

Named entity references — human-readable names defined in the HTML specification:

&copy;   <!-- © -->
&mdash;  <!-- — -->
&hellip; <!-- … -->
&nbsp;   <!-- non-breaking space (U+00A0) -->

Decimal numeric character references — the Unicode code point in base 10:

&#169;   <!-- © (U+00A9) -->
&#8212;  <!-- — (U+2014) -->
&#8230;  <!-- … (U+2026) -->

Hexadecimal numeric character references — the code point in base 16, prefixed with x:

&#xA9;   <!-- © -->
&#x2014; <!-- — -->
&#x2026; <!-- … -->

All three forms for © are equivalent. The named form is the most readable; the hex form is common in generated output because it maps directly to Unicode code point notation (U+00A9 → &#xA9;).
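The equivalence of the three forms can be checked with Python's standard library. This is a minimal sketch: `reference_forms` is a hypothetical helper name, and `codepoint2name` only knows the HTML 4 names, so characters without one get `None` for the named form.

```python
import html
from html.entities import codepoint2name


def reference_forms(char: str) -> dict:
    """Return the named, decimal, and hex references for one character.

    Hypothetical helper for illustration; codepoint2name covers HTML 4 names only.
    """
    cp = ord(char)
    name = codepoint2name.get(cp)
    return {
        "named": f"&{name};" if name else None,
        "decimal": f"&#{cp};",
        "hex": f"&#x{cp:X};",
    }


forms = reference_forms("©")
print(forms)
# {'named': '&copy;', 'decimal': '&#169;', 'hex': '&#xA9;'}

# Every form decodes back to the same single character.
print(all(html.unescape(ref) == "©" for ref in forms.values()))  # True
```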

When Escaping Is Required

The HTML specification only requires escaping in specific contexts. Knowing exactly where is important so you do not over-escape (breaking readability) or under-escape (introducing bugs or vulnerabilities).

In text content

The characters < and & must be escaped in text nodes because they start tag and entity syntax respectively:

<!-- Wrong: breaks parsing -->
<p>Use if (a < b) && (c > d) to compare.</p>

<!-- Correct -->
<p>Use if (a &lt; b) &amp;&amp; (c &gt; d) to compare.</p>

> does not technically need escaping in text content, but escaping it is harmless and many sanitizers do it anyway.

In attribute values

Inside quoted attributes, you must escape the quote character being used and &:

<!-- Double-quoted: escape " and & -->
<a href="search?q=rock+%26+roll&amp;lang=en" title="Rock &amp; Roll">

<!-- Single-quoted: escape ' and & -->
<a href='search?q=it&apos;s-fine'>

The &apos; entity is valid in HTML5 but was not in HTML 4. For maximum compatibility in HTML attributes, use &#39; or switch to double quotes.
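When attribute values are built in code, there are usually two layers to handle: URL encoding for query values, then HTML escaping for the attribute as a whole. The sketch below assumes a hypothetical `link` helper and uses only `urllib.parse.urlencode` and `html.escape` from the standard library. Note that a literal & inside a query value has to be percent-encoded (`%26`), while the & that separates parameters is HTML-escaped as `&amp;`.

```python
import html
from urllib.parse import urlencode


def link(path: str, params: dict, title: str) -> str:
    """Build an opening <a> tag with double-quoted attributes (illustrative helper)."""
    query = urlencode(params)                           # layer 1: URL-encode the values
    href = html.escape(f"{path}?{query}", quote=True)   # layer 2: HTML-escape the URL
    safe_title = html.escape(title, quote=True)         # escapes & < > " and '
    return f'<a href="{href}" title="{safe_title}">'


print(link("search", {"q": "rock & roll", "lang": "en"}, 'Rock & "Roll"'))
# <a href="search?q=rock+%26+roll&amp;lang=en" title="Rock &amp; &quot;Roll&quot;">
```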

In raw text elements

<script> and <style> are raw text elements — the parser does not process entities inside them. Do not use HTML entities inside JavaScript string literals embedded in <script> tags:

<!-- Wrong: the &amp; is NOT decoded inside <script> -->
<script>
  const name = "Rock &amp; Roll"; // literal string contains "&amp;"
</script>

<!-- Correct -->
<script>
  const name = "Rock & Roll";
</script>

If you need to embed user-controlled data into a <script> block, use JSON serialization, not HTML entity encoding.
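One way to apply the JSON approach from Python is sketched below. `script_json` is a hypothetical helper, not a library function. Plain `json.dumps` output can still contain the text `</script>`, which would end the block early, so this sketch additionally rewrites <, >, and & as `\uXXXX` escapes, which are valid inside JSON and JavaScript string literals.

```python
import json


def script_json(data) -> str:
    """Serialize data for embedding inside a <script> block (illustrative helper)."""
    text = json.dumps(data)
    # These characters only occur inside string values in JSON output,
    # so replacing them with \uXXXX escapes keeps the JSON valid.
    return (
        text.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


payload = script_json({"name": "Rock & Roll </script>"})
print(f"<script>const data = {payload};</script>")

# The escaped payload still parses back to the original value.
print(json.loads(payload) == {"name": "Rock & Roll </script>"})  # True
```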

The `&nbsp;` Trap

&nbsp; (U+00A0, NO-BREAK SPACE) is one of the most misused entities. It looks identical to a regular space but behaves differently:

It prevents line breaking between adjacent words
It is not collapsed by CSS white-space: normal
Screen readers may announce it differently
It is invisible in most text editors

<!-- Avoid using &nbsp; for layout spacing -->
<td>&nbsp;&nbsp;&nbsp;Padded text</td>  <!-- use CSS padding instead -->

<!-- Legitimate use: prevent unwanted line breaks -->
<span>10&nbsp;kg</span>       <!-- keeps "10" and "kg" together -->
<span>§&nbsp;42</span>        <!-- section number and its symbol -->
<span>Dr.&nbsp;Smith</span>   <!-- title stays with name -->

For layout spacing, always use CSS padding, margin, or gap. Reserve &nbsp; for semantic no-break situations.
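Because a literal U+00A0 is indistinguishable from a space on screen, it can help to make it visible when processing content. The following is a small sketch; `reveal_nbsp` is a hypothetical helper name.

```python
import html

NBSP = "\u00a0"  # NO-BREAK SPACE


def reveal_nbsp(text: str) -> str:
    """Replace literal no-break spaces with the visible &nbsp; reference."""
    return text.replace(NBSP, "&nbsp;")


sample = "10\u00a0kg"
print(sample == "10 kg")     # False: looks the same, compares differently
print(reveal_nbsp(sample))   # 10&nbsp;kg

# Decoding the entity gives back the original string.
print(html.unescape(reveal_nbsp(sample)) == sample)  # True
```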

Named Entity Reference Table

The HTML5 specification defines over 2,000 named character references. Here are the ones you will actually use:

Punctuation and typography

Entity	Character	Unicode	Description
`&amp;`	&	U+0026	Ampersand
`&lt;`	<	U+003C	Less-than sign
`&gt;`	>	U+003E	Greater-than sign
`&quot;`	"	U+0022	Quotation mark
`&apos;`	'	U+0027	Apostrophe (HTML5)
`&mdash;`	—	U+2014	Em dash
`&ndash;`	–	U+2013	En dash
`&hellip;`	…	U+2026	Horizontal ellipsis
`&laquo;`	«	U+00AB	Left double angle quote
`&raquo;`	»	U+00BB	Right double angle quote
`&ldquo;`	“	U+201C	Left double quotation mark
`&rdquo;`	”	U+201D	Right double quotation mark
`&lsquo;`	‘	U+2018	Left single quotation mark
`&rsquo;`	’	U+2019	Right single quotation mark

Special spaces

Entity	Character	Unicode	Description
`&nbsp;`	(NBSP)	U+00A0	No-break space
`&ensp;`	(EN SP)	U+2002	En space
`&emsp;`	(EM SP)	U+2003	Em space
`&thinsp;`	(THIN SP)	U+2009	Thin space
`&zwnj;`	(ZWNJ)	U+200C	Zero-width non-joiner
`&zwj;`	(ZWJ)	U+200D	Zero-width joiner

Symbols and currency

Entity	Character	Unicode	Description
`&copy;`	©	U+00A9	Copyright sign
`&reg;`	®	U+00AE	Registered sign
`&trade;`	™	U+2122	Trade mark sign
`&euro;`	€	U+20AC	Euro sign
`&pound;`	£	U+00A3	Pound sign
`&yen;`	¥	U+00A5	Yen sign
`&deg;`	°	U+00B0	Degree sign
`&plusmn;`	±	U+00B1	Plus-minus sign
`&times;`	×	U+00D7	Multiplication sign
`&divide;`	÷	U+00F7	Division sign
`&infin;`	∞	U+221E	Infinity
`&ne;`	≠	U+2260	Not equal to

Arrows

Entity	Character	Unicode	Description
`&larr;`	←	U+2190	Leftwards arrow
`&rarr;`	→	U+2192	Rightwards arrow
`&uarr;`	↑	U+2191	Upwards arrow
`&darr;`	↓	U+2193	Downwards arrow
`&harr;`	↔	U+2194	Left right arrow

HTML Entities vs. Direct Unicode Characters

Modern HTML documents are almost always UTF-8. In UTF-8, you can write most Unicode characters directly without entities:

<!-- Both are valid in UTF-8 HTML -->
<p>Copyright &copy; 2024 Acme Corp</p>
<p>Copyright © 2024 Acme Corp</p>

<!-- Both produce identical DOM -->
<p>Price: &euro;49.99</p>
<p>Price: €49.99</p>

The direct form is more readable in source and equally safe when your document is properly declared as UTF-8:

<meta charset="UTF-8">

Use entities when:

Your editor or build pipeline cannot reliably preserve certain Unicode characters
You are generating HTML in a context where the output encoding is not guaranteed to be UTF-8
The character is invisible or confusable (e.g., &nbsp; rather than a literal non-breaking space that looks identical to a regular space)
You need to store HTML snippets in a system that strips non-ASCII characters
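For the cases where output must survive a non-UTF-8 or ASCII-only channel, Python's built-in `xmlcharrefreplace` error handler converts every non-ASCII character to a decimal numeric reference. A minimal sketch, with `to_ascii_html` as a hypothetical helper name:

```python
import html


def to_ascii_html(text: str) -> str:
    """Escape markup characters, then turn non-ASCII into numeric references."""
    escaped = html.escape(text, quote=False)  # handles &, <, >
    # xmlcharrefreplace emits &#NNNN; for anything outside ASCII.
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Price: €49.99 & up, © Acme"))
# Price: &#8364;49.99 &amp; up, &#169; Acme
```

Escaping runs first so that the ampersands introduced by the numeric references are not themselves escaped.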

Escaping in Template Engines

Most major template engines auto-escape HTML, either by default or once configured to do so. Know what your engine escapes and whether escaping is actually enabled:

# Django templates — auto-escapes &, <, >, ", '
{{ user_input }}             # safe — escaped automatically
{{ user_input|safe }}        # unsafe — disables escaping
{% autoescape off %}...{% endautoescape %}  # unsafe block

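// Jinja2 (Python): same characters as Django, but only when autoescape is enabled
// (a bare Environment defaults to off; Flask enables it for .html templates)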
{{ user_input }}        // escaped
{{ user_input|safe }}   // raw

// Handlebars (JS)
{{ user_input }}        // escaped: & < > " ' ` =
{{{ user_input }}}      // raw — triple curly means unescaped

// React JSX — auto-escapes text content
<p>{userInput}</p>                          // safe
<p dangerouslySetInnerHTML={{__html: raw}} /> // unsafe — name says it all

The common mistake is double-escaping: taking already-escaped HTML and running it through the escaper again, producing &amp;lt; instead of &lt;. If you see literal entity strings appearing in your rendered page, this is almost always the cause.
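Double-escaping is easy to reproduce with `html.escape`. The sketch below also shows one possible guard: a marker type for strings that are already escaped, similar in spirit to the "safe string" types template engines use. `Escaped` and `escape_once` are hypothetical names for illustration.

```python
import html

raw = "a < b"
once = html.escape(raw)
twice = html.escape(once)
print(once)   # a &lt; b
print(twice)  # a &amp;lt; b  (renders as the literal text "&lt;")


class Escaped(str):
    """Marker type: a string that has already been HTML-escaped."""


def escape_once(value: str) -> Escaped:
    """Escape plain strings; pass already-escaped strings through unchanged."""
    if isinstance(value, Escaped):
        return value
    return Escaped(html.escape(value))


print(escape_once(escape_once(raw)))  # a &lt; b
```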

Generating Entities Programmatically

When building HTML in code, use your language's dedicated escaping function rather than implementing your own:

# Python — html module (stdlib)
import html

html.escape('<script>alert("xss")</script>')
# → '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'

html.escape("Rock & Roll", quote=False)  # don't escape quotes
# → 'Rock &amp; Roll'

html.unescape('&lt;p&gt;Hello&lt;/p&gt;')
# → '<p>Hello</p>'

// JavaScript — no stdlib function, but this pattern is reliable
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

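// Or use the DOM itself (browser only). Text content only: innerHTML
// serialization does not escape quotes, so do not use this for attribute values.
function escapeHtmlText(str) {
  const div = document.createElement('div');
  div.textContent = str;
  return div.innerHTML;
}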

Never build your own escaper by replacing just < and >. Missing & means that user text which happens to contain an entity-like sequence (for example the literal text &lt;) is decoded by the browser instead of being shown verbatim, and missing " opens attribute injection vulnerabilities.

Common Pitfalls

Entities in JSON: JSON does not use HTML entities. If you are storing HTML-escaped content in JSON, the consumer must un-escape it. Prefer storing raw content in JSON and escaping at render time.

Entities in email: Many email clients have inconsistent HTML support. Numeric entities are safer than named entities in email HTML, as some older clients do not implement the full named entity list.
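A sketch of converting named references to numeric ones before sending email HTML, using the `html5` table from Python's `html.entities` module. `named_to_numeric` is a hypothetical helper. It leaves the four core markup entities alone on the assumption that every client supports them, and it only touches references that end with a semicolon.

```python
import re
from html.entities import html5  # maps names like "copy;" to characters

_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
_KEEP = {"amp", "lt", "gt", "quot"}  # assumed to be universally supported


def named_to_numeric(markup: str) -> str:
    """Rewrite named character references as decimal numeric references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        chars = html5.get(name + ";")
        if name in _KEEP or chars is None:
            return match.group(0)  # leave core and unknown names untouched
        # A few names expand to two code points, so emit one reference per character.
        return "".join(f"&#{ord(c)};" for c in chars)

    return _NAMED_REF.sub(replace, markup)


print(named_to_numeric("&copy; 2024 Acme &mdash; Rock &amp; Roll"))
# &#169; 2024 Acme &#8212; Rock &amp; Roll
```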

The semicolon is not always required: For backward compatibility, HTML parsers accept a fixed set of legacy entities without a closing semicolon (&amp parses as &). Always include the semicolon. Omitting it causes subtle breakage when the entity is followed by certain characters: in text content, &copy=2 renders as ©=2 and &notit; renders as ¬it;.
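Python's `html.unescape` applies the HTML5 rules for text content, so it can be used to see how semicolon-less references are interpreted. A short sketch (attribute values follow slightly different rules in a real HTML parser, which this function does not model):

```python
import html

# A legacy name without a semicolon is still decoded.
print(html.unescape("&amp"))         # &

# The longest legacy prefix wins: "&not" is decoded, "it;" is left behind.
print(html.unescape("&notit;"))      # ¬it;

# An unescaped query string in text content gets mangled the same way.
print(html.unescape("?a=1&copy=2"))  # ?a=1©=2

# With the ampersand escaped properly, the text round-trips intact.
print(html.unescape("?a=1&amp;copy=2"))  # ?a=1&copy=2
```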

Not all named entities are in HTML 4: &apos; and many of the more specialized mathematical entities (&setminus;, &mapsto;, etc.) were added in HTML5. If you need to support very old parsers, use numeric references instead.

Practical Checklist

Before shipping HTML content, verify:

Text content is escaped for < and & at minimum
Attribute values are escaped for the quote character in use and &
No HTML entities appear inside <script> or <style> blocks
Template auto-escaping is enabled and not accidentally disabled with |safe or {{{ }}}
No double-escaping in content pipelines that pass data through multiple layers

Use the SymbolFYI Encoding Converter to inspect code points and generate the correct entity form for any character.

Next in Series: CSS Content Property: Using Unicode Symbols in Stylesheets — how to inject Unicode characters via CSS ::before and ::after, write correct escape sequences, and handle accessibility when decorating with symbols.
