SymbolFYI

What Is Unicode? The Universal Character Standard Explained

If you have ever pasted text from one application into another and watched it turn into a string of question marks or boxes, you have witnessed an encoding mismatch. Understanding why that happens — and why it mostly does not happen anymore — requires understanding Unicode. This article explains what Unicode is, how it works, and why every developer who handles text needs to understand it.

The Problem Unicode Solved

Before Unicode, the computing world was fragmented. Every language community, every software vendor, and every operating system used its own character encoding scheme. ASCII (American Standard Code for Information Interchange), developed in 1963, defined 128 characters — enough for English letters, digits, punctuation, and control characters. It fit neatly in 7 bits.

ASCII worked fine for English-speaking engineers in the early internet era. But the moment software crossed linguistic borders, problems appeared. How do you represent an accented é, a German ü, or a Japanese kanji?

The industry responded with a patchwork of solutions:

  • Latin-1 (ISO 8859-1) added 128 more characters for Western European languages using the 8th bit.
  • Code pages 437 and 850 were IBM's DOS encodings for different regions; code page 1252 was Microsoft's Windows encoding for Western European languages.
  • Shift-JIS and EUC-JP handled Japanese, EUC-KR handled Korean, and Big5 handled Traditional Chinese.
  • KOI8-R handled Russian Cyrillic.

There were hundreds of these encoding standards, and they were largely incompatible. A file saved as Shift-JIS would display as garbage when opened by software expecting Latin-1. Internationalizing software meant handling a different encoding for every target market. The web made things worse: a single server might receive requests from users all over the world, each potentially sending text in a different encoding.
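That incompatibility is easy to reproduce today. Here is a minimal Python sketch (the sample string is just an illustration) that saves text as Shift-JIS and reads it back as Latin-1:

```python
text = "文字化け"  # Japanese for "mojibake": garbled text

raw = text.encode("shift_jis")   # bytes a Japanese editor would write to disk
garbled = raw.decode("latin-1")  # what software expecting Latin-1 displays

print(garbled)  # a meaningless run of accented characters
```

Decoding with the wrong table never "fails" here, because every byte is a valid Latin-1 character; the reader simply sees nonsense.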

Enter Unicode

The Unicode project began in 1987, when a small group of engineers from Xerox and Apple recognized the chaos and decided to fix it; the Unicode Consortium was incorporated in 1991 to steward the work. Their goal was audacious: create a single character set that could represent every character in every writing system, past and present.

The Unicode Standard was first published in 1991 as Unicode 1.0, covering 7,161 characters. Today, Unicode 16.0 (released September 2024) contains 154,998 characters spanning 168 scripts, plus thousands of symbols, emoji, and technical characters.

The Unicode Consortium — a nonprofit organization whose members include Apple, Google, Microsoft, Meta, Adobe, IBM, and many others — continues to develop and maintain the standard.

Code Points: Unicode's Core Concept

The most fundamental concept in Unicode is the code point. A code point is a number assigned to a specific character. Unicode defines a space of 1,114,112 possible code points, ranging from 0 to 1,114,111 (or in hexadecimal, 0 to 10FFFF).

Code points are written in the U+ notation: a capital U followed by a plus sign and the hexadecimal value of the code point, padded to at least four digits.

Character   Code Point   Description
A           U+0041       Latin Capital Letter A
é           U+00E9       Latin Small Letter E with Acute
中          U+4E2D       CJK Unified Ideograph (Chinese: "middle")
😀          U+1F600      Grinning Face emoji
♠           U+2660       Black Spade Suit
ℕ           U+2115       Double-Struck Capital N (math)
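The U+ notation can also be produced programmatically. A short Python sketch (u_plus is a hypothetical helper, not a standard function):

```python
def u_plus(ch: str) -> str:
    """Format a character's code point in U+ notation (hypothetical helper)."""
    # hex, upper case, zero-padded to at least four digits
    return f"U+{ord(ch):04X}"

print(u_plus("A"))   # U+0041
print(u_plus("é"))   # U+00E9
print(u_plus("😀"))  # U+1F600
```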

Every Unicode character has exactly one code point. The mapping is stable — once a character is assigned a code point, that assignment never changes. This stability guarantee is fundamental to Unicode's usefulness.

You can look up any character's code point using our Unicode Lookup tool.

Character vs. Encoding: A Critical Distinction

This is where many developers get confused: the Unicode character set and the Unicode encodings are different things.

Unicode assigns code points to characters. It says "the letter A is U+0041." It does not specify how that number is stored in memory or on disk. That is the job of an encoding.

Think of it this way: the number 65 can be written on paper in decimal ("65"), binary ("01000001"), or hexadecimal ("41"). The number is abstract; its written form depends on the notation you choose. Similarly, the code point U+0041 can be stored in several different ways:

  • UTF-8: 1 byte (0x41) for ASCII-range characters, 2-4 bytes for others
  • UTF-16: 2 or 4 bytes
  • UTF-32: always 4 bytes

These are all encodings of the same Unicode character set. We cover them in detail in Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared.
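The difference in byte counts is easy to observe in Python, whose str.encode supports all three families. A quick sketch (the -le variants are used to skip the byte-order mark):

```python
# Byte counts for the same characters under each encoding
for ch in ("A", "é", "中", "😀"):
    sizes = {enc: len(ch.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(ch, sizes)

# "A" is 1 byte in UTF-8, while "😀" needs 4 bytes in UTF-8 and UTF-16 alike
```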

What Unicode Covers

Unicode is not just "ASCII plus accents." Its scope is genuinely global and historical:

Modern Scripts All major modern writing systems are included: Latin, Cyrillic, Greek, Arabic, Hebrew, Devanagari (Hindi/Sanskrit), Bengali, Chinese (Simplified and Traditional), Japanese (Hiragana, Katakana, Kanji), Korean (Hangul), Tamil, Telugu, Thai, Tibetan, and dozens more.

Historic Scripts Unicode includes many ancient scripts: Egyptian Hieroglyphs, Cuneiform, Linear B, Gothic, Runic, Phoenician, and others. These are primarily used by scholars and researchers.

Symbols and Notation Mathematical operators (∑, ∫, ∞), currency symbols (€, ¥, £, ₿), musical notation (𝄞, ♩), chess pieces (♔, ♛), playing cards, and specialized technical symbols all have code points.

Emoji As of Unicode 16.0, there are over 3,700 emoji-related code points, including base emoji, skin tone modifiers, and Zero Width Joiner sequences. We explore emoji in depth in How Emoji Work in Unicode.
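One of those Zero Width Joiner sequences can be built by hand. A Python sketch, assuming your platform's fonts support the sequence:

```python
# man + ZWJ + woman + ZWJ + girl renders as a single "family" glyph
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(family)       # 👨‍👩‍👧 on platforms that support the sequence
print(len(family))  # 5 code points, even though it displays as one glyph
```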

Control Characters Non-printing characters inherited from ASCII (tab, newline, carriage return), plus Unicode-specific control characters for bidirectional text, zero-width joiners, and variation selectors.

How Code Points Are Organized

Unicode does not dump all 154,000+ characters into a flat list. They are organized into a hierarchy:

  • Planes: 17 groups of 65,536 code points each
  • Blocks: Named ranges within planes (e.g., "Basic Latin," "Arabic," "Emoticons")
  • Character properties: Each character has metadata — its General Category, Script, Bidi class, and more

The most important plane is Plane 0, the Basic Multilingual Plane (BMP), which contains the characters used by most modern languages. We explain the full structure in Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized.
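Because each plane spans exactly 0x10000 code points, a character's plane falls out of integer division. A quick Python sketch (plane_of is a hypothetical helper):

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number of a character (hypothetical helper)."""
    return ord(ch) // 0x10000  # 65,536 code points per plane

print(plane_of("A"))   # 0 (Basic Multilingual Plane)
print(plane_of("😀"))  # 1 (Supplementary Multilingual Plane)
```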

Unicode in Practice

In Python

Python 3 strings are Unicode by default. The ord() function returns a character's code point, and chr() converts back:

>>> ord('A')
65
>>> ord('é')
233
>>> ord('中')
20013
>>> chr(0x1F600)
'😀'

# Unicode escape in string literals
>>> '\u00e9'
'é'
>>> '\U0001F600'
'😀'

# Get Unicode name
>>> import unicodedata
>>> unicodedata.name('é')
'LATIN SMALL LETTER E WITH ACUTE'
>>> unicodedata.name('😀')
'GRINNING FACE'

In JavaScript

JavaScript strings use UTF-16 internally. The codePointAt() method (ES6+) returns the correct code point even for characters outside the BMP:

// Basic code point access
'A'.codePointAt(0);      // 65
'é'.codePointAt(0);      // 233
'😀'.codePointAt(0);     // 128512 (0x1F600)

// Convert code point to character
String.fromCodePoint(0x1F600);  // '😀'

// Unicode escape in string literals
'\u00e9'          // 'é'
'\u{1F600}'       // '😀' (ES6 syntax, supports code points > U+FFFF)

In HTML

HTML files should declare their encoding in the <meta> tag and be saved as UTF-8:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Unicode Example</title>
</head>
<body>
  <!-- Direct UTF-8 characters -->
  <p>café, 中文, العربية, 日本語</p>

  <!-- HTML numeric character references (decimal) -->
  <p>&#233; &#20013; &#128512;</p>

  <!-- HTML numeric character references (hex) -->
  <p>&#xE9; &#x4E2D; &#x1F600;</p>
</body>
</html>

Why Developers Must Understand Unicode

Even with UTF-8 as the dominant encoding, developers still encounter Unicode-related issues:

String length confusion: In many languages, .length counts code units, not characters. The string "😀" has a .length of 2 in JavaScript because emoji above U+FFFF require two UTF-16 code units (a surrogate pair). Naive string manipulation can split characters in half.
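Python's len counts code points rather than UTF-16 code units, so the same pitfall only surfaces there once you encode. A quick sketch:

```python
s = "😀"
print(len(s))  # 1: Python strings count code points

# Each UTF-16 code unit is 2 bytes; 😀 needs two units (a surrogate pair)
utf16_units = len(s.encode("utf-16-le")) // 2
print(utf16_units)  # 2: what JavaScript's .length would report
```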

Normalization bugs: The character é can be represented as a single code point (U+00E9) or as two code points (e + combining acute accent, U+0065 + U+0301). These look identical but are byte-different, causing comparison failures. See Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained.
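A minimal sketch with the standard unicodedata module shows both the bug and the usual fix (normalize both sides before comparing):

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)  # False: visually identical, byte-different
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```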

Sorting and collation: Alphabetical order is locale-dependent. In Swedish, ä sorts after z. Simple byte-order sorting produces wrong results for non-ASCII text.
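A quick Python sketch of the naive behavior (a proper fix would use locale-aware collation, for example locale.strxfrm or an ICU binding, which this sketch deliberately omits):

```python
# Plain sorted() compares code points, so é (U+00E9) lands after z (U+007A)
words = ["zebra", "éclair", "apple"]

print(sorted(words))  # ['apple', 'zebra', 'éclair']: éclair sorts last
```

A French dictionary would place "éclair" before "zebra"; code-point order does not.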

Security issues: Unicode contains visually similar characters (homoglyphs) that can be used in phishing attacks — for example, using Cyrillic а (U+0430) instead of Latin a (U+0061) in a domain name.
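A Python sketch of how such a spoof can be unmasked, using the character names from the Unicode database:

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, visually identical in many fonts

print(latin_a == cyrillic_a)         # False: different code points
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```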

Bidirectional text: Arabic and Hebrew read right-to-left. Mixing RTL and LTR text requires understanding Unicode's Bidirectional Algorithm. See Bidirectional Text in Unicode.

The Unicode Consortium Today

The Consortium operates through working groups that handle different aspects of the standard:

  • UTC (Unicode Technical Committee): Approves new characters and normative changes
  • CLDR (Common Locale Data Repository): Maintains locale-specific data (date formats, number formats, collation) — covered in Unicode CLDR: The Database Behind Every Localized App
  • ICU (International Components for Unicode): A widely-used C/C++/Java library implementing Unicode algorithms

New characters are added through a formal proposal process. Proposals must demonstrate real-world usage, lack of existing representation, and broad community support. The submission process for emoji — a subset of new character proposals — has become particularly well-known. We describe it in How Emoji Work in Unicode.

Summary

Unicode is the universal character standard that assigns a unique code point (written as U+XXXX) to every character in every writing system. It was created to replace the chaotic patchwork of incompatible legacy encodings. With over 154,000 characters covering 168 scripts, Unicode is the foundation of all modern text handling.

Key concepts to remember:

  • A code point is an abstract number assigned to a character (U+0041 = A)
  • A character is the abstract identity of a symbol
  • An encoding (UTF-8, UTF-16, UTF-32) is how code points are stored as bytes
  • Unicode and UTF-8 are not the same thing — Unicode is the character set, UTF-8 is one encoding of it

Use our Unicode Lookup tool to explore code points, and our Character Counter to analyze the Unicode properties of any text.


Next in Series: Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized — Explore the 17 planes and hundreds of named blocks that give structure to Unicode's vast character space.
