SymbolFYI

The History of Unicode: From Babel to a Universal Character Set

History · History of Symbols · November 28, 2023

In 1987, two engineers — one from Xerox, one from Apple — started meeting informally to work on a problem that the computing industry had quietly agreed to ignore: the world's writing systems were a mess, and nobody in charge seemed to want to fix it.

Thirty-seven years later, the system they helped create encodes over 154,000 characters covering 168 scripts, including ancient languages that hadn't been written in centuries. Unicode is the invisible infrastructure that allows a single document to contain English, Arabic, Chinese, mathematical notation, ancient Egyptian hieroglyphs, and emoji — all simultaneously, without ambiguity.

This is the story of how that happened, and why it nearly didn't.

The Pre-Unicode Chaos

To appreciate what Unicode solved, you need to understand the specific kind of disorder it replaced.

By the mid-1980s, the computing world had fragmented into dozens of incompatible character encoding systems. ASCII (1963) covered American English. The ISO 8859 series created separate 8-bit encodings for Western European, Eastern European, Cyrillic, Arabic, Hebrew, and other regional scripts — each incompatible with the others, each reusing the 128 code points above ASCII for different purposes.

IBM's mainframes used EBCDIC. East Asian computing was particularly fragmented: Japanese text required either Shift-JIS (developed by Microsoft and ASCII Corporation) or EUC-JP (developed at AT&T), two mutually incompatible encodings for the same language. Chinese text had GB2312 (simplified) and Big5 (traditional), both double-byte encodings that treated some byte values as the first byte of a two-byte sequence. In Big5 (as in Shift-JIS), the second byte of a character could fall in the ASCII range, so byte-oriented software could mistake part of a character for an ASCII special character such as the backslash (0x5C), a source of infamous bugs.

The practical consequences were severe. A Japanese email sent from one system would arrive as unreadable garbage on another. Web pages in the early 1990s required explicit charset declarations and still often displayed incorrectly. Database systems that needed to store multilingual content required elaborate workarounds. International software needed to be rebuilt, not just translated, for each target market.

The technical name for the resulting scrambled text — characters displayed incorrectly because the encoding was misidentified — is mojibake (文字化け, Japanese for "character transformation"). It was a daily occurrence in international computing throughout the 1980s and into the 1990s.
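Mojibake is easy to reproduce: decode bytes using a different encoding than the one they were written with. A minimal sketch in Node.js, using the built-in Buffer API:

```javascript
// UTF-8 bytes for "café": 63 61 66 C3 A9 (the é takes two bytes).
const utf8Bytes = Buffer.from('café', 'utf8');

// A receiver that wrongly assumes Latin-1 maps each byte to one character:
const garbled = utf8Bytes.toString('latin1');
console.log(garbled); // "cafÃ©", classic mojibake

// Because no information was destroyed, the damage is reversible
// once the mistake is identified:
const repaired = Buffer.from(garbled, 'latin1').toString('utf8');
console.log(repaired); // "café"
```

The reverse error (Latin-1 bytes decoded as UTF-8) is worse: byte sequences that are invalid UTF-8 get replaced with U+FFFD (�), and the original text is unrecoverable.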

The Double-Byte Problem

For East Asian scripts, the problem was more fundamental than just having too many encodings. Kanji (Chinese characters used in Japanese), Chinese hanzi, and Korean hanja together account for tens of thousands of characters — far more than a single byte (256 values) could hold.

The industry response was the double-byte character set (DBCS): a variable-length encoding where some characters were represented by one byte and others by two. These encodings were fragile by design. The parser had to track whether it was reading a first byte or a second byte of a two-byte sequence. If you searched for a substring and happened to start in the middle of a two-byte character, you'd get corrupted results. Copying text required careful handling. Counting characters was ambiguous.

The engineers who worked with these systems daily knew there had to be a better way.

The Founding Vision

Joe Becker at Xerox and Lee Collins and Mark Davis at Apple began their informal collaboration in 1987; the following year Becker circulated a document titled "Unicode 88", a proposal for a new universal character encoding. Becker's contribution was the conceptual framework: a 16-bit encoding that would provide 65,536 code points, enough (the thinking went at the time) for all the world's languages.

The name "Unicode" was Becker's coinage. In his proposal, he wrote: "Unicode is intended to address the need for a workable, reliable world text encoding." The prefix "uni" carried a triple meaning: universal, uniform (fixed-width), and unique (each character gets exactly one code point).

The early design made a bet that would later need to be corrected: that 65,536 characters would be enough. The 16-bit fixed-width design — every character is exactly two bytes — was appealing for its simplicity. Fixed-width encoding made string indexing by character position trivial, which mattered enormously for software performance.

ISO's Parallel Effort

Meanwhile, the International Organization for Standardization (ISO) was working on its own universal character set: ISO/IEC 10646. The ISO effort was broader in scope and slower in pace, organized through national standards bodies with all the bureaucratic overhead that implies.

In 1991, the Unicode Consortium (formally incorporated in California) and the ISO 10646 working group agreed to synchronize their efforts. This was a pragmatic merger: the Unicode team had industry momentum and engineering talent, while ISO had international legitimacy and the participation of national standards bodies that would be needed for formal adoption. The two standards have been maintained in sync ever since — every Unicode code point has a corresponding ISO 10646 assignment.

Unicode 1.0: A Deliberate Act of Reduction

Unicode 1.0 was published in October 1991. It contained 7,161 characters.

The number seems small for a universal character set, and it was. The deliberate approach was to start with the scripts needed for modern computing and expand from there, rather than attempting to encode every historical writing system simultaneously.

More controversially, Unicode 1.0 made a decision that remains debated to this day: CJK Unification (also called Han Unification). Chinese, Japanese, and Korean writing systems share a large pool of characters with common ancestry — the same character might be written slightly differently in Chinese, Japanese, and Korean typography, but the underlying meaning and form are the same. Unicode decided to assign a single code point to these unified characters rather than separate code points for each regional variant.

The practical argument was compelling: encoding all CJK variants separately would require tens of thousands of additional code points just for typographic distinctions. The counter-argument, which came especially from Japanese users, was that the distinctions were linguistically and culturally meaningful, not merely typographic. A Japanese reader might find it jarring to see Japanese text rendered with Chinese-style glyphs.

The CJK Unified Ideographs block (U+4E00–U+9FFF and extensions) remains a standing illustration of the tradeoffs involved in designing a universal character set: pure technical efficiency versus cultural and linguistic specificity.

The Surrogate Pair Problem: When 65,536 Wasn't Enough

By the mid-1990s, it was becoming clear that 65,536 code points — the capacity of a 16-bit system — would not be sufficient after all.

The culprit was primarily CJK characters. After the initial CJK Unified Ideographs block, researchers identified tens of thousands of additional characters used in historical texts, rare literary works, and minority languages. There were also historical scripts to consider: Linear B, Egyptian hieroglyphs, various ancient writing systems that had never been encoded anywhere. And then there was the question of musical notation, mathematical symbols, and other specialized domains.

Unicode 2.0 (1996) introduced the solution: the surrogate pair mechanism. The code space was extended to seventeen "planes": Plane 0 (the Basic Multilingual Plane, or BMP, U+0000–U+FFFF) for the most common characters, and Planes 1–16 (the "supplementary planes", U+10000–U+10FFFF) for everything else. Two ranges in the BMP, U+D800–U+DBFF (high surrogates) and U+DC00–U+DFFF (low surrogates), were reserved to encode supplementary-plane characters as pairs in UTF-16.

This was not a clean solution. The fixed-width simplicity of the original 16-bit design was gone. UTF-16 (the encoding that emerged from the 16-bit Unicode design) is now variable-width: BMP characters are two bytes, supplementary characters are four bytes (a surrogate pair). This means that string indexing by code unit no longer equals indexing by character, which caused and continues to cause bugs in software that doesn't handle surrogates correctly.
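The surrogate arithmetic itself is simple: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch:

```javascript
// Encode a supplementary-plane code point (U+10000 to U+10FFFF)
// as a UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const v = cp - 0x10000;            // 20 bits remain
  const high = 0xD800 + (v >> 10);   // top 10 bits
  const low  = 0xDC00 + (v & 0x3FF); // bottom 10 bits
  return [high, low];
}

// And back:
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// U+1F600 (😀) becomes the pair D83D DE00:
console.log(toSurrogatePair(0x1F600).map(u => u.toString(16)));
console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16));
```

Note that the reserved ranges make UTF-16 self-describing: a lone code unit can always be classified as BMP character, high surrogate, or low surrogate by its value alone.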

The JavaScript string model, which exposes strings as sequences of UTF-16 code units, still trips up developers who encounter emoji or rare CJK characters. A single emoji might have a .length of 2 in JavaScript because it's encoded as a surrogate pair.
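The mismatch is easy to observe in any JavaScript runtime:

```javascript
const face = '😀'; // U+1F600, outside the BMP

console.log(face.length);       // 2: .length counts UTF-16 code units
console.log([...face].length);  // 1: the string iterator walks code points
console.log(face.codePointAt(0).toString(16)); // '1f600'
console.log(face.charCodeAt(0).toString(16));  // 'd83d', the high surrogate

// Slicing by code unit can split a surrogate pair, leaving an
// unpaired half that is not a valid character on its own:
const broken = face.slice(0, 1); // '\uD83D'
console.log(broken.length);      // 1
```

Code that needs to count or split user-visible characters has to iterate by code point (or, for emoji sequences, by grapheme cluster) rather than index by code unit.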

UTF-8: The Encoding That Won

In parallel with the 16-bit BMP and surrogate pair developments, Ken Thompson and Rob Pike of Bell Labs (Thompson a co-creator of Unix; both later central to Go) designed UTF-8 in 1992, reportedly sketching it on a diner placemat. UTF-8 is a variable-width encoding for the full Unicode code point space:

  • Code points U+0000–U+007F (ASCII): 1 byte, identical to ASCII
  • Code points U+0080–U+07FF: 2 bytes
  • Code points U+0800–U+FFFF: 3 bytes
  • Code points U+10000–U+10FFFF: 4 bytes

The brilliant design choice was backward compatibility with ASCII: any valid ASCII file is also a valid UTF-8 file, and, just as importantly, the bytes of a multi-byte sequence never fall in the ASCII range, so the false-match bugs that plagued DBCS encodings cannot occur. This made adoption far easier: existing software that handled ASCII text would handle UTF-8 text correctly as long as it didn't need to count characters.
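The byte lengths in the list above can be checked with the standard TextEncoder API, which always produces UTF-8:

```javascript
// TextEncoder is available as a global in browsers and Node.js.
const enc = new TextEncoder();

const samples = [
  ['A',  0x41],    // ASCII            -> 1 byte
  ['é',  0xE9],    // Latin-1 range    -> 2 bytes
  ['€',  0x20AC],  // rest of the BMP  -> 3 bytes
  ['😀', 0x1F600], // supplementary    -> 4 bytes
];

for (const [ch, cp] of samples) {
  const bytes = enc.encode(ch);
  console.log(`U+${cp.toString(16).toUpperCase()}: ${bytes.length} byte(s)`);
}
```

One consequence worth noting: for ASCII-heavy text UTF-8 is half the size of UTF-16, while for CJK-heavy text it is larger (three bytes per BMP character instead of two), a tradeoff that fueled early resistance to UTF-8 in East Asia.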

UTF-8 became the dominant encoding for the web, for Unix/Linux systems, for email, and for most modern software. As of 2024, over 98% of web pages use UTF-8. The surrogate pair complexity of UTF-16 is largely hidden from most developers — it's primarily visible in Windows APIs (which use UTF-16 internally) and JavaScript strings.

Key Milestones in Unicode's Growth

Version   Year   Characters   Key Additions
1.0       1991   7,161        Latin, Greek, Cyrillic, CJK basics
2.0       1996   38,885       Surrogate pairs, Tibetan, extended CJK
3.0       1999   49,194       Sinhala, Khmer, various historical scripts
4.0       2003   96,382       Gothic, Cypriot, Limbu
5.0       2006   99,089       N'Ko, Buginese, Coptic
6.0       2010   109,449      Emoji (first standardized batch), Mandaic
7.0       2014   113,021      Caucasian Albanian, Linear A
8.0       2015   120,672      Emoji skin tone modifiers, Hatran
10.0      2017   136,690      Bitcoin sign (₿), Masaram Gondi
13.0      2020   143,859      Yezidi, Chorasmian
15.0      2022   149,186      Kaktovik numerals, Kawi
16.0      2024   154,998      Egyptian hieroglyph formatting controls, Garay

The Consortium's Governance

The Unicode Consortium is a non-profit membership organization headquartered in Mountain View, California. Its voting members include major technology companies — Google, Apple, Microsoft, Meta, Adobe, Huawei, IBM — as well as universities, governments, and individual members.

The process for adding new characters is deliberately conservative. A character proposal must demonstrate that the character is in attested use in real texts (historical use counts, but invented or purely theoretical characters do not), provide evidence of the writing system's user community, and show that the character cannot be represented through combinations of existing characters. Proposals go through technical committees, public review periods, and ballot votes before characters are added.

This conservatism is intentional: code points, once assigned, are permanent. Unicode has a strict stability policy — a character's assigned code point will never change, will never be removed, and the character's fundamental properties (its script assignment, its category, its bidirectionality) will never be altered in ways that would break existing text.

The permanence policy creates its own challenges. Characters that were assigned with incorrect properties in early Unicode versions carry those properties forever, with workarounds added on top. Tibetan is the cautionary tale: the block in Unicode 1.0 was based on an incomplete understanding of the script and, before the stability policy existed, was removed in version 1.1 and re-encoded in 2.0, exactly the kind of disruption the policy now forbids.

Looking Forward

Unicode 16.0 (September 2024) added 5,185 characters, bringing the total to 154,998. Active proposals under consideration include additional historical scripts, additional CJK characters from historical sources, and various specialized technical symbols.

The truly unfinished work is computational: making Unicode behave correctly in all contexts. Case folding (for case-insensitive matching), normalization (for combining characters), collation (for sorting), line breaking, and bidirectional text algorithms are all specified in Unicode's auxiliary standards, and all of them have edge cases that trip up software. The Unicode standard is not just a code point table — it's a comprehensive specification for text processing that most software implements only partially.
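Normalization is the piece most application developers meet first: the same visible text can be stored as different code point sequences. A sketch using the built-in String.prototype.normalize:

```javascript
// 'é' can be one code point (U+00E9, precomposed) or two
// (U+0065 'e' followed by U+0301, combining acute accent).
const precomposed = '\u00E9';
const decomposed  = 'e\u0301';

console.log(precomposed === decomposed); // false: different code points
console.log(precomposed.length, decomposed.length); // 1 2

// Normalizing both sides to the same form makes them comparable.
// NFC composes; NFD decomposes:
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // true
console.log(precomposed.normalize('NFD') === decomposed.normalize('NFD')); // true
```

Any system that compares, hashes, or deduplicates user-supplied text (usernames, filenames, search queries) has to normalize first, or identical-looking strings will silently fail to match.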

The Consortium's longer-term challenge is the same as when it was founded: bringing writing systems used by small or marginalized communities into the standard before those communities lose the institutional capacity to make the case for their own scripts. Dozens of scripts are in active use today that are not yet in Unicode.

You can look up any Unicode character, explore its properties, and see its encoding in multiple formats using our Unicode Lookup tool.


Next in Series: Unicode solved the problem of encoding the world's languages — but then it was asked to encode something its founders never anticipated: pictures. Read how a set of 176 symbols designed for Japanese pagers became a global phenomenon in The History of Emoji: From Japanese Pagers to Universal Language.
