How Emoji Work in Unicode: From Code Points to Skin Tones

Unicode Deep Dive Unicode Deep Dive मई 16, 2023

○ 1. What Is Unicode? The Universal Character Standard Explained
○ 2. Unicode Planes and Blocks: How 1.1 Million Code Points Are Organized
○ 3. Unicode Encodings Explained: UTF-8, UTF-16, and UTF-32 Compared
○ 4. Unicode Normalization: NFC, NFD, NFKC, and NFKD Explained
○ 5. Unicode Properties and Categories: Classifying Every Character
○ 6. Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist
● 7. How Emoji Work in Unicode: From Code Points to Skin Tones
○ 8. CJK Unification: How Unicode Handles Chinese, Japanese, and Korean
○ 9. Unicode Version History: From 1.0 to 16.0 and Beyond
○ 10. Unicode CLDR: The Database Behind Every Localized App

विषय सूची

Emoji are the most visible and culturally significant addition to Unicode in the past decade. What started as a small set of 176 icons from a Japanese carrier became, in a remarkably short time, a global visual language used by billions of people every day. But behind the cheerful faces and colorful symbols lies a surprisingly intricate encoding system — one that has stretched Unicode's original design in creative and occasionally awkward directions.

A Brief History

Emoji did not originate with Unicode. They were created by Shigetaka Kurita in 1999 for NTT DoCoMo, a Japanese mobile carrier, as part of its i-mode mobile internet service. The original 176 emoji were 12×12 pixel images encoded in a proprietary system.

Japanese carriers (DoCoMo, KDDI, SoftBank) each developed their own incompatible emoji sets. When these phones began to interact internationally, and as other manufacturers entered the market, the incompatibility became a problem. A smiley face sent from a SoftBank phone might arrive as a garbage character on a DoCoMo device.

Google and Apple pushed for Unicode to standardize emoji when building their mobile platforms. In Unicode 6.0 (2010), the first batch of 722 emoji-related characters was added to the standard. This was the inflection point that made emoji truly universal.

As of Unicode 16.0 (2024), there are 3,790 emoji in the Unicode standard, though the number of distinct visual representations is higher when you count modifier combinations.

Basic Emoji Encoding

Most emoji are straightforward: a single Unicode code point maps to an emoji character. These live primarily in the Supplementary Multilingual Plane (Plane 1) and the BMP.

emoji_examples = [
    ('\U0001F600', 'GRINNING FACE'),
    ('\U0001F914', 'THINKING FACE'),
    ('\u2764',     'HEAVY BLACK HEART'),     # BMP emoji
    ('\U0001F004', 'MAHJONG TILE RED DRAGON'),
    ('\U0001F1FA\U0001F1F8', 'US flag'),     # Multi-code-point
]

import unicodedata
for char, desc in emoji_examples:
    print(f'{char} — {desc}')
    for c in char:
        print(f'  U+{ord(c):04X} {unicodedata.name(c, "?")}')

Note that the heart ❤ (U+2764) is in the BMP — it was in Unicode long before emoji, as a dingbat character. Many emoji were "retrofitted" from existing characters.

Text vs. Emoji Presentation: Variation Selectors

Here is something that surprises many developers: some characters have two valid forms — a text form and an emoji form. Whether a character displays as monochrome text or as a colorful emoji depends on a Variation Selector.

VS-15 (U+FE0E, VARIATION SELECTOR-15): Forces text presentation
VS-16 (U+FE0F, VARIATION SELECTOR-16): Forces emoji presentation

Sequence	Display	Meaning
❤ (U+2764 alone)	❤	Heart, default presentation (text)
❤️ (U+2764 + VS-16)	❤️	Heart, emoji presentation
❤︎ (U+2764 + VS-15)	❤︎	Heart, text presentation (forced)
☎ (U+260E alone)	☎	Telephone, default (text in many fonts)
☎️ (U+260E + VS-16)	☎️	Telephone, emoji presentation

This is why your emoji picker might insert what looks like a single character but is actually two code points.

const heart = '❤';           // Just U+2764
const heartEmoji = '❤️';     // U+2764 + U+FE0F

console.log(heart.length);      // 1
console.log(heartEmoji.length); // 2 (or 3 if surrogate pair + VS)

// Correct length: use spread to count grapheme clusters
console.log([...heartEmoji].length);  // 2 code points

Python and the grapheme package handle this properly:

# pip install grapheme
import grapheme

text = "I ❤️ you"  # Contains heart + VS-16
print(len(text))                    # 8 (raw code units)
print(grapheme.length(text))        # 7 (grapheme clusters)

Skin Tone Modifiers

When emoji featuring people were introduced, skin tone diversity became an obvious need. Unicode 8.0 (2015) introduced five Fitzpatrick scale modifiers based on the dermatological Fitzpatrick skin type system:

Modifier	Code Point	Skin Tone
🏻	U+1F3FB	Light (Type I-II)
🏼	U+1F3FC	Medium-Light (Type III)
🏽	U+1F3FD	Medium (Type IV)
🏾	U+1F3FE	Medium-Dark (Type V)
🏿	U+1F3FF	Dark (Type VI)

A skin tone modifier is appended immediately after any Emoji Modifier Base character (a character with the Emoji_Modifier_Base property). The pair forms a single grapheme cluster.

# Base person emoji + skin tone modifier
base = '\U0001F44B'      # 👋 WAVING HAND SIGN
medium = '\U0001F3FD'    # 🏽 MEDIUM SKIN TONE

combined = base + medium  # 👋🏽

import unicodedata
print(unicodedata.name(base))    # WAVING HAND SIGN
print(unicodedata.name(medium))  # EMOJI MODIFIER FITZPATRICK TYPE-4

# It's two code points but one grapheme cluster
print(len(combined))             # 4 (two surrogate pairs in Python's UTF-16-like storage)
# Actually Python 3 strings are native Unicode, so:
print(len(combined))             # 2 (two code points)

# Grapheme cluster count
import grapheme
print(grapheme.length(combined)) # 1 (one visible emoji)

In JavaScript:

const hand = '👋';         // U+1F44B
const tone = '\u{1F3FD}';  // Medium skin tone modifier

const combined = hand + tone;  // '👋🏽'
console.log(combined.length);  // 4 (two surrogate pairs = 4 code units!)
console.log([...combined].length);  // 2 (two code points)

// For grapheme-aware processing, use the Intl.Segmenter API
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...segmenter.segment(combined)];
console.log(segments.length);  // 1 (one grapheme cluster)

Zero Width Joiner Sequences

The most sophisticated emoji encoding mechanism is the Zero Width Joiner (ZWJ, U+200D) sequence. ZWJ is an invisible character that, when placed between two emoji, instructs rendering systems to combine them into a single composite emoji.

The best-known example is family emoji:

👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦 = 👨‍👩‍👧‍👦

This family emoji is encoded as:

U+1F468 (MAN)
U+200D  (ZWJ)
U+1F469 (WOMAN)
U+200D  (ZWJ)
U+1F467 (GIRL)
U+200D  (ZWJ)
U+1F466 (BOY)

Seven code points, one visible character.

man   = '\U0001F468'
woman = '\U0001F469'
girl  = '\U0001F467'
boy   = '\U0001F466'
zwj   = '\u200D'

family = man + zwj + woman + zwj + girl + zwj + boy
print(family)  # 👨‍👩‍👧‍👦

print(len(family))  # 7 (code points: 4 people + 3 ZWJs)
print(family.count(zwj))  # 3

import grapheme
print(grapheme.length(family))  # 1 (one grapheme cluster)

ZWJ sequences enable combinatorial flexibility without requiring a separate code point for each combination. Consider profession emoji:

👨‍💻 = MAN + ZWJ + LAPTOP (U+1F468 + U+200D + U+1F4BB)
👩‍💻 = WOMAN + ZWJ + LAPTOP (U+1F469 + U+200D + U+1F4BB)
🧑‍💻 = PERSON + ZWJ + LAPTOP (U+1F9D1 + U+200D + U+1F4BB)

Skin tone modifiers can be added to each person emoji in the sequence: - 👨🏽‍💻 = MAN + MEDIUM-TONE + ZWJ + LAPTOP (4 code points) - 👫🏽 = WOMAN + ZWJ + MAN sequences with tone modifiers get more complex

The couple with heart with different skin tones:

👩🏿‍❤️‍💋‍👨🏼

This single visible emoji is encoded as 8 code points: WOMAN, DARK-SKIN, ZWJ, HEAVY HEART, VS-16, ZWJ, KISS MARK, ZWJ, MAN, MEDIUM-LIGHT-SKIN. It illustrates how far the ZWJ system has been pushed.

Flag Emoji

National flag emoji are encoded using pairs of Regional Indicator symbols from the Supplementary Special-purpose Plane (see Unicode Planes and Blocks):

U+1F1E6 through U+1F1FF: Regional Indicator Symbols A through Z

The flag for the United States 🇺🇸 is:

U+1F1FA (Regional Indicator U) + U+1F1F8 (Regional Indicator S)

A rendering system that recognizes the two-letter ISO 3166-1 country code maps it to the appropriate flag image. Rendering systems that do not recognize the code display the two letters (🇺🇸 → US).

This design means: - Any recognized ISO 3166 code can become a flag emoji without Unicode adding new characters - Support varies widely by platform (Apple, Google, and Microsoft all render flags differently or incompletely) - Some country codes (like TX for Texas, which is not a country) have been proposed but are not in the standard

Tag Sequences for Subdivisions

For subnational flags (Scotland 🏴󠁧󠁢󠁳󠁣󠁴󠁿, England 🏴󠁧󠁢󠁥󠁮󠁧󠁿, Wales 🏴󠁧󠁢󠁷󠁬󠁳󠁿), Unicode uses a different mechanism: tag sequences using invisible characters from the Tags block (Plane 14, U+E0000–U+E007F).

The Scottish flag is:

U+1F3F4 (BLACK FLAG) + U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 (tag: "gbsct") + U+E007F (CANCEL TAG)

This is an even more obscure mechanism — only a handful of subdivision flags are in the official RGI (Recommended for General Interchange) set.

Grapheme Clusters and String Operations

The practical impact of ZWJ sequences and modifiers is that character counting is non-trivial. A user who types "Hello 👨‍👩‍👧‍👦!" expects a character count of 8. A naive len() or .length will give a much higher number.

Python: Grapheme Segmentation

# pip install grapheme
import grapheme

text = "Hello 👨‍👩‍👧‍👦!"

print(len(text))                # Wrong: counts code points
print(grapheme.length(text))    # Correct: counts user-perceived characters

# Iterate grapheme clusters
for cluster in grapheme.graphemes(text):
    print(repr(cluster))
# 'H', 'e', 'l', 'l', 'o', ' ', '👨\u200d👩\u200d👧\u200d👦', '!'

# Safely slice text by grapheme count
def grapheme_slice(text, start, stop):
    clusters = list(grapheme.graphemes(text))
    return ''.join(clusters[start:stop])

print(grapheme_slice("Hello 👨‍👩‍👧‍👦!", 6, 7))  # '👨‍👩‍👧‍👦'

JavaScript: Intl.Segmenter

const text = "Hello 👨‍👩‍👧‍👦!";

// Wrong: code unit count
console.log(text.length);  // 17

// Wrong: code point count
console.log([...text].length);  // 10

// Correct: grapheme cluster count (Intl.Segmenter, Node.js 16+ / modern browsers)
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const clusters = [...segmenter.segment(text)];
console.log(clusters.length);  // 8

// Safe slicing
function graphemeSlice(text, start, end) {
    const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
    const segs = [...segmenter.segment(text)];
    return segs.slice(start, end).map(s => s.segment).join('');
}

console.log(graphemeSlice(text, 6, 7));  // '👨‍👩‍👧‍👦'

The RGI Set

Not every possible ZWJ combination is officially recognized. The RGI (Recommended for General Interchange) set, maintained in Unicode's emoji data files, defines which ZWJ sequences and modifier combinations are standardized and should be supported by conformant implementations.

Non-RGI ZWJ sequences are technically valid Unicode but may display as their constituent parts on most platforms. For example, a non-standard ZWJ sequence might show as 👨‍🍕 (man + ZWJ + pizza) even if no platform renders it as a single "pizza delivery person" emoji.

The Emoji Submission Process

Adding a new emoji to Unicode requires a formal proposal to the Unicode Consortium. The process:

Submit a proposal with justification, expected usage, visual distinctiveness
Emoji Subcommittee review: Applies criteria including frequency, distinctiveness, multiplatform usage, completeness of emoji usage
UTC vote: The Unicode Technical Committee must approve it
Vendor implementation: Major vendors (Apple, Google, Microsoft, Samsung, Twitter, Facebook) implement the character in their emoji fonts
Unicode release: Character appears in the next Unicode version

Key criteria for acceptance: - Distinctiveness: Cannot be represented by existing emoji - Completeness: Fills a logical gap in an existing sequence - Expected usage level: High anticipated use - Cross-compatibility: Makes sense across platforms and cultures - No logos: Brand logos are not accepted (₿ as a currency symbol is different from a brand emoji)

The process takes roughly 2 years from proposal to widespread availability.

Exploring Emoji in SymbolFYI

Use our Unicode Lookup tool to enter any emoji and see its full code point breakdown — including ZWJ sequences, variation selectors, and modifier components. The Character Counter correctly counts grapheme clusters, so you can see the difference between code point count and user-perceived character count for emoji-heavy text.

Summary

Emoji in Unicode use several encoding mechanisms: - Single code points: Most simple emoji (😀, 🎉, 🌍) - Variation Selectors (VS-15/VS-16): Switch between text and emoji presentation - Fitzpatrick modifiers: 5 skin tone modifiers applied to person/body emoji - ZWJ sequences: Invisible U+200D joins emoji into compound characters (families, professions) - Regional Indicators: Country flags from two-letter ISO codes - Tag sequences: Subdivision flags (Scotland, England, Wales)

The practical implication: never use .length or len() to count characters in text that may contain emoji. Always use a grapheme-aware method (Intl.Segmenter in JavaScript, grapheme library in Python).

Next in Series: CJK Unification: How Unicode Handles Chinese, Japanese, and Korean — Explore the controversial Han Unification decision and how Unicode manages tens of thousands of shared ideographs.