Bidirectional Text in Unicode: How RTL and LTR Scripts Coexist

Unicode Deep Dive · April 25, 2023



Most Western developers write code and work with data in left-to-right scripts. But approximately 420 million people primarily use Arabic, and another 10 million use Hebrew, both of which are written right to left. Add Persian (Farsi) and Urdu, which use the Arabic script, plus Syriac, Thaana, N'Ko, and a dozen other RTL scripts, and a substantial portion of humanity reads text in the opposite horizontal direction from English.

Unicode's approach to this challenge is the Unicode Bidirectional Algorithm (UBA), a specification that determines how characters from different directional scripts are laid out when they appear together on the same line. Getting this right is essential for any application that handles multilingual text.

Directionality Is a Character Property

Every Unicode character has a Bidi Class property — a value that tells the Bidi Algorithm how to treat it. The primary directional classes are:

Bidi Class	Description	Examples
L (Left-to-Right)	Strongly LTR	Latin, Greek, and Cyrillic letters
R (Right-to-Left)	Strongly RTL	Hebrew letters
AL (Arabic Letter)	Strongly RTL (Arabic-type letter)	Arabic, Syriac, Thaana letters
EN (European Number)	Weak: European digits	0–9 (always EN; surrounding text only affects how they are resolved)
AN (Arabic Number)	Weak: Arabic-Indic digits	U+0660–U+0669
NSM (Nonspacing Mark)	Combining marks	Takes the type of its base character
WS (Whitespace)	Whitespace	Space
ON (Other Neutral)	Other neutrals	Most punctuation
LRM/RLM	Directional marks (not classes of their own: LRM has class L, RLM has class R)	U+200E / U+200F
LRE/RLE/LRO/RLO/PDF	Embedding and override controls (each is its own class)	U+202A–U+202E
LRI/RLI/FSI/PDI	Isolate controls (each is its own class)	U+2066–U+2069

The algorithm uses these properties to determine the final visual order of characters on a line.
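Python's standard unicodedata module exposes this property through unicodedata.bidirectional(), so the classes in the table can be checked directly. The snippet below is a small sketch; the helper name bidi_classes and the sample characters are chosen here for illustration only.

```python
import unicodedata

def bidi_classes(text):
    """Return (character name, Bidi Class) pairs for each character in text."""
    return [
        (unicodedata.name(ch, f"U+{ord(ch):04X}"), unicodedata.bidirectional(ch))
        for ch in text
    ]

# Illustrative sample: Latin letter, Hebrew letter, Arabic letter, ASCII digit,
# Arabic-Indic digit, space, exclamation mark, LRM, RLI.
sample = ["a", "\u05D0", "\u0645", "1", "\u0661", " ", "!", "\u200E", "\u2067"]

for name, bidi_class in bidi_classes(sample):
    print(f"{bidi_class:4} {name}")
```

Note that LRM reports the class L rather than a class of its own, while RLI reports RLI.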

The Unicode Bidirectional Algorithm

The UBA is complex — the full specification (Unicode Standard Annex #9) runs to tens of pages — but the essential logic is:

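Detect the paragraph embedding level: Is the base direction LTR (level 0) or RTL (level 1)? Unless markup or a higher-level protocol says otherwise, this comes from the first strong character in the paragraph.
Assign embedding levels to characters: Characters receive levels based on their Bidi Class, any explicit controls, and surrounding context. Even levels are LTR and odd levels are RTL.
Resolve neutral characters: Spaces, punctuation, and other neutral characters take their direction from surrounding strong characters, falling back to the paragraph direction when the two sides disagree.
Reorder for display: On each line, starting from the highest level and working down to the lowest odd level, every contiguous run at that level or higher is reversed.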

A Concrete Example

Consider the mixed string: Hello مرحبا World

In storage (logical order):

H e l l o   م ر ح ب ا   W o r l d
← LTR →    ← RTL →    ← LTR →

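For display, the Arabic run is laid out right to left, so its first letter (م) sits at the right end of the run. Listing the characters in the order they occupy the line from left to right gives: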

Hello ابحرم World

The UBA handles this automatically, but the paragraph direction matters: if the containing paragraph is RTL, the layout is:

World مرحبا Hello
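The four steps listed earlier can be approximated in a few lines of Python. This is a deliberately simplified sketch, not an implementation of UAX #9: it ignores explicit controls, treats digits and every other non-strong class as neutral, and does no mirroring or line breaking. The function name simple_visual_order and its base_level parameter are hypothetical.

```python
import unicodedata

def simple_visual_order(text, base_level=None):
    """Return the characters of text in left-to-right visual order (simplified)."""
    if not text:
        return ""
    # Reduce each character to 'L', 'R', or None (anything that is not strong).
    strong = []
    for ch in text:
        bc = unicodedata.bidirectional(ch)
        strong.append("R" if bc in ("R", "AL") else "L" if bc == "L" else None)

    # Step 1: paragraph level from the first strong character (default LTR).
    if base_level is None:
        first = next((s for s in strong if s), "L")
        base_level = 1 if first == "R" else 0
    base_dir = "R" if base_level else "L"

    # Step 3 (simplified): a run of neutrals takes the direction of its
    # neighbours if both sides agree, otherwise the paragraph direction.
    resolved = list(strong)
    i, n = 0, len(text)
    while i < n:
        if strong[i] is not None:
            i += 1
            continue
        j = i
        while j < n and strong[j] is None:
            j += 1
        before = resolved[i - 1] if i > 0 else base_dir
        after = strong[j] if j < n else base_dir
        fill = before if before == after else base_dir
        for k in range(i, j):
            resolved[k] = fill
        i = j

    # Step 2 (simplified): even levels are LTR, odd levels are RTL.
    if base_level == 0:
        levels = [0 if d == "L" else 1 for d in resolved]
    else:
        levels = [1 if d == "R" else 2 for d in resolved]

    # Step 4: from the highest level down to the lowest odd level,
    # reverse every contiguous run at that level or higher.
    items = list(zip(levels, text))
    highest = max(levels)
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(items):
            if items[i][0] < level:
                i += 1
                continue
            j = i
            while j < len(items) and items[j][0] >= level:
                j += 1
            items[i:j] = items[i:j][::-1]
            i = j
    return "".join(ch for _, ch in items)

arabic = "مرحبا"
text = f"Hello {arabic} World"
print(simple_visual_order(text) == f"Hello {arabic[::-1]} World")                # True
print(simple_visual_order(text, base_level=1) == f"World {arabic[::-1]} Hello")  # True
```

The comparisons use the reversed Arabic string because the function returns characters in the order they occupy the line from left to right, which is what a renderer draws, not what is stored.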

HTML: The `dir` Attribute

HTML provides straightforward control over text direction through the dir attribute:

<!-- Left-to-right paragraph (default in most browsers) -->
<p dir="ltr">Hello World</p>

<!-- Right-to-left paragraph -->
<p dir="rtl">مرحبا بالعالم</p>

<!-- Auto-detect based on first strong character -->
<p dir="auto">مرحبا</p>
<p dir="auto">Hello</p>

<!-- Set on the root element for a full RTL page -->
<html lang="ar" dir="rtl">

<!-- Inline direction change -->
<p>
  The Arabic word for "hello" is
  <span dir="rtl">مرحبا</span>
  which reads right-to-left.
</p>

The dir="auto" value is particularly useful for user-generated content where you do not know in advance whether text will be LTR or RTL.

The `<bdi>` Element

The <bdi> element (Bidirectional Isolation) is specifically designed for user-provided content embedded in surrounding text:

<!-- Without bdi: username could disrupt surrounding text direction -->
<p>User <b>مرحبا123</b> posted a comment.</p>

<!-- With bdi: username is isolated from surrounding text -->
<p>User <bdi>مرحبا123</bdi> posted a comment.</p>

The <bdi> element defaults to dir="auto" and isolates its content, so the direction of the embedded text does not influence the layout of the surrounding paragraph.
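When the markup is generated on the server, the same pattern can be applied in a template helper. The sketch below assumes a hypothetical function name, render_comment_line, and uses only the standard library's html.escape; a real application would normally rely on its template engine's escaping instead.

```python
from html import escape

def render_comment_line(username: str) -> str:
    """Build an HTML fragment with the untrusted username wrapped in <bdi>."""
    # Escape first so the username cannot inject markup, then isolate it
    # so its direction cannot disturb the surrounding sentence.
    return f"<p>User <bdi>{escape(username)}</bdi> posted a comment.</p>"

print(render_comment_line("مرحبا123"))
print(render_comment_line("<script>alert(1)</script>"))
```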

The `<bdo>` Element

The <bdo> element (Bidirectional Override) forces text into a specific direction, overriding the UBA's automatic detection:

<!-- Force RTL display regardless of content -->
<bdo dir="rtl">This text displays backwards</bdo>
<!-- Output: sdrawkcab syalpsid txet sihT -->

<!-- Force LTR in an RTL context (the override also lays out the Hebrew letters left to right) -->
<bdo dir="ltr">כתובת IP: 192.168.1.1</bdo>

CSS: The `direction` Property

CSS provides direction (for block-level directionality) and unicode-bidi (for overrides):

/* Set direction for a block */
.rtl-block {
    direction: rtl;
}

/* Override bidi algorithm for inline content */
.rtl-override {
    direction: rtl;
    unicode-bidi: bidi-override;
}

/* Isolate inline content */
.bdi-like {
    unicode-bidi: isolate;
}

/* Logical properties adapt to direction automatically */
.adaptive {
    /* Instead of: margin-left, padding-right, border-left */
    margin-inline-start: 1rem;
    padding-inline-end: 0.5rem;
    border-inline-start: 2px solid blue;
    text-align: start;  /* Instead of: left */
}

CSS Logical Properties (inline-start, inline-end, block-start, block-end) are the modern approach to building RTL-compatible layouts. They automatically flip when direction: rtl is set, eliminating the need for separate RTL stylesheets.

Practical RTL CSS Architecture

/* Modern RTL-compatible component */
.card {
    display: flex;
    flex-direction: row;  /* Automatically reversed in RTL */
    gap: 1rem;
    padding-inline: 1.5rem;   /* Left+right in LTR, right+left in RTL */
    padding-block: 1rem;       /* Top+bottom (same in both directions) */
    text-align: start;         /* Left in LTR, right in RTL */
    border-inline-start: 4px solid var(--accent-color);
}

/* No need for separate [dir="rtl"] .card { } overrides */

Unicode Bidi Control Characters

Beyond the dir attribute and CSS, Unicode provides invisible control characters that embed directional instructions directly in text. These work in plain text contexts where HTML markup is not available.

Character	Code Point	Name	Purpose
LRM	U+200E	Left-to-Right Mark	Invisible strong LTR character; pulls adjacent neutrals toward LTR
RLM	U+200F	Right-to-Left Mark	Invisible strong RTL character; pulls adjacent neutrals toward RTL
LRE	U+202A	Left-to-Right Embedding	Begin LTR embedded sequence
RLE	U+202B	Right-to-Left Embedding	Begin RTL embedded sequence
LRO	U+202D	Left-to-Right Override	Force LTR, override algorithm
RLO	U+202E	Right-to-Left Override	Force RTL, override algorithm
PDF	U+202C	Pop Directional Formatting	End LRE/RLE/LRO/RLO
LRI	U+2066	Left-to-Right Isolate	Modern: LTR embedded, isolated
RLI	U+2067	Right-to-Left Isolate	Modern: RTL embedded, isolated
FSI	U+2068	First Strong Isolate	Auto-detect, isolated
PDI	U+2069	Pop Directional Isolate	End LRI/RLI/FSI

The isolate characters (LRI, RLI, FSI, PDI, added in Unicode 6.3) are the modern, preferred mechanism. Unlike the older embedding characters, they do not affect the surrounding text's bidi properties, making them safer for user-generated content.
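In plain-text output such as log lines or notification strings, an isolate pair plays the role that <bdi> plays in HTML. A minimal sketch follows; the helper names isolate and comment_notice are hypothetical, and the code points are the ones from the table above.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

# Map a requested direction to the matching isolate initiator.
_INITIATORS = {"ltr": LRI, "rtl": RLI, "auto": FSI}

def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate pair; 'auto' uses FSI (first strong character)."""
    return f"{_INITIATORS[direction]}{text}{PDI}"

def comment_notice(username: str) -> str:
    """Embed an untrusted username in an English sentence without bidi spillover."""
    return f"User {isolate(username)} posted a comment."

print(repr(comment_notice("مرحبا123")))
```

FSI is a reasonable default for user-supplied text because the direction of the inserted value is usually unknown in advance.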

Practical Use: Numbers in RTL Text

A common problem with RTL text is that numbers and punctuation can end up in the wrong position. Consider a Hebrew address:

רחוב הרצל 15, ירושלים

In RTL context, this displays correctly: street name (RTL), number 15, city. But in a plain text system without explicit direction, the comma and number placement can go wrong.

# Python: embed LRM/RLM to fix neutral character direction
address_hebrew = "רחוב הרצל 15, ירושלים"

# Add RLM after the number to ensure comma is treated as RTL punctuation
fixed = "רחוב הרצל 15\u200F, ירושלים"

In JavaScript:

// Format phone numbers for RTL display
function formatPhoneForRTL(phone) {
    // Wrap the phone number in LRM marks so the RTL context does not reorder its parts
    return '\u200E' + phone + '\u200E';
}

// Or use an LRI/PDI isolate pair for better isolation
function isolateLTR(text) {
    return '\u2066' + text + '\u2069';  // LRI ... PDI
}

Detecting Text Direction

For dynamic content where you do not know the input language in advance:

// Approximate the direction of the first strong character using block ranges.
// Block ranges are a heuristic: some blocks also contain digits, marks, and
// punctuation that are not strongly directional.
function detectDirection(text) {
    for (const char of text) {
        const cp = char.codePointAt(0);
        // RTL ranges: Hebrew, Arabic, Syriac, Thaana, N'Ko, Arabic presentation forms
        if (
            (cp >= 0x0590 && cp <= 0x05FF) ||  // Hebrew
            (cp >= 0x0600 && cp <= 0x06FF) ||  // Arabic
            (cp >= 0x0700 && cp <= 0x074F) ||  // Syriac
            (cp >= 0x0750 && cp <= 0x077F) ||  // Arabic Supplement
            (cp >= 0x0780 && cp <= 0x07BF) ||  // Thaana
            (cp >= 0x07C0 && cp <= 0x07FF) ||  // N'Ko
            (cp >= 0xFB50 && cp <= 0xFDFF) ||  // Arabic Presentation Forms-A
            (cp >= 0xFE70 && cp <= 0xFEFC)     // Arabic Presentation Forms-B (stops before U+FEFF, the BOM)
        ) {
            return 'rtl';
        }
        // If we hit a strongly LTR character first
        if (
            (cp >= 0x0041 && cp <= 0x005A) ||  // A-Z
            (cp >= 0x0061 && cp <= 0x007A) ||  // a-z (skips the punctuation between Z and a)
            (cp >= 0x00C0 && cp <= 0x024F &&
                cp !== 0x00D7 && cp !== 0x00F7)  // Latin-1 letters and Latin Extended, minus × and ÷
        ) {
            return 'ltr';
        }
    }
    return 'ltr';  // Default
}

// Use on a text input or textarea
document.getElementById('input').addEventListener('input', function(e) {
    const dir = detectDirection(e.target.value);
    e.target.setAttribute('dir', dir);
});

In Python:

import unicodedata

def get_bidi_class(char):
    return unicodedata.bidirectional(char)

def detect_paragraph_direction(text):
    """Return 'rtl' or 'ltr' based on first strong character."""
    for char in text:
        bc = unicodedata.bidirectional(char)
        if bc in ('R', 'AL'):
            return 'rtl'
        if bc == 'L':
            return 'ltr'
    return 'ltr'

print(detect_paragraph_direction("مرحبا بالعالم"))  # 'rtl'
print(detect_paragraph_direction("Hello World"))     # 'ltr'

The Trojan Source Attack

In 2021, researchers at the University of Cambridge disclosed a vulnerability they called Trojan Source (CVE-2021-42574). It exploits Unicode bidirectional control characters to make source code appear different to human reviewers than it does to compilers and interpreters.

Consider this seemingly innocuous line of code:

# Legitimate code
access_level = "user‮ ⁦# Check if admin⁩ ⁦"

What it actually contains (with invisible bidi characters revealed):

access_level = "user[RLO] [LRI]# Check if admin[PDI] [LRI]"

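The RLO (Right-to-Left Override) and the isolates that follow it reorder the tail of the string literal on screen, so the closing quote can appear to come directly after user, with the rest looking like a trailing comment. A code reviewer sees a short string followed by a comment, but the interpreter sees a single string literal that contains the comment text and the control characters, so a later comparison against "user" would not behave the way the reviewer expects.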

Mitigations

Disallow bidi control characters in source code: Many security-focused organizations now ban them outright. GitHub added warnings for them in 2021.
Code editor visualization: Configure your editor to make invisible characters visible.
Automated linting: Tools like ruff (Python) and eslint (JavaScript, through plugins) can flag suspicious bidi characters.

# Search for bidi control characters in source code
grep -rn $'\u202a\|\u202b\|\u202c\|\u202d\|\u202e\|\u2066\|\u2067\|\u2068\|\u2069\|\u200e\|\u200f' ./src/

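In Python with ruff, the PLE2502 (bidirectional-unicode) rule flags bidi control characters in source files. The RUF001 to RUF003 rules cover a different problem: visually ambiguous look-alike characters. Enable PLE2502 in your pyproject.toml:

[tool.ruff.lint]
# extend-select adds the rule on top of the default rule set
extend-select = ["PLE2502"]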
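Where a linter is not available, the same check can be sketched with the standard library alone. The function names find_bidi_controls and scan_tree, the "*.py" glob, and the decision to include U+061C (Arabic Letter Mark) are assumptions made for this example, not part of any tool mentioned above.

```python
import unicodedata
from pathlib import Path

# Bidi control characters: ALM, LRM, RLM, the embedding/override set, and the isolates.
BIDI_CONTROLS = frozenset(
    "\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)

def find_bidi_controls(source: str):
    """Yield (line, column, code point, name) for each bidi control, 1-based."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch)

def scan_tree(root: str, pattern: str = "*.py") -> int:
    """Print every hit under root and return the number of hits."""
    hits = 0
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, col, code, name in find_bidi_controls(text):
            print(f"{path}:{line_no}:{col}: {code} {name}")
            hits += 1
    return hits

# Example: check a single suspicious line without touching the file system.
for hit in find_bidi_controls('access_level = "user\u202E"'):
    print(hit)
```

In a CI job, a non-zero return value from scan_tree("./src") could be used to fail the build.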

Building RTL-Compatible Applications

Testing RTL Layouts

To test your UI with RTL text, use browser developer tools:

// Toggle RTL mode in browser console
document.documentElement.setAttribute('dir', 'rtl');
document.documentElement.setAttribute('lang', 'ar');

Or set it permanently for testing:

<!-- Test RTL layout -->
<html lang="ar" dir="rtl">

Django Localization for RTL Languages

Django's built-in localization handles RTL text direction for languages like Arabic, Hebrew, and Persian:

# settings.py
LANGUAGE_CODE = 'ar'  # Arabic
USE_I18N = True

# In templates:
# {% load i18n %}
# <html dir="{{ LANGUAGE_BIDI|yesno:'rtl,ltr' }}">

Input Validation

When accepting user input in a multilingual context, be careful about bidi control characters in data fields:

BIDI_CONTROL_CHARS = frozenset({
    '\u061C',            # ALM (Arabic Letter Mark)
    '\u200E', '\u200F',  # LRM, RLM
    '\u202A', '\u202B', '\u202C', '\u202D', '\u202E',  # LRE/RLE/PDF/LRO/RLO
    '\u2066', '\u2067', '\u2068', '\u2069',  # LRI/RLI/FSI/PDI
})

def sanitize_for_storage(text: str) -> str:
    """Remove bidi control characters from user input."""
    return ''.join(c for c in text if c not in BIDI_CONTROL_CHARS)
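Stripping every control can damage legitimate RTL text that relies on marks such as RLM. A gentler alternative, sketched below under that assumption, keeps the controls but guarantees they are balanced: stray terminators are dropped and any embedding or isolate left open is closed at the end of the string. The function name balance_bidi_controls is hypothetical, and the stack handling is a simplification of the matching rules in UAX #9.

```python
ISOLATE_OPEN = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
EMBED_OPEN = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF, PDI = "\u202C", "\u2069"

def balance_bidi_controls(text: str) -> str:
    """Drop unmatched PDF/PDI and close any embedding or isolate left open."""
    out = []
    stack = []  # holds the terminator each open control is waiting for
    for ch in text:
        if ch in ISOLATE_OPEN:
            stack.append(PDI)
        elif ch in EMBED_OPEN:
            stack.append(PDF)
        elif ch == PDI:
            if PDI not in stack:
                continue  # stray PDI: it could close an isolate in the host text
            # A PDI also ends any embeddings opened inside its isolate.
            while stack.pop() != PDI:
                pass
        elif ch == PDF:
            if not stack or stack[-1] != PDF:
                continue  # stray PDF
            stack.pop()
        out.append(ch)
    # Close whatever is still open, innermost first.
    out.extend(reversed(stack))
    return "".join(out)

print(repr(balance_bidi_controls("user\u202E name")))
```

Whether to strip or to balance depends on the field: identifiers and file names are usually safer with stripping, while free-form comments may need the marks preserved.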

Summary

Unicode bidirectional text handling involves three layers:

Character properties: Every character has a Bidi Class that informs the algorithm
The UBA: An algorithm that determines visual order from logical order
Markup and control characters: HTML dir, CSS direction, and Unicode control characters let you override or assist the algorithm

For web developers, the key practices are:

Always set lang and dir attributes on your HTML root element
Use <bdi> for user-generated content embedded in surrounding text
Adopt CSS logical properties for RTL-compatible layouts
Sanitize bidi control characters from untrusted user input (security)
Test your UI with actual RTL content, not just mirrored placeholder text

