SymbolFYI

Whitespace Characters

Typography
Definition

Characters that represent horizontal or vertical space (space, tab, newline, etc.) but have no visible glyph.

Unicode has no single whitespace character. Whitespace is a set of characters that represent horizontal or vertical blank space, identified through character properties such as White_Space and the general categories Zs (space separator), Zl (line separator), and Zp (paragraph separator). Programming languages, browsers, regular expression engines, and text processing tools each treat a slightly different subset as whitespace, so behavior varies significantly between them.

Unicode Whitespace Characters

U+0009  CHARACTER TABULATION (horizontal tab)
U+000A  LINE FEED
U+000B  LINE TABULATION (vertical tab)
U+000C  FORM FEED
U+000D  CARRIAGE RETURN
U+0020  SPACE
U+0085  NEXT LINE
U+00A0  NO-BREAK SPACE
U+1680  OGHAM SPACE MARK
U+2000  EN QUAD
U+2001  EM QUAD
U+2002  EN SPACE
U+2003  EM SPACE
U+2004  THREE-PER-EM SPACE
U+2005  FOUR-PER-EM SPACE
U+2006  SIX-PER-EM SPACE
U+2007  FIGURE SPACE
U+2008  PUNCTUATION SPACE
U+2009  THIN SPACE
U+200A  HAIR SPACE
U+2028  LINE SEPARATOR
U+2029  PARAGRAPH SEPARATOR
U+202F  NARROW NO-BREAK SPACE
U+205F  MEDIUM MATHEMATICAL SPACE
U+3000  IDEOGRAPHIC SPACE
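
As a rough cross-check of the list above, the following Python sketch hardcodes the code points that carry the Unicode White_Space property (including U+0085 NEXT LINE) and compares them with str.isspace(). The names WHITE_SPACE and describe are chosen here for illustration; the standard library has no direct lookup for the White_Space property, which is why the set is written out by hand.

```python
import unicodedata

# Code points with the Unicode White_Space property, hardcoded because
# the standard library offers no direct lookup for this property.
WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000E)]      # tab, LF, VT, FF, CR
    + [chr(cp) for cp in range(0x2000, 0x200B)]    # EN QUAD .. HAIR SPACE
    + list("\u0020\u0085\u00a0\u1680\u2028\u2029\u202f\u205f\u3000")
)

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME'; control characters have no formal name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}"

# Print each character alongside Python's own whitespace test.
for ch in sorted(WHITE_SPACE):
    print(describe(ch), "| isspace:", ch.isspace())
```

Note that str.isspace() is slightly broader than this set: it also returns True for the information separators U+001C to U+001F, which do not have the White_Space property.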

CSS White-Space Properties

CSS controls how whitespace in HTML source is rendered:

/* Default: collapse runs of whitespace, wrap lines */
.normal { white-space: normal; }

/* Preserve all whitespace, no wrapping */
.pre { white-space: pre; }

/* Preserve whitespace, allow wrapping */
.pre-wrap { white-space: pre-wrap; }

/* Collapse whitespace, no wrapping */
.nowrap { white-space: nowrap; }

/* Preserve line breaks only, collapse spaces */
.pre-line { white-space: pre-line; }
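
The five values differ along two axes that affect the text itself: whether runs of spaces and tabs collapse, and whether line breaks survive. A minimal Python sketch of that distinction follows. The names MODES and apply_white_space are hypothetical, wrapping is a layout decision and is not modelled, and real browsers apply further rules (for example, trimming spaces at the start and end of a line) that this simplification ignores.

```python
import re

# (collapse spaces/tabs, keep line breaks) for each white-space value.
# Wrapping is not modelled, so normal/nowrap and pre/pre-wrap look alike here.
MODES = {
    "normal":   (True,  False),
    "nowrap":   (True,  False),
    "pre":      (False, True),
    "pre-wrap": (False, True),
    "pre-line": (True,  True),
}

def apply_white_space(text: str, mode: str = "normal") -> str:
    """Approximate how a white-space value transforms source text."""
    collapse, keep_breaks = MODES[mode]
    if not collapse:
        return text  # pre, pre-wrap: text is kept verbatim
    if keep_breaks:
        # pre-line: collapse spaces and tabs within each line and drop
        # those adjacent to a line break, but keep the breaks themselves.
        lines = [re.sub(r"[ \t]+", " ", line).strip(" ")
                 for line in text.split("\n")]
        return "\n".join(lines)
    # normal, nowrap: line breaks collapse together with spaces and tabs.
    return re.sub(r"[ \t\n]+", " ", text)
```

For example, apply_white_space("a   b\n  c", "pre-line") keeps the line break but reduces each run of spaces, while the default mode folds everything onto one line.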

HTML Whitespace Collapsing

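In HTML, by default, any run of ASCII whitespace (spaces, tabs, newlines) in text content is collapsed to a single space when rendered, so a newline in the source behaves like a space. Non-ASCII spaces such as NO-BREAK SPACE (U+00A0) are not collapsed. For example: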

<!-- These render identically -->
<p>Hello     World</p>
<p>Hello World</p>

<!-- To preserve whitespace -->
<pre>Hello     World</pre>
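
The collapsing behavior shown in the HTML above can be approximated outside a browser. The sketch below uses Python's html.parser to extract text, collapsing ASCII whitespace everywhere except inside a pre element. TextCollapser and rendered_text are hypothetical names, and the sketch assumes default styling only; it does not look at CSS, so it is an illustration rather than a faithful rendering model.

```python
import re
from html.parser import HTMLParser

class TextCollapser(HTMLParser):
    """Collect text, collapsing ASCII whitespace except inside <pre>."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0   # how many <pre> elements are currently open
        self.chunks = []     # text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth == 0:
            # Only ASCII whitespace collapses; NBSP and other Unicode
            # spaces are left untouched.
            data = re.sub(r"[ \t\n\r\f]+", " ", data)
        self.chunks.append(data)

def rendered_text(markup: str) -> str:
    """Return the text of an HTML fragment as default rendering would show it."""
    parser = TextCollapser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

Calling rendered_text on the two p elements from the example gives the same string, while the pre variant keeps all five spaces.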

JavaScript Whitespace Handling

// \s in regex matches: space, tab, newline, CR, form feed, vertical tab,
// NBSP (U+00A0), U+FEFF, U+2028, U+2029, and every Unicode space separator
// (category Zs), with or without the u flag
'hello  world'.replace(/\s+/g, ' '); // 'hello world'

// trim() removes the same set as \s from both ends of the string,
// including NBSP (U+00A0) and U+FEFF
'  hello  '.trim(); // 'hello'

// Check if char is any Unicode whitespace
function isUnicodeWhitespace(char) {
  return /^\p{White_Space}$/u.test(char);
}

// The \p{White_Space} property escape (requires the u or v flag) covers every
// character with the Unicode White_Space property
'hello\u2003world'.replace(/\p{White_Space}/gu, '_'); // 'hello_world'
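
For comparison, here is a short Python sketch of the same normalization. Python's standard re module does not support \p{White_Space}, but for str patterns \s already matches Unicode whitespace unless the re.ASCII flag is set. The function name normalize_spaces and its ascii_only parameter are assumptions made for this example.

```python
import re

def normalize_spaces(text: str, ascii_only: bool = False) -> str:
    """Replace each run of whitespace with a single ASCII space.

    With ascii_only=True, \\s is limited to [ \\t\\n\\r\\f\\v], so Unicode
    spaces such as EM SPACE are left in place.
    """
    flags = re.ASCII if ascii_only else 0
    return re.sub(r"\s+", " ", text, flags=flags)

print(normalize_spaces("hello\u2003world"))                         # EM SPACE replaced
print(repr(normalize_spaces("hello\u2003world", ascii_only=True)))  # EM SPACE kept
```

Choosing between the two modes depends on whether exotic spaces in the input are noise to be normalized or meaningful typography to be preserved.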

Typographic Space Characters

The variety of space widths allows fine typographic control:

En space:         &ensp;   (U+2002, 1/2 em)
Em space:         &emsp;   (U+2003, 1 em)
Thin space:       &thinsp; (U+2009, ~1/5 em, sometimes 1/6)
Hair space:       &hairsp; (U+200A, thinner than thin)
Narrow NBSP:      &#x202F; (U+202F, narrow non-breaking, no named entity)
Figure space:     &numsp;  (U+2007, same width as digits)
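
To see which code point a named HTML entity resolves to, Python's html.unescape can be used as a quick lookup. The sketch below also shows one common use of FIGURE SPACE: padding numbers so that digits line up in a column. The helper names entity_info and pad_figures are hypothetical, and the alignment only works visually in fonts where digits and the figure space share one width.

```python
import html
import unicodedata

ENTITIES = ["&ensp;", "&emsp;", "&thinsp;", "&hairsp;", "&numsp;", "&nbsp;"]

def entity_info(entity: str) -> tuple:
    """Resolve a named HTML entity to ('U+XXXX', 'UNICODE NAME')."""
    ch = html.unescape(entity)
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

def pad_figures(number: int, width: int) -> str:
    """Right-align a number using FIGURE SPACE (U+2007) as padding."""
    return str(number).rjust(width, "\u2007")

for entity in ENTITIES:
    print(entity, *entity_info(entity))
```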

Python vs JavaScript Behavior

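# Python str.split() without args splits on every character for which
# str.isspace() is true, which covers all Unicode White_Space characters
'hello\u2003world'.split()  # ['hello', 'world']

# Python str.strip() without args strips the same set; unlike JavaScript's
# trim(), it leaves U+FEFF in place
'\u2003hello\u2003'.strip()  # 'hello'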
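
When the same text passes through both languages, the small differences between the two whitespace sets can matter. The following sketch emulates JavaScript's trim() in Python by stripping exactly the characters that the ECMAScript specification classes as WhiteSpace or LineTerminator. The names JS_WHITESPACE and js_trim are hypothetical, and the set is hardcoded from the specification rather than queried from any engine.

```python
# Characters removed by JavaScript's String.prototype.trim():
# WhiteSpace plus LineTerminator in the ECMAScript specification.
JS_WHITESPACE = (
    "\t\n\v\f\r \u00a0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000\ufeff"
)

def js_trim(text: str) -> str:
    """Strip the JavaScript whitespace set from both ends of text."""
    return text.strip(JS_WHITESPACE)

# U+FEFF is whitespace to JavaScript but not to Python's strip().
print(repr("\ufeffhello".strip()), repr(js_trim("\ufeffhello")))
# U+0085 NEXT LINE is whitespace to Python's strip() but not to JavaScript.
print(repr("\x85hello".strip()), repr(js_trim("\x85hello")))
```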

Understanding the full Unicode whitespace category is essential for building robust text processing systems that correctly handle international content.
