SymbolFYI

JavaScript String & Code Points

Definition

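JavaScript string methods for Unicode: codePointAt(), String.fromCodePoint(), and the spread operator for code point iteration. (Spread iterates code points, not grapheme clusters; grapheme segmentation requires Intl.Segmenter.)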

Unicode in JavaScript

JavaScript strings are encoded as UTF-16, which matters whenever a string contains characters outside the Basic Multilingual Plane (BMP), that is, code points above U+FFFF, including most emoji. ES2015 and later versions added several APIs that handle these characters correctly.

JavaScript String Encoding

JavaScript strings are sequences of UTF-16 code units. Characters in the BMP (U+0000–U+FFFF) occupy one code unit; supplementary characters (U+10000–U+10FFFF) are encoded as surrogate pairs—two code units working together:

'A'.length          // 1 (U+0041, single code unit)
'é'.length          // 1 (U+00E9, single code unit)
'☃'.length          // 1 (U+2603, single code unit)
'😀'.length         // 2 (U+1F600, surrogate pair: 0xD83D 0xDE00)
'👨‍👩‍👧‍👦'.length  // 11 (emoji sequence with ZWJ)
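
The surrogate pair arithmetic can be sketched as follows. The helper names `toSurrogatePair` and `fromSurrogatePair` are hypothetical and used only for illustration; in practice String.fromCodePoint() and codePointAt() do this work for you.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000–U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a supplementary code point');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: combine a surrogate pair back into a code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));      // d83d de00
console.log(fromSurrogatePair(high, low) === 0x1F600); // true
```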

Unicode Escape Sequences

// 4-digit hex escape (BMP only)
'\u2603'            // '☃' (U+2603)

// ES2015 brace notation (any code point)
'\u{2603}'          // '☃' (U+2603)
'\u{1F600}'         // '😀' (U+1F600)

// Code point to character
String.fromCodePoint(0x2603)    // '☃'
String.fromCodePoint(0x1F600)   // '😀'
String.fromCodePoint(65, 66)    // 'AB'
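
Going the other way, a string can be turned into brace-notation escapes, which is handy when debugging invisible or look-alike characters. This is a minimal sketch; `toUnicodeEscapes` is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: render every code point of a string as a \u{...} escape.
function toUnicodeEscapes(str) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(str, (ch) => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase();
    return `\\u{${hex}}`;
  }).join('');
}

console.log(toUnicodeEscapes('☃😀')); // \u{2603}\u{1F600}
```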

Code Points vs. Code Units

// Old API: works with code units (UTF-16)
'😀'.charCodeAt(0)   // 55357 (0xD83D, high surrogate)
'😀'.charCodeAt(1)   // 56832 (0xDE00, low surrogate)

// ES2015 API: works with code points
'😀'.codePointAt(0)  // 128512 (0x1F600, correct)
'😀'.codePointAt(1)  // 56832 (0xDE00: index 1 lands on the low surrogate, not on a full code point)

// Build a character from code units (old API) or from a code point (ES2015 API)
String.fromCharCode(0xD83D, 0xDE00)   // '😀' (manual surrogate pair)
String.fromCodePoint(0x1F600)         // '😀' (clean code point API)
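
Because codePointAt() takes a code unit index, a manual loop has to advance by two after a supplementary character. The sketch below shows one way to do that; `codePointsOf` is a hypothetical helper name.

```javascript
// Hypothetical helper: collect the code points of a string using codePointAt()
// while stepping over the second half of each surrogate pair.
function codePointsOf(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // Supplementary code points occupy two code units, BMP ones occupy one.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

console.log(codePointsOf('☃😀A')); // [9731, 128512, 65]
```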

Iterating Over Characters

Do not use numeric index iteration for strings that may contain supplementary characters:

const text = '☃😀A';

// WRONG: iterates code units, breaks surrogate pairs
for (let i = 0; i < text.length; i++) {
  console.log(text[i]);  // '☃', '\uD83D', '\uDE00', 'A'
}

// CORRECT: for...of iterates code points
for (const char of text) {
  console.log(char);     // '☃', '😀', 'A'
}

// Spread also uses the iterator
[...text]               // ['☃', '😀', 'A']
Array.from(text)        // ['☃', '😀', 'A']

// Code point count (for user-perceived characters, use Intl.Segmenter)
Array.from(text).length  // 3
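
The same concern applies to truncation: slice() counts code units and can cut a surrogate pair in half. A minimal sketch of a code-point-aware alternative, with `truncateCodePoints` as a hypothetical helper name:

```javascript
// Hypothetical helper: keep at most maxCodePoints code points of a string.
function truncateCodePoints(str, maxCodePoints) {
  return Array.from(str).slice(0, maxCodePoints).join('');
}

const sample = '☃😀A';

console.log(sample.slice(0, 2) === '☃\uD83D');         // true: lone high surrogate left behind
console.log(truncateCodePoints(sample, 2) === '☃😀');  // true: the emoji stays intact
```

Note that this still works on code points, so it can split a multi-code-point grapheme cluster such as a ZWJ emoji sequence.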

Regular Expressions and the u Flag

The u flag makes regex operate on code points rather than code units:

// WITHOUT u flag: . matches code units, not code points
/^.$/.test('😀')    // false (emoji is 2 code units)
/^..$/.test('😀')   // true (matches both surrogate code units)

// WITH u flag: . matches full code points
/^.$/u.test('😀')   // true

// Unicode property escapes (ES2018, requires u flag)
/\p{Emoji}/u.test('😀')           // true (note: \p{Emoji} also matches digits, '#', and '*')
/\p{Script=Latin}/u.test('A')    // true
/\p{Script=Cyrillic}/u.test('А') // true
/\p{Number}/u.test('²')          // true
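
As a short illustration of combining the u flag with the g flag, the sketch below extracts runs of text by script and compares how `.` counts with and without u. The sample strings are made up for the example.

```javascript
const mixed = 'Привет, world! 😀';

// Runs of letters belonging to a given script
const cyrillicRuns = mixed.match(/\p{Script=Cyrillic}+/gu);
const latinRuns = mixed.match(/\p{Script=Latin}+/gu);
console.log(cyrillicRuns); // ['Привет']
console.log(latinRuns);    // ['world']

// '.' counts code points with the u flag and code units without it
const withU = '😀A'.match(/./gu).length;
const withoutU = '😀A'.match(/./g).length;
console.log(withU, withoutU); // 2 3
```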

Normalization

// é can be represented two ways:
const nfc = '\u00E9';           // Precomposed: é (1 code point)
const nfd = 'e\u0301';          // Decomposed: e + combining acute (2 code points)

nfc === nfd                    // false (different code points)
nfc.normalize() === nfd.normalize()  // true (both normalize to NFC)
nfc.normalize('NFD') === nfd.normalize('NFD')  // true
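
In practice, this means normalizing both sides before comparing or measuring user input. A minimal sketch, where `equalsNormalized` is a hypothetical helper name:

```javascript
// Hypothetical helper: compare two strings after normalizing both to the same form.
function equalsNormalized(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

const composed = 'caf\u00E9';    // precomposed é
const decomposed = 'cafe\u0301'; // e + combining acute accent

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true
console.log(composed.length, decomposed.length);     // 4 5
console.log(decomposed.normalize('NFC').length);     // 4
```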

Grapheme Clusters with Intl.Segmenter

To count or split user-perceived characters (grapheme clusters) so that emoji sequences stay intact, use Intl.Segmenter:

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...segmenter.segment('👨‍👩‍👧‍👦')];
segments.length  // 1 (one family emoji, one grapheme cluster)
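
The three ways of counting can be compared side by side. This sketch assumes a runtime that provides Intl.Segmenter; `graphemes` is a hypothetical helper name, and the family emoji is written with escapes so that the zero-width joiners are visible.

```javascript
// Hypothetical helper: split a string into grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function graphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// Man, woman, girl, boy joined by U+200D (zero-width joiner)
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(family.length);               // 11 code units
console.log(Array.from(family).length);   // 7 code points
console.log(graphemes(family).length);    // 1 grapheme cluster
console.log(graphemes('e\u0301!').length); // 2 ('é' built from two code points, then '!')
```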
