
This is my input string:

str = `👍🤌🤔😊😎😂😘👌😒😍❤️🤣`;
// str.length = 24
// str.split("") = [ "\ud83d", "\udc4d", /* 22 more elements */ ]

So when I call Array.from(str), I expect this to be what happens internally:

arr = Array.from({
  length: 24,
  0: "\ud83d", 1: "\udc4d" /* ... and so on */
})

And arr should be the same as str.split(""):

["\ud83d", "\udc4d", /* 22 more elements */ ]

But the value of arr is this:

// arr.length = 13
[
  "👍",  "🤌",  "🤔", "😊",  "😎",  "😂",
  "😘",  "👌",  "😒",  "😍",  "❤",  "️",
  "🤣"
]

For reference, this is equal to what we get if we call str.match(/[\s\S]/gu). Why?

const str = `👍🤌🤔😊😎😂😘👌😒😍❤️🤣`
const arr = Array.from(str)
console.log(arr)

2 Answers


  1. JavaScript strings are stored as a sequence of UTF-16 code units.
    Each code point is encoded as one or two code units; code points outside the Basic Multilingual Plane, such as most emoji, take two (a surrogate pair).

    A string's length property counts UTF-16 code units, not code points or visible characters.

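    Array.from checks whether its argument is iterable before it falls back to array-like access through length and indices. Strings are iterable, and the string iterator yields whole code points, so surrogate pairs stay together.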
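
    A minimal sketch of that precedence, using a made-up object (`both`, not from the question) that is array-like and iterable at the same time. The helper names `fromBoth` and `codePointsOf` are also only for illustration:

    ```javascript
    // Hypothetical object for illustration: array-like AND iterable.
    const both = {
      length: 2,
      0: "from index 0",
      1: "from index 1",
      *[Symbol.iterator]() {
        yield "from iterator";
      },
    };

    // Returns what Array.from produces for the object above.
    function fromBoth() {
      return Array.from(both);
    }

    // A string is handled the same way: its iterator yields code points.
    function codePointsOf(s) {
      return Array.from(s);
    }

    console.log(fromBoth()); // the iterator is used; length and indices are ignored
    console.log("\u{1F44D}".length);               // 2 code units
    console.log(codePointsOf("\u{1F44D}").length); // 1 code point
    ```

    The array-like path from the question is only taken when the argument has no iterator, which is never the case for a string.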

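    str.match(/./gu) uses a regular expression with the u flag (Unicode mode), so each match is a whole code point rather than a single code unit. Because . does not match line terminators, the [\s\S] form from the question is the exact equivalent of Array.from(str).

    Code-point-aware methods like Array.from and the u flag keep surrogate pairs intact, but they do not keep multi-code-point graphemes together. ❤️ is U+2764 followed by the variation selector U+FE0F, which is why it shows up as two elements in arr.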
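
    As a rough illustration of the difference between code units and code points, the sketch below looks at a single emoji three ways. The helper name `compare` is made up for this example, and the strings are written with escapes so the variation selector stays visible:

    ```javascript
    // Hypothetical helper: view one string as code units and as code points.
    function compare(s) {
      return {
        codeUnits: s.split(""),           // splits between UTF-16 code units
        codePoints: Array.from(s),        // string iterator: whole code points
        regex: s.match(/[\s\S]/gu) || [], // u flag: each match is one code point
      };
    }

    const thumb = "\u{1F44D}";    // thumbs up, one code point outside the BMP
    const heart = "\u2764\uFE0F"; // red heart: U+2764 plus variation selector U+FE0F

    console.log(compare(thumb).codeUnits.length);  // 2
    console.log(compare(thumb).codePoints.length); // 1
    console.log(compare(heart).codePoints.length); // 2, the selector is split off
    ```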

  2. Each of these emoji is made up of more than one UTF-16 code unit, and some of more than one code point. The grapheme-splitter package handles this case; its description explains the problem:

    In JavaScript there is not always a one-to-one relationship between
    string characters and what a user would call a separate visual
    "letter". Some symbols are represented by several characters. This can
    cause issues when splitting strings and inadvertently cutting a
    multi-char letter in half, or when you need the actual number of
    letters in a string.

    For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are
    represented by two JavaScript characters each (high surrogate and low
    surrogate). That is,

    "🌷".length == 2
    

    The combined emoji are even longer:

    "🏳️‍🌈".length == 6
    

    […]

    Enter the grapheme-splitter.js library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation on the Default Grapheme Cluster Boundary of UAX #29.

    You ran into the same issue in your example: splitting the string by code point breaks apart emoji that consist of several code points, like ❤️ (U+2764 followed by the variation selector U+FE0F).

    console.log("❤️".length) // 2

    With grapheme-splitter, you can split the string into an array of whole emoji:

    const str = `👍🤌🤔😊😎😂😘👌😒😍❤️🤣`
    const splitter = new GraphemeSplitter()
    const arr = splitter.splitGraphemes(str)
    console.log(arr)
    
    // Output: ["👍", "🤌", "🤔", "😊", "😎", "😂", "😘", "👌", "😒", "😍", "❤️", "🤣"]
    <script src="https://cdn.jsdelivr.net/npm/[email protected]"></script>
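
    If adding a dependency is not an option, the built-in Intl.Segmenter API can also split on grapheme clusters in environments that support it. This is an alternative to the library above, not part of it. A sketch, with `splitGraphemesIntl` as a made-up helper name:

    ```javascript
    // Hypothetical helper built on the standard Intl.Segmenter API.
    function splitGraphemesIntl(s) {
      const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
      // segment() returns an iterable of { segment, index, input } objects.
      return Array.from(segmenter.segment(s), (part) => part.segment);
    }

    // thumbs up, red heart (U+2764 U+FE0F), rolling on the floor laughing
    const sample = "\u{1F44D}\u2764\uFE0F\u{1F923}";

    console.log(Array.from(sample).length);         // 4 code points
    console.log(splitGraphemesIntl(sample).length); // 3 grapheme clusters
    ```

    The heart and its variation selector come back as a single element here, which is the behaviour the question was looking for.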