skip to Main Content

Context: I was building a piece of the back end in node end and optimizing for bandwidth. Basically I was trying to not send thumbnails (stored as 10KiB strings) that the client already had and yada yada, for a brief time I resorted to comparing the thumbnail data directly since I couldn’t be sure whether they had been updated but have moved to a different solution since. This is just a curiosity and bears no real relevance anymore.

Okay so I’ve discovered this behaviour by V8 where large string comparisons are slow by default but upon (presumably) hashing these strings, for instance by sticking them into a Set, the comparisons speed up a lot. I’ve confirmed this to be happening in both node as well as the chromium based electron front end.

Example:

    (()=>{
        //  making 10k character strings
        const big_strings = [];
        const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        for (let i = 0; i < 100; i++) {
            let big_string = "";
            for (let i = 0; i < 10_000; i++) {
                big_string += palette[Math.floor(Math.random() * palette.length)];
            }
            big_strings.push(big_string);
        }

        //  comparison functions
        function cmp1(a, b) {
            return a === b;
        }

        function cmp2(a, b) {
            const lol = new Set();
            lol.add(a);
            lol.add(b);
            return a === b;
        }

        //  tests
        let sorted_1, sorted_2;

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        console.time("Test 2");
        for (let i = 0; i < 100; i++) {
            sorted_2 = big_strings.slice();
            sorted_2.sort(cmp2);
        }
        console.timeEnd("Test 2");

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        //  sanity check
        let good = 0;
        const total = Math.max(sorted_1.length, sorted_2.length);
        for (let i = 0; i < total; i++) {
            if (sorted_1[i] === sorted_2[i]) {
                good++;
            } else {
                throw("Results do not match.");
            }
        }
        console.log(`Sanity check OK ${good} / ${total}`);


    })();

In the example, there are two comparison functions, one that simply does a === b and another that first sticks both a and b into a disposable Set. I can run tests with the first comparison function as many times as I like (3 in this example) and it will always be slow. Then I run the test with the second comparison function just once and afterwards, all tests on those particular strings will be blazing fast.

What gives with such behaviour? Why won’t V8 hash the strings itself in the background?

2

Answers


  1. You don’t do comparison here, you do equality, and seems using Set optimizes it. Afaik V8 strings are objects and there could be some equality check cashing involved with Set, since there’s a global string registry and equal strings are actually the same string object. So I guess V8 uses the same hash table for the string registry and for Set.

    But I wouldn’t rely on that since the implementation is a black box.

    Regarding your example, creating a set doesn’t help to actual comparison, just slows it down.

    ` Chrome/123
    ---------------------------------------------------------------------------------
    >              n=10        |      n=100       |     n=1000      |     n=10000    
    Test 1   ■ 1.00x x100k 692 | ■ 1.00x x10k 338 | ■ 1.00x x1k 138 | ■ 1.00x x10  56
    Test 2     1.18x x100k 819 |   1.74x x10k 589 |   2.87x x1k 396 |   1.96x x10 110
    ---------------------------------------------------------------------------------
    https://github.com/silentmantra/benchmark `
    
    //  making 10k character strings
    let $length = 10;
    const big_strings = [];
    const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    for (let i = 0; i < $length; i++) {
        let big_string = "";
        for (let i = 0; i < 10_0; i++) {
            big_string += palette[Math.floor(Math.random() * palette.length)];
        }
        big_strings.push(big_string);
    }
    
    let $input = big_strings;
    
    //  comparison functions
    function cmp1(a, b) {
        return a > b ? 1 : a < b ? -1 : 0;
    }
    
    function cmp2(a, b) {
        const lol = new Set();
        lol.add(a);
        lol.add(b);
        return a > b ? 1 : a < b ? -1 : 0;
    }
    
    
    // @benchmark Test 1
    $input.toSorted(cmp1);
    
    // @benchmark Test 2
    $input.toSorted(cmp2);
    
    
    
    /*@skip*/ fetch('https://cdn.jsdelivr.net/gh/silentmantra/benchmark/loader.js').then(r => r.text().then(eval));

    Btw if you compare individual characters that might work faster than Set:

    ` Chrome/123
    ------------------------------------------------------------------------------------
    >              n=10        |       n=100       |      n=1000      |     n=10000     
    Test 3   ■ 1.00x   x1m 168 | ■ 1.00x x100k 119 | ■ 1.00x x10k 112 | ■ 1.00x  x1k 135
    Test 2     3.13x   x1m 525 |   4.10x x100k 488 |   4.22x x10k 473 |   3.48x  x1k 470
    Test 1    15.42x x100k 259 |  26.05x  x10k 310 |  34.46x  x1k 386 |  48.81x x100 659
    ------------------------------------------------------------------------------------
    https://github.com/silentmantra/benchmark `
    
    //  making 10k character strings
    let $length = 10;
    const big_strings = [];
    const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    for (let i = 0; i < $length; i++) {
        let big_string = "";
        for (let i = 0; i < 10_0; i++) {
            big_string += palette[Math.floor(Math.random() * palette.length)];
        }
        big_strings.push(big_string);
    }
    
    let $input = big_strings;
    
    //  comparison functions
    function cmp1(a, b) {
        return a === b;
    }
    
    function cmp2(a, b) {
        const lol = new Set();
        lol.add(a);
        lol.add(b);
        return a === b;
    }
    
    function cmp3(a, b) {
       for(let i=0;i<a.length;i++){
        if(a[i]!==b[i]) return false;
       }
       return true;
    }
    
    // @benchmark Test 1
    $input.toSorted(cmp1);
    
    // @benchmark Test 2
    $input.toSorted(cmp2);
    
    // @benchmark Test 3
    $input.toSorted(cmp3);
    
    /*@skip*/ fetch('https://cdn.jsdelivr.net/gh/silentmantra/benchmark/loader.js').then(r => r.text().then(eval));
    Login or Signup to reply.
  2. (V8 developer here.)

    It’s not the hashing (or lack thereof), it’s the flattening (or lack thereof) of strings that have been built up with thousands of += 'x' single-character concatenations.

    With a couple of exceptions, V8 usually handles string concatenations by creating a "cons string" (sometimes also called "rope"), which is essentially a pair of pointers at its two halves; with the += 'x' strategy you’re hence creating a very long linked list of single-character strings. Some operations require (and therefore cause) flattening of such strings, others don’t. Always flattening cons strings automatically (doesn’t matter whether it’s in the background or not) would have undesirable costs in terms of CPU and memory.

    Here’s a way to speed up the test massively by constructing the strings more efficiently:

    (()=>{
      //  making 10k character strings
      const big_strings = [];
      const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
      for (let i = 0; i < 100; i++) {
        // CHANGED: [], .push(...), .join("")
        let big_string = [];
        for (let i = 0; i < 10_000; i++) {
          big_string.push(palette[Math.floor(Math.random() * palette.length)]);
        }
        big_strings.push(big_string.join(""));
        // End of changes, everything below is unchanged.
      }
    
      //  comparison functions
      function cmp1(a, b) {
        return a === b;
      }
    
      function cmp2(a, b) {
        const lol = new Set();
        lol.add(a);
        lol.add(b);
        return a === b;
      }
    
      //  tests
      let sorted_1, sorted_2;
    
      console.time("Test 1");
      for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
      }
      console.timeEnd("Test 1");
    
      console.time("Test 1");
      for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
      }
      console.timeEnd("Test 1");
    
      console.time("Test 1");
      for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
      }
      console.timeEnd("Test 1");
    
      console.time("Test 2");
      for (let i = 0; i < 100; i++) {
        sorted_2 = big_strings.slice();
        sorted_2.sort(cmp2);
      }
      console.timeEnd("Test 2");
    
      console.time("Test 1");
      for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
      }
      console.timeEnd("Test 1");
    
      //  sanity check
      let good = 0;
      const total = Math.max(sorted_1.length, sorted_2.length);
      for (let i = 0; i < total; i++) {
        if (sorted_1[i] === sorted_2[i]) {
          good++;
        } else {
          throw("Results do not match.");
        }
      }
      console.log(`Sanity check OK ${good} / ${total}`);
    
    })();

    Beware of microbenchmarks, because they will mislead you! If the strings in your real application are created differently, you will (quite likely) take away incorrect conclusions from your original test.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search