Context: I was building a piece of the back end in Node and optimizing for bandwidth. Basically, I was trying not to send thumbnails (stored as 10 KiB strings) that the client already had, and for a brief time I resorted to comparing the thumbnail data directly since I couldn’t be sure whether they had been updated. I have since moved to a different solution, so this is just a curiosity and bears no real relevance anymore.
Okay, so I’ve discovered this behaviour in V8 where large string comparisons are slow by default, but upon (presumably) hashing these strings, for instance by sticking them into a Set, the comparisons speed up a lot. I’ve confirmed this happens both in Node and in the Chromium-based Electron front end.
Example:
(() => {
    // making 10k character strings
    const big_strings = [];
    const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    for (let i = 0; i < 100; i++) {
        let big_string = "";
        for (let j = 0; j < 10_000; j++) {
            big_string += palette[Math.floor(Math.random() * palette.length)];
        }
        big_strings.push(big_string);
    }
    // comparison functions
    function cmp1(a, b) {
        return a === b;
    }
    function cmp2(a, b) {
        const lol = new Set();
        lol.add(a);
        lol.add(b);
        return a === b;
    }
    // tests
    let sorted_1, sorted_2;
    console.time("Test 1");
    for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
    }
    console.timeEnd("Test 1");
    console.time("Test 1");
    for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
    }
    console.timeEnd("Test 1");
    console.time("Test 1");
    for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
    }
    console.timeEnd("Test 1");
    console.time("Test 2");
    for (let i = 0; i < 100; i++) {
        sorted_2 = big_strings.slice();
        sorted_2.sort(cmp2);
    }
    console.timeEnd("Test 2");
    console.time("Test 1");
    for (let i = 0; i < 100; i++) {
        sorted_1 = big_strings.slice();
        sorted_1.sort(cmp1);
    }
    console.timeEnd("Test 1");
    // sanity check
    let good = 0;
    const total = Math.max(sorted_1.length, sorted_2.length);
    for (let i = 0; i < total; i++) {
        if (sorted_1[i] === sorted_2[i]) {
            good++;
        } else {
            throw new Error("Results do not match.");
        }
    }
    console.log(`Sanity check OK ${good} / ${total}`);
})();
In the example, there are two comparison functions: one that simply does a === b, and another that first sticks both a and b into a disposable Set. I can run the test with the first comparison function as many times as I like (three in this example) and it will always be slow. Then I run the test with the second comparison function just once, and afterwards all tests on those particular strings are blazing fast.
What gives with such behaviour? Why won’t V8 hash the strings itself in the background?
2 Answers
You don’t do a comparison here, you do an equality check, and it seems using Set optimizes it. AFAIK V8 strings are objects, and there could be some equality-check caching involved with Set, since there’s a global string registry and equal strings are actually the same string object. So I guess V8 uses the same hash table for the string registry and for Set. But I wouldn’t rely on that, since the implementation is a black box.
Regarding your example, creating a Set doesn’t help the actual comparison, it just slows it down.
Btw, if you compare individual characters, that might work faster than Set:
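(The snippet that originally followed this answer was not preserved here; as a rough sketch of what a character-by-character comparison might look like — whether it actually beats === depends on the engine and how the strings were built:)

```javascript
// Hypothetical character-by-character equality check, reconstructing
// the approach this answer alludes to. Checks lengths first, then
// compares characters one at a time, bailing out on the first mismatch.
function cmpChars(a, b) {
    if (a.length !== b.length) return false;
    for (let i = 0; i < a.length; i++) {
        if (a[i] !== b[i]) return false;
    }
    return true;
}

console.log(cmpChars("abc", "abc")); // true
console.log(cmpChars("abc", "abd")); // false
```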
(V8 developer here.)

It’s not the hashing (or lack thereof), it’s the flattening (or lack thereof) of strings that have been built up with thousands of += 'x' single-character concatenations.

With a couple of exceptions, V8 usually handles string concatenation by creating a "cons string" (sometimes also called a "rope"), which is essentially a pair of pointers to its two halves; with the += 'x' strategy you are hence creating a very long linked list of single-character strings. Some operations require (and therefore cause) flattening of such strings, others don’t. Always flattening cons strings automatically (whether in the background or not) would have undesirable costs in terms of CPU and memory.

Here’s a way to speed up the test massively by constructing the strings more efficiently:
Beware of microbenchmarks, because they will mislead you! If the strings in your real application are created differently, you will (quite likely) take away incorrect conclusions from your original test.