I would like to compare 2 large tables of objects from two different databases:
- an array of 2700 objects
- an array of 1800 objects
Each object is a scientific publication with 30 properties. The aim is to identify duplicate titles. But as the titles have not been entered in exactly the same way in the two databases, I want to compare the publication titles using the Levenshtein distance implemented in this answer: Compare Strings Javascript Return % of Likely
I’ve tested two ways of doing this:
- using Map: I store each array in two different Maps, using the publication title as the key. Then I loop through the first Map, loop through the second Map, and run a Levenshtein test on the two keys.
- using only titles: I create two arrays of publication titles. Then I loop through the first array, loop through the second array, and run a Levenshtein test on the two elements (a minimal sketch of this approach is shown below).
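For reference, here is a minimal sketch of the second approach. The titles, the threshold, and the `levenshtein` placeholder are all stand-ins; in practice the distance function would be the one from the linked answer:

```javascript
// Stand-ins: replace with your real data and the distance function
// from the linked answer.
const titlesA = ['An example publication title'];
const titlesB = ['An example publication titel'];
const levenshtein = (a, b) => Math.abs(a.length - b.length); // placeholder only

// Naive pairwise comparison: 2700 × 1800 ≈ 4.9 million Levenshtein calls.
const THRESHOLD = 3; // maximum edit distance still treated as a duplicate
const duplicates = [];
for (const a of titlesA) {
  for (const b of titlesB) {
    if (levenshtein(a, b) <= THRESHOLD) duplicates.push([a, b]);
  }
}
console.log(duplicates);
```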
Both processes are extremely time-consuming. Do you think there’s a more efficient way of doing this?
Thank you very much for your answers.
2 Answers
Thank you very much for this very exhaustive answer.
1/ I don't think threads can help me: figuring out how to split the array, process it, and merge the results back seems rather complicated.
2/ Optimising the algorithm is indeed something to look into. But even without the Levenshtein test, simply displaying the two titles to be compared already made the execution time quite high!
3/ I've also thought about reducing the number of operations by eliminating obvious cases. I'll look into how to implement that.
Finally, I'm also investigating merging my two publication databases in Zotero and letting it find the duplicates on its own. It's much less interesting intellectually, but it can do the job to a certain extent. Thanks again.
This question ties back to a basic performance bottleneck inherent to all programming languages. I will assume your objective is to "simply" lower the processing time by any means necessary (so not just optimising the code itself, but also the way the calculations are done), because there isn't much else to go on. In that case, in my experience, these are the ways most people would solve it: throwing threads at it (basically, using parallelism to speed up the process; your situation can be threaded relatively easily), optimising the algorithm, and minimising the number of operations. In this answer, I'll give you solutions, not implementations; you may subsequently research each of them further, and maybe implement a mix of them.
Threads
This solution is often the go-to answer to any slow-processing problem, but it isn't always the best one and often adds substantial overhead. There are numerous posts and articles on the topic, so I won't dive into the details here, but in your case, worker threads would probably help speed up the processing (I don't think you need child processes, but I'll namedrop them here just in case). The only hurdle will be figuring out how to split the array, process the chunks, and merge the results back; a minimal sketch of that pattern follows.
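A single-file sketch of that split/process/merge pattern using Node's worker_threads module. The title arrays, the worker count, the threshold, and the stand-in `levenshtein` function are all illustrative; you would plug in your own data and the distance function from the linked answer:

```javascript
// compare.js — single-file worker_threads pattern: the same script acts as
// both the coordinator (main thread) and the workers.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

// Placeholder distance function; swap in your Levenshtein implementation.
const levenshtein = (a, b) => Math.abs(a.length - b.length);
const THRESHOLD = 3; // illustrative

if (isMainThread) {
  // Stand-in data; in practice these would be your 2700 and 1800 titles.
  const titlesA = ['title one', 'title two', 'title three', 'title four'];
  const titlesB = ['title one!', 'a completely different publication title'];

  const numWorkers = 4;
  const chunkSize = Math.ceil(titlesA.length / numWorkers);
  const duplicates = [];
  let finished = 0;

  for (let i = 0; i < numWorkers; i += 1) {
    // Split: each worker gets one slice of titlesA and all of titlesB.
    const chunk = titlesA.slice(i * chunkSize, (i + 1) * chunkSize);
    const worker = new Worker(__filename, { workerData: { chunk, titlesB } });
    worker.on('message', (pairs) => {
      duplicates.push(...pairs); // merge partial results
      finished += 1;
      if (finished === numWorkers) console.log('duplicates:', duplicates);
    });
  }
} else {
  // Process: each worker runs the expensive pairwise pass on its slice only.
  const { chunk, titlesB } = workerData;
  const pairs = [];
  for (const a of chunk) {
    for (const b of titlesB) {
      if (levenshtein(a, b) <= THRESHOLD) pairs.push([a, b]);
    }
  }
  parentPort.postMessage(pairs);
}
```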
Optimise
A bit like scientific publications (ironically), building on top of the state of the art is usually a good idea. Although I have nothing against the implementation you linked, I am unsure of its performance. There is a package called fastest-levenshtein which publishes performance benchmarks. Depending on how fast (or slow) your current function is by comparison, you might be better off using the package's implementation. But beyond swapping the implementation, there isn't much to optimise.
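If you do swap it in, the call site barely changes. A small sketch, assuming the package's `distance` export and converting the result into the percentage-of-likeness figure the linked answer works with; the sample titles are made up:

```javascript
// Requires: npm install fastest-levenshtein
const { distance } = require('fastest-levenshtein');

// Turn an edit distance into a similarity ratio, mirroring the
// "% of likely" approach from the linked answer.
function similarity(a, b) {
  const longer = Math.max(a.length, b.length);
  if (longer === 0) return 1;
  return (longer - distance(a, b)) / longer;
}

console.log(similarity('Publication title', 'Publication titel')); // ≈ 0.88
```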
Minimize
This gets a bit theoretical, but you could consider a stepped process for your comparison to minimise the number of operations. This way of doing things might not be the fastest for small inputs, but done right it can lead to faster times on arrays of your size. As a quick thought, you could organise the following processing pipeline: eliminate the obvious cases cheaply first, then continue stripping the array at each step so that the expensive Levenshtein comparison only runs on the pairs that survive. A sketch follows below.
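A sketch of such a pipeline, under the assumption that exact matches on normalised titles are removed first so the quadratic Levenshtein pass only sees what remains. All names, the normalisation rule, and the distance cutoff are illustrative:

```javascript
// Stepped pipeline: each stage strips the array so the next (more
// expensive) stage runs on fewer items.
const normalize = (t) => t.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();

function findDuplicates(titlesA, titlesB, levenshtein, maxDistance = 3) {
  const duplicates = [];
  // Step 1: cheap exact matching on normalised titles via a Map (O(n + m)).
  const indexB = new Map(titlesB.map((t) => [normalize(t), t]));
  const remainingA = [];
  for (const a of titlesA) {
    const hit = indexB.get(normalize(a));
    if (hit !== undefined) duplicates.push([a, hit]);
    else remainingA.push(a); // strip matched items from the expensive pass
  }

  // Steps 2 and 3, only on what is left:
  for (const a of remainingA) {
    for (const b of titlesB) {
      // Step 2: |len(a) - len(b)| is a lower bound on the edit distance,
      // so wildly different lengths can be skipped without computing it.
      if (Math.abs(a.length - b.length) > maxDistance) continue;
      // Step 3: full Levenshtein test only on surviving pairs.
      if (levenshtein(a, b) <= maxDistance) duplicates.push([a, b]);
    }
  }
  return duplicates;
}

// Example usage with a placeholder distance function:
const fakeLev = (a, b) => Math.abs(a.length - b.length);
console.log(findDuplicates(['A Title'], ['a title!'], fakeLev));
```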