
I would like to compare 2 large tables of objects from two different databases:

  • an array of 2700 objects
  • an array of 1800 objects

Each object is a scientific publication with 30 properties. The aim is to identify duplicate titles. But since the titles have not been entered in exactly the same way in the two databases, I want to compare the publication titles using the Levenshtein distance implemented in this answer: Compare Strings Javascript Return %of Likely

I’ve tested two ways of doing this:

  • using Map: I store each array in 2 different maps, using the publication title as the key. Then I loop through the first map, then through the second map, and run a Levenshtein test on the 2 keys.
  • using only titles: I create 2 arrays of publication titles. Then I loop through the first array, then through the second array, and run a Levenshtein test on the 2 elements.
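In case it helps frame the problem, the brute-force approach described above looks roughly like this (a minimal sketch: the Levenshtein function here is a plain dynamic-programming version, not the one from the linked answer, and the `threshold` value is an arbitrary placeholder):

```javascript
// Plain dynamic-programming Levenshtein distance, O(m*n) per pair of titles.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const row = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = row[0];
    row[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = row[j];
      row[j] = Math.min(
        row[j] + 1,                              // deletion
        row[j - 1] + 1,                          // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
      prev = tmp;
    }
  }
  return row[n];
}

// Brute force: every title of A against every title of B,
// i.e. |A| * |B| distance calls (2700 * 1800 ≈ 4.9 million here).
function findLikelyDuplicates(titlesA, titlesB, threshold = 3) {
  const pairs = [];
  for (const a of titlesA) {
    for (const b of titlesB) {
      if (levenshtein(a, b) <= threshold) pairs.push([a, b]);
    }
  }
  return pairs;
}
```

Using a Map keyed by title doesn't change this cost: either way, every key/element of one collection is compared against every key/element of the other.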

Both processes are extremely time-consuming. Do you think there’s a more efficient way of doing this?

Thank you very much for your answers.

2 Answers


  1. Chosen as BEST ANSWER

    Thank you very much for this very exhaustive answer.

1/ I don't think threads can help me, because figuring out how to split the array, process it, and merge it back seems rather complicated.

2/ Optimizing the algorithm is indeed something to look at. But even without doing so, simply displaying the 2 titles being compared already made the execution time quite high!

    3/ I've also thought about reducing operations by eliminating obvious cases. I'll look into it operationally.

    Finally, I'm also investigating merging my 2 publication databases under Zotero, and letting it find duplicates for me on its own. It's much less interesting intellectually, but it can do the job to a certain extent. Thanks again.


  2. This question ties back to a basic performance bottleneck inherent to all programming languages. I will assume your objective is to "simply" lower the processing time by any means necessary (so not just optimise the code itself, but also the way the calculations are done). Because there isn’t much else to say…

    In that case, per my experience, those are the ways most would solve it:

    1. (easy) Optimise the compute-intensive task, in your case that would be the distance calculation.
    2. (meh) Minimize the number of operations.
    3. (hard) Throw threads at it. Basically, use parallelism to speed up the process. Your situation can be relatively easily threaded.

    In this answer, I’ll give you solutions, not implementations. You may
    subsequently research more on each of them, maybe implement a mix of them.

    Threads

    This solution is often the go-to one to any slow-processing problem one might have, but it isn’t always the best one and often adds substantial overhead. There are numerous posts and articles about that topic, so I won’t dive into the details here, but in your case, worker threads would probably help speed up the processing (I don’t think you need child processes, but I’ll namedrop them here just in case). The only hurdle will be figuring out how to split the array, process it, and merge it back.
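For what it's worth, the split/merge part is mostly bookkeeping: slice one of the title arrays into chunks, hand each chunk to a worker, and concatenate the results. A minimal sketch of that bookkeeping (the `worker_threads` wiring is only indicated in comments; the chunk count and the worker file name are arbitrary placeholders):

```javascript
// Split an array into `n` roughly equal chunks, one per worker thread.
function splitIntoChunks(arr, n) {
  const chunks = [];
  const size = Math.ceil(arr.length / n);
  for (let i = 0; i < arr.length; i += size) {
    chunks.push(arr.slice(i, i + size));
  }
  return chunks;
}

// Merge the per-worker result arrays back into a single array of matches.
function mergeResults(perWorkerResults) {
  return perWorkerResults.flat();
}

// With node's worker_threads, each chunk would be posted to a Worker that
// compares its chunk against the full second array and posts its matches
// back, along the lines of:
//   const { Worker } = require('node:worker_threads');
//   const worker = new Worker('./compare-worker.js', {
//     workerData: { chunk, titlesB },
//   });
//   worker.on('message', matches => results.push(matches));
```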

    Optimise

A bit like scientific publications (ironically), building on top of the state of the art is usually a good idea. Although I have nothing to say against the implementation that was linked, I am unsure of its performance. There is a package called fastest-levenshtein which publishes performance benchmarks. Depending on how fast (or slow) it is compared to the function you already have, you might be better off using the package’s implementation.

    But as far as that goes, there isn’t much to optimise.

    Minimize

This gets a bit theoretical, but you could consider a stepped process for your comparison to minimise the number of operations. This way of doing things might not be the fastest, especially for fewer items, but done right, it can lead to faster times. As a quick thought, you could organise the following processing pipeline:

    1. Exact matches: first go through the array and check for exact matches (perhaps lowercased and without special characters), then remove them from the array
    2. On the stripped array, run a basic letter matching, then remove them from the array
    3. Continue stripping the array
    4. Do the distance calculation
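A rough sketch of step 1 (exact matches after normalisation), which is where most of the savings come from; `titlesA`/`titlesB` are hypothetical input arrays, and the remaining steps would then run only on the survivors:

```javascript
// Lowercase a title and collapse anything that isn't a letter or digit,
// so near-identical entries compare as exact matches.
function normalize(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

// Step 1 of the pipeline: pull out exact (normalised) matches so the
// expensive distance calculation only runs on what is left.
function removeExactMatches(titlesA, titlesB) {
  const seen = new Map(titlesB.map(t => [normalize(t), t]));
  const exact = [];
  const remainingA = [];
  for (const a of titlesA) {
    const hit = seen.get(normalize(a));
    if (hit !== undefined) {
      exact.push([a, hit]);    // obvious duplicate, no distance needed
    } else {
      remainingA.push(a);      // survivor: goes on to the next step
    }
  }
  return { exact, remainingA };
}
```

The Map lookup makes this step linear in the number of titles, instead of quadratic like the full pairwise distance pass.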

This method, though definitely neither the best nor the fastest, can still help strip the number of operations down to a minimum by removing obvious cases. It wouldn’t be my first thought, but it’s a thought. I’m adding this option mostly for completeness’ sake.
