skip to Main Content

All,

Writing to see if anyone has any input on what they feel the best tech would be for the following scenario. Be it python, solr, redis, memcache, etc.

The situation is as follows.

I have 100 million+ binary strings which are around 1100 characters long…
‘0010100010101001010101011….’

What in your opinion would be the most logical way to do the following?

For a given string of the same number of characters, what would be the most efficient way to find the closest match? By closest, I mean sharing the greatest number of 0’s and 1’s at a given position. Hamming Distance, I believe.

My use case would actually involve taking 100k or so strings and trying to find their best match in the pool of 100 million+ strings.

Any thoughts? No particular tech has to be used, just preferably something that is fairly common.

Curious to see what ideas anyone may have.

Thanks,
Tbone

2

Answers


  1. You could use numpy, R, or MATLAB, or anything else that works with large matrices for this:

    Say you have a NxM matrix A, where N is len(string) and M is the number of strings. And say you have a string S you’re trying to match. You could:

    1. Subtract the array version of S from A
    2. Take the the absolute value of all the elements of the result of (1)
    3. Sum the result of (2) along the axis of N
    4. Argsort the result of (3) to find the indexes of the strings that have the lowest distance to S.
    Login or Signup to reply.
  2. You are basically trying to conduct nearest neighbor search in Hamming space on Elasticsearch.

    Regarding this, a recently proposed FENSHSES method from [1] seems to be the state-of-the-art one on Elasticsearch.

    [1] Mu, C, Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search