All,
Writing to see if anyone has input on what they feel the best technology would be for the following scenario, be it Python, Solr, Redis, Memcached, etc.
The situation is as follows.
I have 100 million+ binary strings, each around 1,100 characters long…
‘0010100010101001010101011….’
What in your opinion would be the most logical way to do the following?
For a given string of the same number of characters, what would be the most efficient way to find the closest match? By closest, I mean sharing the greatest number of 0’s and 1’s at a given position. Hamming Distance, I believe.
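For reference, the Hamming distance between two equal-length bit-strings is just the count of positions where they differ; a one-liner in Python:

```python
def hamming(a: str, b: str) -> int:
    # number of positions at which the two equal-length strings differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("0010100", "0110101"))  # differs at positions 1 and 6 -> 2
```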
My use case would actually involve taking 100k or so strings and trying to find the best match for each in the pool of 100 million+ strings.
Any thoughts? No particular tech has to be used, just preferably something that is fairly common.
Curious to see what ideas anyone may have.
Thanks,
Tbone
2 Answers
You could use NumPy, R, MATLAB, or anything else that works with large matrices for this.
Say you have an N×M matrix A, where N is len(string) and M is the number of strings, and a string S you’re trying to match. You could compare S elementwise against every column of A and sum the mismatches per column; the column with the smallest sum is the closest match.
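A minimal NumPy sketch of that matrix approach, using small stand-in sizes (the real pool of 100M strings at 1,100 bits each is ~110 GB as bytes, so you would batch over columns or bit-pack in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_strings = 1100, 1000   # stand-ins; the real pool is 100M+ strings

# A: one stored string per column, as an N x M 0/1 matrix
A = rng.integers(0, 2, size=(n_bits, n_strings), dtype=np.uint8)
# S: the query string as a length-N 0/1 vector
S = rng.integers(0, 2, size=n_bits, dtype=np.uint8)

# Hamming distance to every column: count positions where S and the column differ
dists = np.count_nonzero(A != S[:, None], axis=0)
best = int(np.argmin(dists))
print(best, int(dists[best]))
```

For the 100k-query batch, you would loop this (or tile it) over chunks of queries rather than materializing one giant 3-D comparison.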
You are basically trying to do nearest-neighbor search in Hamming space. On Elasticsearch, the recently proposed FENSHSES method [1] appears to be the state of the art.
[1] Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.
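The sub-code filtering idea that FENSHSES-style methods build on can be sketched in plain Python. The pigeonhole principle says that if two strings are split into m chunks and their total Hamming distance is at most r with r &lt; m, then at least one chunk must match exactly; an inverted index on chunks then prunes most of the pool before any full-distance computation. This is a toy illustration of the general principle, not the paper's actual Elasticsearch implementation:

```python
from collections import defaultdict

def split_code(s, m):
    # split s into m contiguous sub-codes (last one absorbs any remainder)
    k = len(s) // m
    parts = [s[i * k:(i + 1) * k] for i in range(m - 1)]
    parts.append(s[(m - 1) * k:])
    return parts

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_index(pool, m):
    # inverted index: (chunk position, chunk value) -> ids of strings with that sub-code
    index = defaultdict(set)
    for i, s in enumerate(pool):
        for j, part in enumerate(split_code(s, m)):
            index[(j, part)].add(i)
    return index

def search(query, pool, index, m, r):
    # pigeonhole: if dist(query, s) <= r and r < m, some sub-code matches exactly,
    # so only strings sharing at least one exact chunk can be within radius r
    candidates = set()
    for j, part in enumerate(split_code(query, m)):
        candidates |= index[(j, part)]
    return [i for i in candidates if hamming(query, pool[i]) <= r]

pool = ["0010", "1110", "0011", "1000"]
idx = build_index(pool, m=2)
print(sorted(search("0010", pool, idx, m=2, r=1)))  # ids 0 and 2 are within distance 1
```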