I have a list of approximately 500,000 terms, each 1 to 10 words long. My goal is to search for these terms in a long text (converted from a PDF, typically 1.5 to 2 pages). I need not only exact matches but also fuzzy matches (e.g., the term ‘Lionel Messi’ should match ‘Lionel Mesi’ in the text) and near matches (e.g., the term ‘Lionel Messi’ should match ‘Lionel J. Messi’ in the text); see the snippet at the end of the question for a concrete illustration.
I aim to solve this in near real time (1-2 seconds). I’ve tried trie data structures and parallelization, but once the fuzzy aspect comes into play, the size of the list and the length of the PDF text lead to long processing times (about 30 seconds).
How should I approach this problem?
- Can I handle it on the fly with Python libraries (using parallelization, trie structures, etc.)?
- Are there features in PostgreSQL that support such searches?
- Should I use a framework like Elasticsearch?
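For concreteness, this is the kind of matching behavior I mean (a minimal sketch using the rapidfuzz library; the exact scores and any threshold are illustrative, not part of my current code):

```python
# Illustrating the two match modes I need, using rapidfuzz (scores out of 100).
from rapidfuzz import fuzz

# Fuzzy: tolerate small typos / dropped letters.
print(fuzz.ratio("Lionel Messi", "Lionel Mesi"))  # ~96, should count as a hit

# Near: tolerate extra tokens such as a middle initial.
print(fuzz.token_set_ratio("Lionel Messi", "Lionel J. Messi"))  # 100, a near hit
```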
2 Answers
PostgreSQL offers native full-text search (https://www.postgresql.org/docs/current/textsearch.html) and trigram matching via the pg_trgm module (https://www.postgresql.org/docs/current/pgtrgm.html).
There are also extensions such as ParadeDB's recent BM25 search (https://docs.paradedb.com/blog/introducing_bm25)
and pgvector (https://github.com/pgvector/pgvector), which searches over vectors generated, e.g., by LLMs.
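A minimal sketch of what the pg_trgm route could look like from Python, assuming a hypothetical table `terms(term text)` already loaded with the 500k entries, the `pg_trgm` extension installed, and a trigram GIN index on the column (the connection string and table name are assumptions):

```python
# Fuzzy lookup with pg_trgm. One-time setup on the database side:
#   CREATE EXTENSION pg_trgm;
#   CREATE INDEX terms_term_trgm_idx ON terms USING gin (term gin_trgm_ops);
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are an assumption
cur = conn.cursor()

query = "Lionel Mesi"
cur.execute(
    """
    SELECT term, similarity(term, %s) AS sim
    FROM terms
    WHERE term %% %s       -- %% escapes pg_trgm's % similarity operator
    ORDER BY sim DESC
    LIMIT 10;
    """,
    (query, query),
)
for term, sim in cur.fetchall():
    print(term, sim)
```

The `%` operator only returns rows above `pg_trgm.similarity_threshold` (0.3 by default), so it can use the GIN index instead of scanning all 500k terms.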
Hello Batuhan, Elasticsearch's fuzzy search can handle this kind of query in near real time. In fact, this is the main reason why Lucene, Elasticsearch, and OpenSearch exist.
Have you indexed your data in Elasticsearch and measured how long a "Lionel Messi" search takes? Here is a good article that explains the Damerau-Levenshtein distance: https://www.elastic.co/blog/found-fuzzy-search
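For reference, a minimal sketch of such a fuzzy query with the official Python client (8.x), assuming the terms are indexed in a hypothetical index `terms` with a text field `term`:

```python
# Fuzzy match query against a hypothetical "terms" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # endpoint is an assumption

resp = es.search(
    index="terms",
    query={
        "match": {
            "term": {
                "query": "Lionel Mesi",
                "fuzziness": "AUTO",  # edit-distance tolerance scaled by term length
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["term"])
```

With `"fuzziness": "AUTO"`, short terms must match exactly while longer ones tolerate one or two edits, which covers the ‘Mesi’/‘Messi’ case; the analyzer's tokenization is what lets ‘Lionel J. Messi’ still match on the ‘Lionel’ and ‘Messi’ tokens.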