I have a list of approximately 500,000 terms, each 1 to 10 words long. My goal is to search for these terms in a long text (converted from a PDF, typically 1.5 to 2 pages). I need not only exact matches but also fuzzy matching (e.g., the term ‘Lionel Messi’ should match ‘Lionel Mesi’ in the text) and near matching (e.g., the term ‘Lionel Messi’ should match ‘Lionel J. Messi’ in the text).

I aim to solve this problem in near real-time (1-2 seconds). I’ve tried trie data structures and parallelization, but the size of the list and the length of the PDF lead to long processing times (about 30 seconds), especially once fuzzy matching comes into play.

How should I approach this problem?

  1. Can I handle it on the fly with Python libraries (using parallelization, trie structures, etc.)?
  2. Are there features in PostgreSQL that support such searches?
  3. Should I use a framework like Elasticsearch?
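
For concreteness, the kind of matching I want can be sketched with the rapidfuzz library (just an illustration of the desired behavior, not my current approach; the scores in the comments are approximate):

```python
# Illustration of the two match modes, using the rapidfuzz library.
from rapidfuzz import fuzz

# Fuzzy: tolerate misspellings via edit distance.
print(fuzz.ratio("Lionel Messi", "Lionel Mesi"))                # ~96

# Near: tolerate extra tokens such as a middle initial.
print(fuzz.token_set_ratio("Lionel Messi", "Lionel J. Messi"))  # 100
```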

2 Answers


  1. PostgreSQL offers native full-text search capabilities, such as its built-in text search (https://www.postgresql.org/docs/current/textsearch.html) and the pg_trgm trigram extension (https://www.postgresql.org/docs/current/pgtrgm.html).

    There are also extensions such as the recent BM25 extension from ParadeDB (https://docs.paradedb.com/blog/introducing_bm25) or pgvector (https://github.com/pgvector/pgvector), which works on vectors generated (e.g.) by LLMs. A sketch of a pg_trgm lookup follows.
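
    A minimal sketch of the pg_trgm route in Python, assuming the term list sits in a terms table (the table name, connection string, and the 0.4 threshold are placeholders):

    ```python
    # Fuzzy lookup of one candidate phrase against the term list via pg_trgm.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
    cur = conn.cursor()

    # One-time setup: enable the extension and index the terms so trigram
    # similarity lookups stay fast even with ~500k rows.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
    cur.execute("CREATE TABLE IF NOT EXISTS terms (term text PRIMARY KEY);")
    cur.execute("""
        CREATE INDEX IF NOT EXISTS terms_trgm_idx
        ON terms USING gin (term gin_trgm_ops);
    """)
    conn.commit()

    # 'Lionel Mesi' (a misspelling in the PDF text) should still find
    # 'Lionel Messi' with a relaxed similarity threshold.
    cur.execute("SET pg_trgm.similarity_threshold = 0.4;")
    cur.execute("""
        SELECT term, similarity(term, %s) AS score
        FROM terms
        WHERE term %% %s  -- trigram similarity; %% is an escaped literal for psycopg2
        ORDER BY score DESC
        LIMIT 5;
    """, ("Lionel Mesi", "Lionel Mesi"))
    print(cur.fetchall())
    ```

    To scan a whole page you would extract candidate word n-grams (1 to 10 words, mirroring the term lengths) and run this lookup per candidate; the GIN index keeps each lookup cheap.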

  2. Hello Batuhan, Elasticsearch fuzzy search can handle this kind of query in near real-time. In fact, this is the main reason why Lucene, Elasticsearch, and OpenSearch exist.

    Have you indexed your data in Elasticsearch and measured the time a ‘Lionel Messi’ search takes?

    Here is a good article that explains the Damerau-Levenshtein distance: https://www.elastic.co/blog/found-fuzzy-search
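
    As a rough sketch with the official Python client, assuming an index named terms with a single text field term (the index name, field name, and localhost URL are placeholders):

    ```python
    # Fuzzy search of the term index with the official Elasticsearch client.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder URL

    # Index the term list (use the bulk helpers for all 500k entries).
    es.index(index="terms", document={"term": "Lionel Messi"})
    es.indices.refresh(index="terms")

    # 'Lionel Mesi' matches via edit distance; "AUTO" picks a maximum
    # Damerau-Levenshtein distance based on the token length.
    resp = es.search(index="terms", query={
        "match": {
            "term": {"query": "Lionel Mesi", "fuzziness": "AUTO"}
        }
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["term"], hit["_score"])
    ```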
