I have a list of approximately 500,000 terms, each 1 to 10 words long. My goal is to search for these terms in a long text (converted from a PDF, typically 1.5 to 2 pages). I need not only exact matches but also fuzzy matches (e.g., the term ‘Lionel Messi’ should match ‘Lionel Mesi’ in the text) and near matches (e.g., the term ‘Lionel Messi’ should match ‘Lionel J. Messi’ in the text); see the snippet at the end of the question for a concrete illustration.
I aim to solve this in near real time (1-2 seconds). I’ve tried trie data structures and parallelization, but once the fuzzy aspect comes into play, the size of the list and the length of the PDF text lead to long processing times (about 30 seconds).
How should I approach this problem?
- Can I handle it on the fly with Python libraries (using parallelization, trie structures, etc.)?
- Are there features in PostgreSQL that support such searches?
- Should I use a framework like Elasticsearch?
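For concreteness, this is the kind of matching behavior I mean (a minimal sketch using the rapidfuzz library; the exact scores and any threshold are illustrative, not part of my current code):

```python
# Illustrating the two match modes I need, using rapidfuzz (scores out of 100).
from rapidfuzz import fuzz

# Fuzzy: tolerate small typos / dropped letters.
print(fuzz.ratio("Lionel Messi", "Lionel Mesi"))  # ~96, should count as a hit

# Near: tolerate extra tokens such as a middle initial.
print(fuzz.token_set_ratio("Lionel Messi", "Lionel J. Messi"))  # 100, a near hit
```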
2 Answers
PostgreSQL offers native full-text search (https://www.postgresql.org/docs/current/textsearch.html) and trigram matching via the pg_trgm module (https://www.postgresql.org/docs/current/pgtrgm.html).
There are also extensions such as ParadeDB's recent BM25 search (https://docs.paradedb.com/blog/introducing_bm25)
and pgvector (https://github.com/pgvector/pgvector), which searches over vectors generated, e.g., by LLMs.
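A minimal sketch of what the pg_trgm route could look like from Python, assuming a hypothetical table `terms(term text)` already loaded with the 500k entries, the `pg_trgm` extension installed, and a trigram GIN index on the column (the connection string and table name are assumptions):

```python
# Fuzzy lookup with pg_trgm. One-time setup on the database side:
#   CREATE EXTENSION pg_trgm;
#   CREATE INDEX terms_term_trgm_idx ON terms USING gin (term gin_trgm_ops);
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are an assumption
cur = conn.cursor()

query = "Lionel Mesi"
cur.execute(
    """
    SELECT term, similarity(term, %s) AS sim
    FROM terms
    WHERE term %% %s       -- %% escapes pg_trgm's % similarity operator
    ORDER BY sim DESC
    LIMIT 10;
    """,
    (query, query),
)
for term, sim in cur.fetchall():
    print(term, sim)
```

The `%` operator only returns rows above `pg_trgm.similarity_threshold` (0.3 by default), so it can use the GIN index instead of scanning all 500k terms.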
Hello Batuhan, Elasticsearch's fuzzy search can handle this kind of query in near real time. In fact, this is the main reason why Lucene, Elasticsearch, and OpenSearch exist.
Have you indexed your data in Elasticsearch and measured how long a "Lionel Messi" search takes? Here is a good article that explains the Damerau-Levenshtein distance: https://www.elastic.co/blog/found-fuzzy-search
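For reference, a minimal sketch of such a fuzzy query with the official Python client (8.x), assuming the terms are indexed in a hypothetical index `terms` with a text field `term`:

```python
# Fuzzy match query against a hypothetical "terms" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # endpoint is an assumption

resp = es.search(
    index="terms",
    query={
        "match": {
            "term": {
                "query": "Lionel Mesi",
                "fuzziness": "AUTO",  # edit-distance tolerance scaled by term length
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["term"])
```

With `"fuzziness": "AUTO"`, short terms must match exactly while longer ones tolerate one or two edits, which covers the ‘Mesi’/‘Messi’ case; the analyzer's tokenization is what lets ‘Lionel J. Messi’ still match on the ‘Lionel’ and ‘Messi’ tokens.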