I’m compiling a database of decently sized documents (anywhere from 20 to 500 characters), and I would like to be able to very quickly (under 1s) search through them.
I started with around 100k rows worth of data, and now have around 4 million. I plan to cap it at around 30 million rows.
At the beginning of the project, I wrote a quick search query using Postgres’ full text search and pg_trgm which worked extremely well from an efficiency and accuracy standpoint. At 4M rows, the full text searches are still on the order of milliseconds, but the trigram searches are taking over 2 minutes at times.
Here is the query:
SELECT text
FROM my_table
WHERE text_vector @@ websearch_to_tsquery('simple', 'a string to search')
OR text_vector @@ websearch_to_tsquery('english', 'a string to search')
OR 'a string to search' <<% text
LIMIT 100;
We store and index tsvectors on insertion, which allows us to utilize text_vector directly in full text search.
The reason for adding trigram search is that I want some sort of fuzzy searching. Searching for something like "robots" with full text search works very well, but "robot contests" or "cool robot pics" will leave out a lot of relevant results since it doesn't exactly match up. I'm less concerned with having 100% accuracy to the query and more concerned with giving a good general picture of the results in the database. I don't care about returning many results either, just the 100 most similar to the query. My thought process is to let the full text search match the first n results, and fuzzy search fill in the next 100 - n.
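To make that concrete, the combined query I have in mind would look roughly like this (a sketch rather than what I actually run; <<<-> is pg_trgm's strict-word-similarity distance operator, the ordered counterpart of the <<% filter above):
WITH fts AS (
    -- exact full text hits first ('simple' branch omitted for brevity)
    SELECT text, 0 AS priority
    FROM my_table
    WHERE text_vector @@ websearch_to_tsquery('english', 'a string to search')
    LIMIT 100
),
fuzzy AS (
    -- nearest trigram matches to top up the result set
    SELECT text, 1 AS priority
    FROM my_table
    ORDER BY 'a string to search' <<<-> text
    LIMIT 100
),
merged AS (
    -- de-duplicate, keeping the full text hit when a row appears in both
    SELECT DISTINCT ON (text) text, priority
    FROM (SELECT * FROM fts UNION ALL SELECT * FROM fuzzy) candidates
    ORDER BY text, priority
)
SELECT text
FROM merged
ORDER BY priority
LIMIT 100;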
I’ve added the appropriate indexes for both the full text search and the trigram search based on Postgres’ recommendations.
The trigram index is obviously large, and we expect it to keep getting bigger as we add more and more values.
I’m not a Postgres expert, but from my research, I think my main issue may be that the database just isn’t using enough compute to use the trigram index quickly. I’ve tried increasing settings like shared_buffers, work_mem, and effective_cache_size, but for some reason that has no effect. I’m running the database on an instance with 2 CPUs and 4GB of RAM, and this database is the only thing running on it, so I would like to max out the compute.
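For reference, the kind of changes I tried looked roughly like this (the values are illustrative guesses for a 4GB machine, not settings I'm recommending):
ALTER SYSTEM SET shared_buffers = '1GB';        -- needs a server restart to take effect
ALTER SYSTEM SET effective_cache_size = '3GB';  -- planner hint, picked up on reload
ALTER SYSTEM SET work_mem = '64MB';             -- per-sort/hash memory, picked up on reload
SELECT pg_reload_conf();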
I’m mainly asking this question to start a discussion amongst people who know more than me about Postgres, and to see what steps I could take to conduct a fuzzy search on Postgres, prioritizing speed above all. I’ve also looked into extensions like pgvector to try to achieve the same result, but I think the problem persists: the index is just too big to search quickly enough.
My main questions are:
- If I just "throw hardware" at the problem, can I get these searches under 1s consistently? I don’t mind getting a larger instance, as long as I know that it can handle trigram or vector search well.
- Can I achieve what I want to with the instance size (2 CPU, 4GB RAM) I have now? If so, how can I adjust my DB settings to do so?
- Is there some simple implementation trick/optimization that I can do that I’m just not aware of?
- Are the strings that I’m searching just too large to conduct a trigram/vector search fast enough?
- Is Postgres the right tool to use for this problem? If no, are there any better tools that come to mind?
If this is an involved problem, I have a few ideas on how I can speed up the search in a different way. But following the KISS approach, I’d like to try simple similarity/vector search with Postgres before diving into a much more complex solution. I’m relatively new to Postgres, and knowing that it’s been around for a while, I’d like to tap in to the experienced users and make sure I’m going about this the right way.
EDIT: Below are the definitions of the text and text_vector columns:
text TEXT NOT NULL
text_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', "text")) STORED
And the index I used:
CREATE INDEX trgm_idx ON my_table USING GIST (text gist_trgm_ops);
EDIT 2: As requested in the comments, here are some (abridged) examples of matches on ‘robot contests’ which occur using <<% and not @@. Ideally, I would return these matches using @@ and full text search.
robot contests -> AI-generated images are winning art contests
robot contests -> The robots are coming to take over since this AI release
robit contests -> We won the company's hackathon contest with our new robot
I added the misspelling on purpose. Trigram does a good job with parts of words, but full text does not. I think this is the biggest issue I’d like to solve here.
2 Answers
Perhaps you just created the wrong index; hard to say, since you don’t show your object definitions.
But I’d say that with the trigram indexes, the full-text search is more or less redundant: anything that matches the full-text search should be similar enough. Since you say that you want the 100 most similar entries, you could work with a GiST index and a query as follows:
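Something along these lines, ordering by the strict-word-similarity distance operator <<<-> (the ordered counterpart of your <<% filter) so that a GiST trigram index can drive a nearest-neighbour scan. This is a sketch against your table, not a tested query:
SELECT text,
       'a string to search' <<<-> text AS dist  -- 1 - strict_word_similarity(query, text)
FROM my_table
ORDER BY dist
LIMIT 100;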
Trigram indexes don’t work well when searching against large strings unless the threshold is set very high, especially with the GiST version of the index. You could try the GIN version instead, but I still wouldn’t expect it to be awesome: there are just too many false positives that can’t be excluded without visiting them. (In tension with this, only GiST supports the KNN algorithm, which is used when ordering by a distance operator such as <<<->. However, the performance of that algorithm itself falls apart pretty quickly when you are searching in too-large strings.)

Based on your examples, for the first two I think what you really want is a tsquery using the OR operator (|) between the terms.
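For instance, something along these lines (a sketch; | is the tsquery OR operator):
SELECT text
FROM my_table
WHERE text_vector @@ to_tsquery('english', 'robot | contests')
LIMIT 100;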
For the third one, I think you should implement spell checking as a separate layer of your search. Have a separate dictionary table that stores every word occurring anywhere in your main table; if a query term isn’t in that list, you can assume it is misspelled. You can then search the dictionary table (of short strings) with trigrams for probable matches very efficiently, and either propose the fix to the user or automatically apply it and show the results of the corrected query (possibly with a note that you have done so; that is how Google presents it when it fixes misspellings). A rough sketch of that idea follows at the end of this answer. The open question is whether you store the dictionary words before they are stemmed or after, and if after, how you stem the misspelled words to compare them to the stemmed dictionary.

As for your other questions: it is going to be hard to scale enough that parallelization alone brings you from 2 minutes down to under 1 second, or at least, once you have done so it will be hard to pay the bill. I am not aware of any other tool that magically solves this. You will need an elaborate system and will have to put the pieces together yourself; other tools might make that easier, but I’m not aware of any that just do it all automatically.
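To make the dictionary idea concrete, here is a rough sketch (the table and index names are mine, and the word-splitting rule is deliberately naive):
-- one row per distinct word appearing anywhere in the main table
CREATE TABLE word_dictionary (word text PRIMARY KEY);

-- GiST so the short dictionary strings can be searched with the KNN distance operator
CREATE INDEX word_dictionary_trgm_idx ON word_dictionary USING GIST (word gist_trgm_ops);

INSERT INTO word_dictionary (word)
SELECT DISTINCT lower(w)
FROM my_table, regexp_split_to_table(text, '\W+') AS w
WHERE w <> ''
ON CONFLICT DO NOTHING;

-- closest known words to a (possibly misspelled) query term
SELECT word, word <-> 'robit' AS dist
FROM word_dictionary
ORDER BY dist
LIMIT 5;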