While searching for entries in a mongodb instance using the text indexing function of mongodb, I seem to receive results which contain any of the words in the input string. So for example if I search for ‘google seo’, it’d return results for google seo, google, and seo. I only need it to return results which have the entire string or atleast both of them in the sentence. so results like ‘Why should I google seo’, ‘What is google seo’, ‘What does google have to do with seo’ etc. should return. Any combination of the following would be perfect.
I can currently mitigate the entire issue by just using a mongodb regex but that is way slower than the index search as I have over 250m entires. As a test, index searches took on average 1.72s whilst the regex searches took over 27.23s. I want the speed of the index searches with even just half the accuracy of regex searches as if the user can search quicker, it doesn’t really matter if the results aren’t the most accurate. Also programmatically creating regex searches to match all words in a string if they are just located in the input string anywhere. e.g. for me to return results which contain the words ‘google’ and ‘seo’ in the same sentence, it is alot of unnecessary code which also isnt 100% accurate.
The current data base schema is as follows
{
_id: 0000000000,
search_string: string,
difficulty: number,
clicks: number,
volume: number,
keyword: string
}
The backend is a NodeJS server.
Any help is appreciated.
Thanks!
2
Answers
What worked for me
Running any sort of regex on possibly hundreds of thousands of datapoints will always be very time and resource intensive. Also doing it natively with mongodb means that data is not sent in chunks / asynchronously ( at least as far as my knowledge extends).
Instead there are two approaches that can decrease either time, server resources or bandwidth usage.
These were the best solutions for me. There might be better native techniques but over the span of week, I wasn't able to find any better ( faster ) solutions.
EDIT
I feel it's kind of necessary to elaborate on what kind of processing does the data undergo before it is sent out and exactly how I do it.Currently I have a database of around 250m entries. Each entry having the schema described in the question. The average query would usually be something like 'who is putin', 'android', 'iphone 13' etc. The database is made up of 12 collections for each 'major' keyword (what, why, should, how, when, which, who etc.) so the query is first stripped of those. So if the query was 'who is putin' it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way. let me know
After that we send the query to the database and retrieve the results. the query undergoes another function afterwards which rids it of 'unnecessary' words so words like is, if, a, be etc. are also removed and it returns an array of the major words so a query like 'What is going on between Russia and Ukraine' gets converted to
['going', 'between', 'Russia', 'Ukraine']
. As the results are received, we go over each of them to see if they include all the words from the array and whichever do, are returned to the client. Pretty basic operation here as we don't care about cases, spaces and so on. Simply uses the jscontains()
method.The average times that I get while retrieving a query with precisely 2,344,123 results takes around 2.12s cold to return the first results and just over 8.32s cold to end. Running the same query again reduces times to around .84s warm and 1.98s warm to finish (cold for first time and warm for subsequent requests).
Would combining the two approaches (text search and a regex) work?
No playground link since this needs a text index to demonstrate, but consider the following sample documents:
As described in the question and noted in the documentation, a search on
'google seo'
returns all documents that match at least one of those terms (3 of the 4 in this sample data):If we expand the query predicates to also include regexes on both of the terms via the
$all
operator, the results are narrowed down to just the single document:It also works if the words are out of order as we’d expect:
The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
Alternatively if you are using Atlas you might look into the Atlas Search functionality. Seems like
must
orfilter
would satisfy this use-case as well (reference).