
While searching for entries in a MongoDB instance using MongoDB's text index, I receive results that contain any of the words in the input string. For example, if I search for 'google seo', it returns results for 'google seo', 'google', and 'seo'. I only need it to return results that contain the entire string, or at least both words somewhere in the sentence. So results like 'Why should I google seo', 'What is google seo', 'What does google have to do with seo', etc. should be returned. Any combination of the words would be perfect.

I can currently work around the issue by using a MongoDB regex, but that is far slower than the index search, as I have over 250m entries. As a test, index searches took 1.72s on average, while the regex searches took over 27.23s. I want the speed of the index searches with even half the accuracy of the regex searches: if the user can search more quickly, it doesn't really matter that the results aren't the most accurate. Also, programmatically building regex searches that match every word of the input string wherever it appears (e.g. returning only results that contain both 'google' and 'seo' in the same sentence) takes a lot of unnecessary code and still isn't 100% accurate.

The current database schema is as follows:

{
    _id: 0000000000,
    search_string: string,
    difficulty: number,
    clicks: number,
    volume: number,
    keyword: string
 }

The backend is a NodeJS server.

Any help is appreciated.
Thanks!

2 Answers


  1. Chosen as BEST ANSWER

    What worked for me

    Running any sort of regex over potentially hundreds of thousands of documents will always be very time- and resource-intensive. Doing it natively in MongoDB also means the data is not sent in chunks / asynchronously (at least as far as my knowledge extends).
    Instead, there are two approaches that can reduce time, server resources, or bandwidth usage.

  2. Using the server to process the data before sending it over. This might seem obvious, but if you have the hardware overhead to perform such an operation on the server, then it is much better and faster to run the string comparisons server-side and send the data back in chunks to be lazy-loaded into your app. For me this decreased average search times from over 29.3s to just below 2.23s, with a database of 250m entries, 80k results per search, and around 10k-15k filtered results.
  3. If you don't have the processing overhead and are willing to sacrifice bandwidth and user experience, then doing this on the client side isn't out of the question, especially considering how capable modern hardware is. This provides a bit more flexibility: all the data can be shown, with the relevant results first and the 'irrelevant' results last. It does need to be well optimized and implemented on the assumption that your app will mostly run on modern hardware, or preferably not on mobile devices.

  4. These were the best solutions for me. There might be better native techniques, but over the span of a week I wasn't able to find any faster ones.

    EDIT

    I feel it's necessary to elaborate on what kind of processing the data undergoes before it is sent out, and exactly how I do it.
    Currently I have a database of around 250m entries, each with the schema described in the question. A typical query is something like 'who is putin', 'android', or 'iphone 13'. The database is made up of 12 collections, one for each 'major' keyword (what, why, should, how, when, which, who, etc.), so the query is first stripped of those: if the query was 'who is putin', it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way, let me know.
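    The routing step described above could be sketched like this in Node; the keyword list is from the post, but the function name and return shape are my own illustration:

```javascript
// Sketch of the collection-routing step: strip a leading 'major' keyword
// and decide which collection to query (null = check all collections).
const MAJOR_KEYWORDS = ['what', 'why', 'should', 'how', 'when', 'which', 'who'];

function routeQuery(query) {
  const words = query.trim().toLowerCase().split(/\s+/);
  if (MAJOR_KEYWORDS.includes(words[0])) {
    return { collection: words[0], stripped: words.slice(1).join(' ') };
  }
  return { collection: null, stripped: words.join(' ') };
}

console.log(routeQuery('who is putin'));
// { collection: 'who', stripped: 'is putin' }
```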
    After that, we send the query to the database and retrieve the results. The query then goes through another function that strips 'unnecessary' words (is, if, a, be, etc.) and returns an array of the major words, so a query like 'What is going on between Russia and Ukraine' is converted to ['going', 'between', 'Russia', 'Ukraine']. As the results are received, we check each one to see whether it includes all the words from the array, and whichever do are returned to the client. It's a pretty basic operation, since we don't care about case, spaces, and so on; it simply uses the JS includes() method.
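    A minimal sketch of that filtering step; the stop-word list here is an assumption, as the post only names a few examples like 'is', 'if', 'a', 'be':

```javascript
// Assumed stop-word list; extend to match your own data.
const STOP_WORDS = new Set(['what', 'is', 'on', 'and', 'if', 'a', 'be']);

// Reduce a query to its 'major' words, ignoring case.
function majorWords(query) {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w && !STOP_WORDS.has(w));
}

// Keep only results whose search_string contains every major word.
function filterResults(results, query) {
  const words = majorWords(query);
  return results.filter((doc) => {
    const haystack = doc.search_string.toLowerCase();
    return words.every((w) => haystack.includes(w));
  });
}

console.log(majorWords('What is going on between Russia and Ukraine'));
// [ 'going', 'between', 'russia', 'ukraine' ]
```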
    For a query with precisely 2,344,123 results, it takes around 2.12s cold to return the first results and just over 8.32s cold to finish. Running the same query again reduces that to around 0.84s warm for first results and 1.98s warm to finish ('cold' for the first request, 'warm' for subsequent ones).

  • Would combining the two approaches (text search and a regex) work?

    No playground link since this needs a text index to demonstrate, but consider the following sample documents:

    test> db.foo.find()
    [  
      { _id: 1, val: 'google seo' },   
      { _id: 2, val: 'google ' },     
      { _id: 3, val: 'seo random ' },  
      { _id: 4, val: 'none' }       
    ]
    

    As described in the question and noted in the documentation, a search on 'google seo' returns all documents that match at least one of those terms (3 of the 4 in this sample data):

    test> db.foo.find({$text:{$search:'google seo'}})
    [
      { _id: 2, val: 'google ' },
      { _id: 1, val: 'google seo' },
      { _id: 3, val: 'seo random ' }
    ]
    

    If we expand the query predicates to also include regexes on both of the terms via the $all operator, the results are narrowed down to just the single document:

    test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})                                       
    [ 
      { _id: 1, val: 'google seo' } 
    ]
    

    It also works if the words are out of order as we’d expect:

    test> db.foo.insert({_id:5, val:'seo out of order google string'})
    { acknowledged: true, insertedIds: { '0': 5 } }
    test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
    [
      { _id: 1, val: 'google seo' },
      { _id: 5, val: 'seo out of order google string' }
    ]
    

    The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
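    If you build this predicate in the Node.js backend, the regex array can be generated from the raw search string; `buildCombinedQuery` below is an illustrative helper, and the `val` field name matches the sample documents (use your own field, e.g. `search_string`):

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the combined $text + $all predicate shown above from a search string.
function buildCombinedQuery(search, field = 'val') {
  const regexes = search
    .trim()
    .split(/\s+/)
    .map((w) => new RegExp(escapeRegex(w), 'i'));
  return { $text: { $search: search }, [field]: { $all: regexes } };
}

// With the Node.js driver, for example:
// const docs = await db.collection('foo').find(buildCombinedQuery('google seo')).toArray();
```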

    Alternatively, if you are using Atlas, you might look into the Atlas Search functionality. It seems like must or filter would satisfy this use case as well (reference).
