
While searching for entries in a MongoDB instance using MongoDB's text index, I receive results that contain any of the words in the input string. For example, if I search for 'google seo', it returns results for 'google seo', 'google', and 'seo'. I only need it to return results that contain the entire string, or at least both words somewhere in the sentence. So results like 'Why should I google seo', 'What is google seo', 'What does google have to do with seo', etc. should be returned. Any combination of the words would be perfect.

I can currently work around the issue by using a MongoDB regex, but that is far slower than the index search, as I have over 250m entries. As a test, index searches took 1.72s on average, while the regex searches took over 27.23s. I want the speed of the index searches with even half the accuracy of the regex searches: if the user can search more quickly, it doesn't really matter that the results aren't the most accurate. Also, programmatically building regex searches that match every word of the input string wherever it appears (e.g. returning only results that contain both 'google' and 'seo' in the same sentence) takes a lot of unnecessary code and still isn't 100% accurate.

The current database schema is as follows:

{
    _id: 0000000000,
    search_string: string,
    difficulty: number,
    clicks: number,
    volume: number,
    keyword: string
 }

The backend is a NodeJS server.

Any help is appreciated.
Thanks!

2 Answers


  1. Chosen as BEST ANSWER

    What worked for me

    Running any sort of regex over potentially hundreds of thousands of documents will always be very time- and resource-intensive. Doing it natively in MongoDB also means the data is not sent in chunks / asynchronously (at least as far as my knowledge extends).
    Instead, there are two approaches that can reduce time, server resources, or bandwidth usage.

  2. Using the server to process the data before sending it over. This might seem obvious, but if you have the hardware overhead to perform such an operation on the server, then it is much better and faster to run the string comparisons server-side and send the data back in chunks to be lazy-loaded into your app. For me this decreased average search times from over 29.3s to just below 2.23s, with a database of 250m entries, 80k results per search, and around 10k-15k filtered results.
  3. If you don't have the processing overhead and are willing to sacrifice bandwidth and user experience, then doing this on the client side isn't out of the question, especially considering how capable modern hardware is. This provides a bit more flexibility: all the data can be shown, with the relevant results first and the 'irrelevant' results last. It does need to be well optimized and implemented on the assumption that your app will mostly run on modern hardware, or preferably not on mobile devices.

  4. These were the best solutions for me. There might be better native techniques, but over the span of a week I wasn't able to find any faster ones.

    EDIT

    I feel it's necessary to elaborate on what kind of processing the data undergoes before it is sent out, and exactly how I do it.
    Currently I have a database of around 250m entries, each with the schema described in the question. A typical query is something like 'who is putin', 'android', or 'iphone 13'. The database is made up of 12 collections, one for each 'major' keyword (what, why, should, how, when, which, who, etc.), so the query is first stripped of those: if the query was 'who is putin', it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way, let me know.
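    The routing step described above could be sketched like this in Node; the keyword list is from the post, but the function name and return shape are my own illustration:

```javascript
// Sketch of the collection-routing step: strip a leading 'major' keyword
// and decide which collection to query (null = check all collections).
const MAJOR_KEYWORDS = ['what', 'why', 'should', 'how', 'when', 'which', 'who'];

function routeQuery(query) {
  const words = query.trim().toLowerCase().split(/\s+/);
  if (MAJOR_KEYWORDS.includes(words[0])) {
    return { collection: words[0], stripped: words.slice(1).join(' ') };
  }
  return { collection: null, stripped: words.join(' ') };
}

console.log(routeQuery('who is putin'));
// { collection: 'who', stripped: 'is putin' }
```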
    After that, we send the query to the database and retrieve the results. The query then goes through another function that strips 'unnecessary' words (is, if, a, be, etc.) and returns an array of the major words, so a query like 'What is going on between Russia and Ukraine' is converted to ['going', 'between', 'Russia', 'Ukraine']. As the results are received, we check each one to see whether it includes all the words from the array, and whichever do are returned to the client. It's a pretty basic operation, since we don't care about case, spaces, and so on; it simply uses the JS includes() method.
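    A minimal sketch of that filtering step; the stop-word list here is an assumption, as the post only names a few examples like 'is', 'if', 'a', 'be':

```javascript
// Assumed stop-word list; extend to match your own data.
const STOP_WORDS = new Set(['what', 'is', 'on', 'and', 'if', 'a', 'be']);

// Reduce a query to its 'major' words, ignoring case.
function majorWords(query) {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w && !STOP_WORDS.has(w));
}

// Keep only results whose search_string contains every major word.
function filterResults(results, query) {
  const words = majorWords(query);
  return results.filter((doc) => {
    const haystack = doc.search_string.toLowerCase();
    return words.every((w) => haystack.includes(w));
  });
}

console.log(majorWords('What is going on between Russia and Ukraine'));
// [ 'going', 'between', 'russia', 'ukraine' ]
```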
    For a query with precisely 2,344,123 results, it takes around 2.12s cold to return the first results and just over 8.32s cold to finish. Running the same query again reduces that to around 0.84s warm for first results and 1.98s warm to finish ('cold' for the first request, 'warm' for subsequent ones).

  • Would combining the two approaches (text search and a regex) work?

    No playground link since this needs a text index to demonstrate, but consider the following sample documents:

    test> db.foo.find()
    [  
      { _id: 1, val: 'google seo' },   
      { _id: 2, val: 'google ' },     
      { _id: 3, val: 'seo random ' },  
      { _id: 4, val: 'none' }       
    ]
    

    As described in the question and noted in the documentation, a search on 'google seo' returns all documents that match at least one of those terms (3 of the 4 in this sample data):

    test> db.foo.find({$text:{$search:'google seo'}})
    [
      { _id: 2, val: 'google ' },
      { _id: 1, val: 'google seo' },
      { _id: 3, val: 'seo random ' }
    ]
    

    If we expand the query predicates to also include regexes on both of the terms via the $all operator, the results are narrowed down to just the single document:

    test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})                                       
    [ 
      { _id: 1, val: 'google seo' } 
    ]
    

    It also works if the words are out of order as we’d expect:

    test> db.foo.insert({_id:5, val:'seo out of order google string'})
    { acknowledged: true, insertedIds: { '0': 5 } }
    test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
    [
      { _id: 1, val: 'google seo' },
      { _id: 5, val: 'seo out of order google string' }
    ]
    

    The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
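    If you build this predicate in the Node.js backend, the regex array can be generated from the raw search string; `buildCombinedQuery` below is an illustrative helper, and the `val` field name matches the sample documents (use your own field, e.g. `search_string`):

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the combined $text + $all predicate shown above from a search string.
function buildCombinedQuery(search, field = 'val') {
  const regexes = search
    .trim()
    .split(/\s+/)
    .map((w) => new RegExp(escapeRegex(w), 'i'));
  return { $text: { $search: search }, [field]: { $all: regexes } };
}

// With the Node.js driver, for example:
// const docs = await db.collection('foo').find(buildCombinedQuery('google seo')).toArray();
```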

    Alternatively, if you are using Atlas, you might look into the Atlas Search functionality. It seems like must or filter would satisfy this use case as well (reference).
