skip to Main Content

A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.

When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.

Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?

2

Answers


  1. Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?

    Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I’m not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.

    The best you can do:

    1. Make sure you have an an index on the tags field. I see in the comments you have done this.

    2. Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.

    3. Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn’t have to manage as much intermediate data.

    4. You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.

    If all that’s still not fast enough and the performance of these queries is critical, you’ll need to do something more drastic:

    1. Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
    2. Try sharding
    3. Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.

    See Mongo’s optimizing query performance for more detail, but again, it is unlikely to help with this use case.

    Login or Signup to reply.
  2. Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?

    Short answer is no, this is due to how Mongo builds an index on an array:

    To index a field that holds an array value, MongoDB creates an index key for each element in the array

    So when you when you query the tags field imagine mongo queries each tag separately then it does an intersection.

    If you run "explain" you will be able to see that after the index scan phase Mongo executes a fetch document phase, this phase in theory should be redundant for an pure index scan which shows this is not the case. So basically Mongo fetches ALL documents that have either of the tags, only then it performs the "$all" logic in the filtering phase.


    So what can you do?

    1. if you have prior knowledge on which tag is sparser you could first query that and only then filter based on the larger tag, I’m assuming this is not really the case but worth considering if possible. If your tags are somewhat static maybe you can precalculate this even.

    Otherwise you will have to reconsider a restructuring that will allow better index usage for this usecase, I will say for most access patterns your structure is better.

    The new structure can be an object like so:

    tags2: {
      tagname1: 1,
      tagname2: 2,
      ...
    }
    

    Now if you built an index on tags2 each key of the object will be indexed separately, this will make mongo skip the "fetch" phase as the index contains all the information needed to execute the following query:

    {"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
    

    I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case.. I can think of more "hacky" approaches but I would say these 2 are the more reasonable ones to actually consider implementing depending on performance requirments.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search