skip to Main Content

I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:

{
 _id: ...,
 _session: {
    _id: ...
 },
data: {...}
}

I need to get all the _session._id for my application. The following approach (python) takes too long to get them:

cursor = collection.find({}, projection={'_session._id': 1})

I have created an Index in MongoDB Compass, but I’m not sure if my query is making use of it at all.

Is there a way to speed this query such that I get all the _session._id very fast?

2

Answers


  1. In mongo shell you can hint() the query optimizer to use the available index as follow:

    db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
    

    Following test is confirmed to work via python:

     import pymongo
     db=pymongo.MongoClient("mongodb://user:pass@localhost:12345")
     mydb=db["test"]
     docs= mydb.test2.find( {} ).hint([ ("x.y", pymongo.ASCENDING) ])
     for i in docs:
        print(i)
    

    db.test2.createIndex({"x.y":1})

        {
            "v" : 2,
            "key" : {
              "x.y" : 1
             },
            "name" : "x.y_1"
        }
    

    python 3.7 ,
    pymongo 3.11.2 ,
    mongod 5.0.5

    In your case seems to be text index , btw it seems abit strange why session is text index , for text index somethink like this must work:

     db.test2.find({}).hint("x.y_text").explain()
    

    And here is working example with text index:

    import pymongo
    db=pymongo.MongoClient("mongodb://user:pass@localhost:123456")
    print('Get first 10 docs from test.test:')
    mydb=db["test"]
    docs= mydb.test2.find( {"x.y":"3"} ).hint( "x.y_text" )
    print("===start:====")
    for i in docs:
         print(i)
    

    db.test2.createIndex({"x.y":"text"}):

        {
        "v" : 2,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "name" : "x.y_text",
        "weights" : {
            "x.y" : 1
        },
        "default_language" : "english",
        "language_override" : "language",
        "textIndexVersion" : 3
    }
    
    Login or Signup to reply.
  2. There are a few points of confusion in this question and the ensuing discussion which generally come down to:

    1. What indexes are present in the environment (and why the attempts to hint it failed)
    2. When using indexing is most appropriate

    Current Indexes

    I think there are at least 5 indexes that were mentioned so far:

    1. A standard index of {"_session._id":1} mentioned originally in @R2D2’s answer.
    2. A text index on the _session._id field (mentioned in this comment)
    3. A text index on the _ts_meta.session field (mentioned in this comment)
    4. A standard index of {"x.y":1} mentioned second in @R2D2’s answer.
    5. A text index of {"x.y":"text"} mentioned at the end of @R2D2’s answer.

    Only the first of these is likely to even really be relevant to the original question. Note that the difference a text index is a specialized index that is meant for performing more advanced text searching. Such indexes are not required for simple string matching or value retrieval. But standard indexes, { '_session._id': 1}, will also store string values and are relevant here.

    What Indexing is For

    Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn’t consider using any index at all.

    Now it is still possible that an index could help in this situation. That would be if we used it to perform a covered query which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would have to scan the full index, so it is not clear that it would be faster or not. But you could certainly try. To do so you would need to follow @R2D2’s instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):

    db.collection.createIndex({"_session._id":1})
    db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
    

    Additional Questions

    There were two other things mentioned in the question that are important to address.

    I have created an Index in MongoDB Compass, but I’m not sure if my query is making use of it at all.

    We talked about why this was the case above. But to find out if the database is using it or not you could navigate to the Explain tab in compass to take a look. If you explain plan visualization it should indicate if the index was used. Remember that you will need to hint the index based on your query.

    Is there a way to speed this query such that I get all the _session._id very fast?

    What is your definition of "very fast" here?

    The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.

    If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search