
We have a collection with 70k documents. On prod, our query takes about 2.8 minutes to execute. We tried to reproduce the same case on another instance with the same collection size, and on that instance, despite its worse performance specs, the query takes 6 seconds. What could it be? The cache?
Running:

db.collection.stats()

does not show any information about a cache. Does DocumentDB cache queries?

Here is the executionStats output from the slow DB:

{
    "queryPlanner": {
        "plannerVersion": 1.0,
        "namespace": "nnnnnnnn",
        "winningPlan": {
            "stage": "LIMIT_SKIP",
            "inputStage": {
                "stage": "SORT",
                "sortPattern": {
                    "_tempSortEventId": -1.0
                },
                "inputStage": {
                    "stage": "SUBSCAN",
                    "inputStage": {
                        "stage": "COLLSCAN"
                    }
                }
            }
        }
    },
    "executionStats": {
        "executionSuccess": true,
        "executionTimeMillis": "113311.697",
        "planningTimeMillis": "0.303",
        "executionStages": {
            "stage": "LIMIT_SKIP",
            "nReturned": "50",
            "executionTimeMillisEstimate": "113310.782",
            "inputStage": {
                "stage": "SORT",
                "nReturned": "50",
                "executionTimeMillisEstimate": "113310.776",
                "sortPattern": {
                    "_tempSortEventId": -1.0
                },
                "inputStage": {
                    "stage": "SUBSCAN",
                    "nReturned": "70107",
                    "executionTimeMillisEstimate": "110684.645",
                    "inputStage": {
                        "stage": "COLLSCAN",
                        "nReturned": "70107",
                        "executionTimeMillisEstimate": "67827.520",
                        "inputStage": {
                            "nReturned": "1",
                            "executionTimeMillisEstimate": "0.048"
                        }
                    }
                }
            }
        }
    },
    "serverInfo": {
        "host": "prod",
        "port": 27017.0,
        "version": "4.0.0"
    },
    "ok": 1.0,
    "operationTime": Timestamp(1670838896,1)
}

The request is the following:

aggregate(
    [
        { 
            "$match" : 
            { 
                "ApplicationId" : NUUID("dd25dadc-6b22-4f81-995b-2cce698a111a"), 
                "FilterKeys" : 
                { 
                    "$elemMatch" : 
                    { 
                        "Name" : "CorporateId", 
                        "Value" : "bbbfe3a7-fbec-4c88-8746-adf883a2ae6b" 
                     } 
                } 
             } 
         }, 
         { 
             "$addFields" : 
             { 
                 "_tempSortEventId" : 
                 { 
                     "$toLower" : "$EventId" 
                 } 
              } 
          }, 
          { 
              "$sort" : 
              { 
                  "_tempSortEventId" : -1 
              } 
           }, 
           { 
               "$project" : 
               { 
                   "_tempSortEventId" : 0 
               } 
           }, 
           { 
               "$skip" : 0 
           }, 
           { 
               "$limit" : 50 
           }])

2 Answers


  1. It doesn’t look like you have an index on the ApplicationId field. As you can see from the explain plan, the SUBSCAN stage is a collection scan through all 70,107 docs just to return 50 matching ones.
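
    If it is missing, creating one is a one-liner from the mongo shell; a minimal sketch (db.collection is a placeholder for your real collection):

        db.collection.createIndex({ "ApplicationId": 1 })

    The compound index suggested in the next answer covers this field as well, so you may prefer that one.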

  2. The explain output here is very helpful, thanks for providing it.

    Indexes are the tool that databases typically use to improve the efficiency and performance of queries. There are two primary opportunities for one to be used in this case:

    1. To quickly identify the documents in the collection which belong in the result set according to the filters in the $match
    2. To retrieve the data in the sort order requested by the client (thereby avoiding a manual sort of the data).

    Currently the database is unable to take advantage of either opportunity. Instead it is doing a full scan of all 70,107 documents in the collection along with manually sorting the result set. This is incredibly inefficient and will be expensive both in terms of resource utilization and query duration.

    At the very least, consider creating a compound index on { "ApplicationId": 1, "FilterKeys.Name": 1, "FilterKeys.Value" : 1 }. Such an index should allow the database to retrieve and process only the documents that are relevant to the $match filters.
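
    For example, from the mongo shell (db.collection stands in for your real collection name; this is a sketch, not tested against your data):

        db.collection.createIndex(
            { "ApplicationId": 1, "FilterKeys.Name": 1, "FilterKeys.Value": 1 }
        )

    Because FilterKeys is an array of subdocuments, this becomes a multikey index, which is what the $elemMatch filter needs.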

    I would also recommend pursuing some changes to allow the database to also use the index for the $sort, though it requires a bit more effort in this particular case. As currently written, the sorting logic is as follows:

             { 
                 "$addFields" : 
                 { 
                     "_tempSortEventId" : 
                     { 
                         "$toLower" : "$EventId" 
                     } 
                  } 
              }, 
              { 
                  "$sort" : 
                  { 
                      "_tempSortEventId" : -1 
                  } 
               },
    

    Specifically, the sort is being done on a value that is dynamically calculated during execution, so no index can serve it. The most straightforward fix would be to persist this value on the documents themselves and include that field in the index directly. Assuming the same field name, the index definition would be:

    { "ApplicationId": 1, "FilterKeys.Name": 1, "FilterKeys.Value" : 1, "_tempSortEventId": -1 }
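
    As a sketch of how to get there from the mongo shell (assuming EventId is always a string; db.collection is again a placeholder), you could backfill existing documents and then build the extended index:

        // Persist the lowercased EventId so an index can serve the $sort
        db.collection.find({ "EventId": { "$type": "string" } }).forEach(function (doc) {
            db.collection.updateOne(
                { "_id": doc._id },
                { "$set": { "_tempSortEventId": doc.EventId.toLowerCase() } }
            );
        });

        // Build the index covering both the $match filters and the $sort
        db.collection.createIndex(
            { "ApplicationId": 1, "FilterKeys.Name": 1, "FilterKeys.Value": 1, "_tempSortEventId": -1 }
        )

    New writes would need to set _tempSortEventId at insert/update time as well, since the database will not maintain the derived value for you.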
    

    And the slightly modified aggregation (removing the $addFields that generates the sorting field) might look something like this:

    aggregate(
        [
            { 
                "$match" : 
                { 
                    "ApplicationId" : NUUID("dd25dadc-6b22-4f81-995b-2cce698a111a"), 
                    "FilterKeys" : 
                    { 
                        "$elemMatch" : 
                        { 
                            "Name" : "CorporateId", 
                            "Value" : "bbbfe3a7-fbec-4c88-8746-adf883a2ae6b" 
                         } 
                    } 
                 } 
             },
             { 
                  "$sort" : 
                  { 
                      "_tempSortEventId" : -1 
                  } 
              }, 
              { 
                   "$project" : 
                   { 
                       "_tempSortEventId" : 0 
                   } 
               }, 
               { 
                   "$skip" : 0 
               }, 
               { 
                   "$limit" : 50 
               }
        ]
    )
    

    In exchange for persisting this bit of additional information, the database would be able to do all of the following:

    1. Navigate to the key in the index that represents the first matching result that the client is requesting.
    2. Walk the index in the same order that the client requested the data get returned in.
    3. Find only the documents that are part of the results.
    4. Stop its search after identifying exactly 50 matching results (assuming the same values for $skip and $limit that are present in the question).

    This would be as efficient as it can get. It would likely solve any performance issues that you are currently encountering as well as set you up for better success scaling your application in the future.
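
    To confirm the improvement, you could re-run the explain against the modified pipeline; a minimal sketch from the mongo shell:

        db.collection.explain("executionStats").aggregate([
            { "$match": { /* same filters as above */ } },
            { "$sort": { "_tempSortEventId": -1 } },
            { "$skip": 0 },
            { "$limit": 50 }
        ])

    Seeing an IXSCAN stage in place of the COLLSCAN, with nReturned dropping from 70107 toward 50, would indicate the index is being used.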
