
There is an index on requestedAtTimestamp_-1, but the query regularly takes over 5 s, and MongoDB Atlas is shouting at us: "The ratio of documents scanned to returned exceeded 1000.0".

Are there additional indexes that could help to speed up this aggregation?

      {
        "$match": {
          "requestedAtTimestamp": {
            "$gte": 1730419200000
          }
        }
      },
      {
        "$sort": {
          "requestedAtTimestamp": 1
        }
      },
      {
        "$addFields": {
          "requestedAt": {
            "$toDate": "$requestedAtTimestamp"
          }
        }
      },
      {
        "$project": {
          "user": 1,
          "requestedAt": 1,
          "y": {
            "$year": "$requestedAt"
          },
          "m": {
            "$month": "$requestedAt"
          },
          "d": {
            "$dayOfMonth": "$requestedAt"
          },
          "h": {
            "$hour": "$requestedAt"
          },
          "x1": 1,
          "x2": 1
        }
      },
      {
        "$group": {
          "_id": {
            "user": "$user",
            "y": "$y",
            "m": "$m",
            "d": "$d",
            "h": "$h"
          },
          "x1": {
            "$sum": {
              "$cond": [
                "$x1",
                1,
                0
              ]
            }
          },
          "x2": {
            "$sum": {
              "$cond": [
                "$x2",
                1,
                0
              ]
            }
          },
          "total": {
            "$sum": 1
          }
        }
      }
Operation Execution Time   8.29 s
Examined:Returned Ratio    6,169.51
Keys Examined              623,121
Docs Returned              101
Docs Examined              623,121
Num Yields                 693
Response Length            16,593

  "planSummary": "IXSCAN { requestedAtTimestamp: -1 }",
  "planningTimeMicros": 1520,
  "cursorid": 7596442613226102117,
  "keysExamined": 623121,
  "docsExamined": 623121,
  "numYields": 693,
  "nreturned": 101,
  "queryHash": "08BAB40F",
  "planCacheKey": "AEE30401",
  "queryFramework": "classic",
  "reslen": 16593,
  "locks": {
    "FeatureCompatibilityVersion": {
      "acquireCount": {
        "r": 733
      }
    },
    "Global": {
      "acquireCount": {
        "r": 733
      }
    }
  },
  "readConcern": {
    "level": "local",
    "provenance": "implicitDefault"
  },
  "writeConcern": {
    "w": "majority",
    "wtimeout": 0,
    "provenance": "implicitDefault"
  },
  "storage": {
    "data": {
      "bytesRead": 2950570755,
      "timeReadingMicros": 3032242
    },
    "timeWaitingMicros": {
      "cache": 2352
    }
  },
  "cpuNanos": 5626765637,
  "remote": "3.238.26.30:53820",
  "protocol": "op_msg",
  "durationMillis": 8292,
  "v": "7.0.15",
  "isTruncated": false

Doc Format

{
  "_id": {
    "$oid": "673601b421ea2eec217ce968"
  },
  "__v": {
    "$numberInt": "0"
  },
  "lastUpdatedTimestamp": {
    "$numberDouble": "1731592626011.0"
  },
  "requestedAtTimestamp": {
    "$numberDouble": "1731592624752.0"
  },
  "x1": "sadasdasdadasdasdasdasdasdasdasdasdasdasdasdsad",
  "x2": "0x7605ffddabf72cb9fa7ebba35c9c252d5cb4064cc04a314fe34ad2eea4a19c2bff90df2e1b6228ccd136a56b49c1d4e7dde039e94ccf676bbfa00e0ea216e681",
}

2 Answers


  1. To examine the performance of this pipeline, consider what each stage is doing:

    1. Match
          {"$match": {
              "requestedAtTimestamp": {"$gte": 1730419200000}
          }},
    

    According to the plan summary, this stage uses the index on { requestedAtTimestamp: -1 }. It will actually use this index in reverse because of the sort stage, but the work done here is the same:

    • Find 1730419200000 in the b-tree
    • Iterate through the rest of the values in the index
    • For each value found, fetch the corresponding document
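
    You can confirm the reverse traversal with explain; a minimal sketch, assuming the collection is named requests (substitute your own collection name):

      // The IXSCAN stage in the winning plan should report "direction": "backward",
      // i.e. the descending index is walked in reverse to satisfy the ascending sort.
      db.requests.explain("executionStats").aggregate([
        { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } },
        { $sort: { requestedAtTimestamp: 1 } }
      ])
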
    2. Sort
          {"$sort": {"requestedAtTimestamp": 1}},
    

    This stage is entirely unnecessary, since there is a group stage later on in the pipeline, and $group does not guarantee the order of the output documents even when the input is sorted.
    However, since the previous $match used the index on this field, the documents were already fetched in sorted order, so this stage is effectively a no-op anyway.

    3. AddFields
          {"$addFields": {"requestedAt": { "$toDate": "$requestedAtTimestamp"}}},
    

    This converts the requestedAtTimestamp field, which contains a double, to a BSON datetime.
    According to the BSON spec, a double is stored as

    0x01 + field name + 64-bit double value
    

    And a UTC datetime is

    0x09 + field name + signed 64-bit integer milliseconds since epoch
    

    So there is not much difference here, just a double-to-int conversion.
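
    For example, the match boundary from the question converts cleanly; runnable in mongosh:

      // The double 1730419200000 (milliseconds since the epoch) becomes a BSON datetime.
      db.aggregate([
        { $documents: [ { ts: 1730419200000 } ] },
        { $project: { asDate: { $toDate: "$ts" } } }
      ])
      // => { asDate: ISODate("2024-11-01T00:00:00.000Z") }
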

    4. Project
         {
            "$project": {
              "user": 1,
              "requestedAt": 1,
              "y": {"$year": "$requestedAt"},
              "m": {"$month": "$requestedAt"},
              "d": {"$dayOfMonth": "$requestedAt"},
              "h": {"$hour": "$requestedAt"},
              "x1": 1,
              "x2": 1
         }},
    

    This stage will remove all of the original fields except for _id, user, requestedAt, x1, and x2.
    The big cost here is calculating the year, month, day, and hour from the requestedAt field, each one separately. Since this pipeline examined 623,121 documents, that means it converted the number of milliseconds to a date 4 times per document, for a total of 2,492,484 conversions. Also note that without specifying a timezone, these conversions will be based on UTC.
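
    If buckets in a local timezone are ever needed, each of these date-part operators also accepts a document form with an explicit timezone. A minimal sketch (the zone name here is just an example):

      "h": { $hour: { date: "$requestedAt", timezone: "America/New_York" } }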

    5. Group
     {"$group": {
              "_id": {
                 "user": "$user",
                 "y": "$y",
                 "m": "$m",
                 "d": "$d",
                 "h": "$h"
              },
              "x1": {"$sum": {"$cond": ["$x1",1,0]}},
              "x2": {"$sum": {"$cond": ["$x2",1,0]}},
              "total": {"$sum": 1}
      }}
    

    This stage accumulates some values by hour. This is the only blocking stage. The first result document cannot be returned until this stage has processed all input documents.
    A note of caution here: using non-boolean field values as a $cond condition can have unexpected results, such as "" being true.
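
    To see the pitfall directly, a sketch runnable in mongosh:

      // Only false, 0, null, and a missing field are falsy to $cond;
      // an empty string (or empty array) still counts as true.
      db.aggregate([
        { $documents: [ { x1: "" }, { x1: 0 }, { x1: null }, { } ] },
        { $project: { counted: { $cond: [ "$x1", 1, 0 ] } } }
      ])
      // => counted: 1, 0, 0, 0  -- the empty string was counted

    If the intent is "field is present and non-null", an explicit comparison such as { $gt: [ "$x1", null ] } states that directly.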

    Overall

    The timestamp in the question's $match is midnight 1 Nov 2024 UTC. If this were run on 14 Nov, that would be potentially 336 returned documents per user. This query examined 623,121 documents, and because of the $group stage, it had to examine all possible input documents before returning any. This means that the 101 documents returned is likely just the first batch, and the client would run a getMore to get the rest of them. That follow-up getMore request would not need to examine anything; the result set is already complete after the $group stage, so it would just need to return that data.
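
    (The 101 matches the default first-batch size.) If the whole result set should come back in one response, the initial batch size can be raised; a sketch using the raw command form, with a hypothetical collection name:

      db.runCommand({
        aggregate: "requests",               // hypothetical collection name
        pipeline: [
          { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } }
          // ... the remaining stages from the question ...
        ],
        cursor: { batchSize: 1000 }          // the default first batch is 101 documents
      })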

    The biggest improvement here would be reducing the amount of work done in the $project stage.

    A couple ideas for that:

    • Project the year, month, day, and hour all at once with $dateToString, so only 1 calculation is done per document instead of 4. Perhaps
    {$dateToString: {
            date: {$toDate: "$requestedAtTimestamp"},
            format: "%Y-%m-%dT%H",
            timezone: "UTC"
    }}
    
    • Instead of using $project before the $group, for the _id value in the $group, use the following:
       _id: {
            user: "$user",
            hour: {$floor: {$divide: ["$requestedAtTimestamp", 3600000]}}
           }
    

    This will partition the timestamps into whole hours (the $floor is needed so that values within the same hour collapse to a single bucket), without needing to calculate the year, month, or day per document. Then, after grouping, use $project to convert each bucket back to year, month, day, and hour, so those calculations are done hundreds of times instead of hundreds of thousands of times.
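
    Putting both ideas together, a sketch of the reworked pipeline (the collection name requests is an assumption):

      db.requests.aggregate([
        { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } },
        { $group: {
            _id: {
              user: "$user",
              // whole-hour bucket: milliseconds truncated to hours
              hour: { $floor: { $divide: [ "$requestedAtTimestamp", 3600000 ] } }
            },
            x1: { $sum: { $cond: [ "$x1", 1, 0 ] } },
            x2: { $sum: { $cond: [ "$x2", 1, 0 ] } },
            total: { $sum: 1 }
        } },
        // convert each bucket back to a date once per group, not once per document
        { $addFields: { requestedAt: { $toDate: { $multiply: [ "$_id.hour", 3600000 ] } } } },
        { $project: {
            user: "$_id.user",
            y: { $year: "$requestedAt" },
            m: { $month: "$requestedAt" },
            d: { $dayOfMonth: "$requestedAt" },
            h: { $hour: "$requestedAt" },
            x1: 1, x2: 1, total: 1
        } }
      ])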

  2. I don’t think the $addFields and $project stages contribute much to the overall execution time; however, they are not needed, so you can skip them.

    I would try this one:

    { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } },
    { $group: {
        _id: {
          user: "$user",
          hour: { $dateTrunc: {
            date: { $toDate: "$requestedAtTimestamp" },
            unit: "hour"
          } }
        },
        x1: { $sum: { $cond: [ "$x1", 1, 0 ] } },
        x2: { $sum: { $cond: [ "$x2", 1, 0 ] } },
        total: { $count: {} }
      }
    }
    

    In principle the $sort stage has no function here, but due to the $group Performance Optimizations it is worth trying the pipeline with and without it and checking whether it makes any difference.
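
    A sketch of how to check, assuming the collection is named requests: run both variants under explain and compare executionTimeMillis.

      const matchStage = { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } };
      const groupStage = { $group: { _id: "$user", total: { $count: {} } } };  // simplified for illustration

      // without the $sort
      db.requests.explain("executionStats").aggregate([ matchStage, groupStage ]);
      // with the $sort
      db.requests.explain("executionStats").aggregate([ matchStage, { $sort: { requestedAtTimestamp: 1 } }, groupStage ]);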
