There is an index on { requestedAtTimestamp: -1 }, but the query regularly takes over 5 s and MongoDB Atlas is shouting at us that "The ratio of documents scanned to returned exceeded 1000.0".
Are there additional indexes that could help speed up this aggregation?
The pipeline:
[
  {
    "$match": {
      "requestedAtTimestamp": { "$gte": 1730419200000 }
    }
  },
  {
    "$sort": { "requestedAtTimestamp": 1 }
  },
  {
    "$addFields": {
      "requestedAt": { "$toDate": "$requestedAtTimestamp" }
    }
  },
  {
    "$project": {
      "user": 1,
      "requestedAt": 1,
      "y": { "$year": "$requestedAt" },
      "m": { "$month": "$requestedAt" },
      "d": { "$dayOfMonth": "$requestedAt" },
      "h": { "$hour": "$requestedAt" },
      "x1": 1,
      "x2": 1
    }
  },
  {
    "$group": {
      "_id": {
        "user": "$user",
        "y": "$y",
        "m": "$m",
        "d": "$d",
        "h": "$h"
      },
      "x1": { "$sum": { "$cond": ["$x1", 1, 0] } },
      "x2": { "$sum": { "$cond": ["$x2", 1, 0] } },
      "total": { "$sum": 1 }
    }
  }
]
Operation Execution Time 8.29 s
Examined:Returned Ratio 6,169.51
Keys Examined 623,121
Docs Returned 101
Docs Examined 623,121
Num Yields 693
Response Length 16,593
"planSummary": "IXSCAN { requestedAtTimestamp: -1 }",
"planningTimeMicros": 1520,
"cursorid": 7596442613226102117,
"keysExamined": 623121,
"docsExamined": 623121,
"numYields": 693,
"nreturned": 101,
"queryHash": "08BAB40F",
"planCacheKey": "AEE30401",
"queryFramework": "classic",
"reslen": 16593,
"locks": {
"FeatureCompatibilityVersion": {
"acquireCount": {
"r": 733
}
},
"Global": {
"acquireCount": {
"r": 733
}
}
},
"readConcern": {
"level": "local",
"provenance": "implicitDefault"
},
"writeConcern": {
"w": "majority",
"wtimeout": 0,
"provenance": "implicitDefault"
},
"storage": {
"data": {
"bytesRead": 2950570755,
"timeReadingMicros": 3032242
},
"timeWaitingMicros": {
"cache": 2352
}
},
"cpuNanos": 5626765637,
"remote": "3.238.26.30:53820",
"protocol": "op_msg",
"durationMillis": 8292,
"v": "7.0.15",
"isTruncated": false
Doc Format
{
  "_id": { "$oid": "673601b421ea2eec217ce968" },
  "__v": { "$numberInt": "0" },
  "lastUpdatedTimestamp": { "$numberDouble": "1731592626011.0" },
  "requestedAtTimestamp": { "$numberDouble": "1731592624752.0" },
  "x1": "sadasdasdadasdasdasdasdasdasdasdasdasdasdasdsad",
  "x2": "0x7605ffddabf72cb9fa7ebba35c9c252d5cb4064cc04a314fe34ad2eea4a19c2bff90df2e1b6228ccd136a56b49c1d4e7dde039e94ccf676bbfa00e0ea216e681"
}
2 Answers
To examine the performance of this pipeline, consider what each stage is doing:
$match: According to the plan summary, this stage uses the index on { requestedAtTimestamp: -1 }. It will actually use this index in reverse because of the sort stage, but the work done here is the same: locate the key 1730419200000 in the b-tree and scan every entry from there.
$sort: This stage is entirely unnecessary, since there is a group stage later on in the pipeline, and $group does not guarantee the order of the output documents even when the input is sorted. However, since the previous $match used the index on this field, the documents were fetched in sorted order already, so this stage is effectively a no-op anyway.
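You can confirm this with explain on the first two stages (a sketch in mongosh; the collection name "requests" is a placeholder):
db.requests.explain("executionStats").aggregate([
  { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } },
  { $sort: { requestedAtTimestamp: 1 } }
])
// expect an IXSCAN stage on { requestedAtTimestamp: -1 } with "direction": "backward"
// and index bounds starting at 1730419200000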
$addFields: This converts the requestedAtTimestamp field, which contains a double, to a BSON datetime. According to the BSON spec, a double is stored as a 64-bit IEEE 754 floating point value, and a UTC datetime is stored as a 64-bit integer holding milliseconds since the Unix epoch. So not really much difference here, just a double-to-int conversion.
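As a quick illustration in mongosh (using the same cutoff value as the $match above):
new Date(1730419200000)   // ISODate('2024-11-01T00:00:00.000Z'), the same instant stored as an int64 instead of a double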
$project: This stage will remove all of the original fields except for _id, user, requestedAt, x1, and x2. The big one here is that it calculates the year, month, day, and hour from the requestedAt field, each one separately. Since this pipeline examined 623,121 documents, that means it converted the number of milliseconds to a date 4 times per document, for a total of 2,492,484 conversions. Also note that without specifying a timezone, these conversions will be based on UTC.
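If the hourly buckets are meant to be in a local timezone rather than UTC, each of these operators also takes a longer form with an explicit timezone (the timezone string below is only an example):
{
  "$project": {
    "h": { "$hour": { "date": "$requestedAt", "timezone": "Europe/Berlin" } }
  }
}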
$group: This stage accumulates some values per user and hour. This is the only blocking stage: the first result document cannot be returned until this stage has processed all input documents.
A note of caution here: using non-boolean field values as a $cond condition can have unexpected results, like "" evaluating to true.
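A quick way to see that behaviour (a standalone check in mongosh; $documents needs a reasonably recent server, and the sample values are made up):
db.aggregate([
  { $documents: [ { x1: "" }, { x1: null }, { x1: 0 }, { x1: false }, { } ] },
  { $project: { counted: { $cond: [ "$x1", 1, 0 ] } } }
])
// only the empty-string document yields counted: 1; null, 0, false and a missing field yield 0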
Overall
The time in the query presented in the question is midnight 1 Nov UTC. If this were run on 14 Nov, that would be potentially 336 returned documents per user. This query examined 623,121 documents, and because of the $group stage, it had to examine all possible input documents before returning any. This means that the 101 documents returned is likely just the first batch, and the client would run a getMore to get the rest of them. That follow-up getMore would not need to examine anything, because the result set is already complete after the $group stage; it just needs to return that data.
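For reference, 101 documents is the default size of the first batch when the client does not ask for a specific batch size. If you want everything in one round trip, you can request a larger first batch (mongosh sketch; "requests" is a placeholder collection name and pipeline stands for the stages from the question):
db.requests.aggregate(pipeline, { batchSize: 1000 })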
The biggest improvement here would be reducing the amount of work done in the $project stage.
A couple ideas for that:
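One approach (a sketch in mongosh; the collection name "requests" is a placeholder, everything else mirrors the original pipeline) is to group on the millisecond timestamp truncated to the hour, and only convert to date parts after the $group:
db.requests.aggregate([
  { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } },
  { $group: {
      _id: {
        user: "$user",
        // truncate the epoch-millisecond value to the start of its (UTC) hour
        hour: { $subtract: [ "$requestedAtTimestamp", { $mod: [ "$requestedAtTimestamp", 3600000 ] } ] }
      },
      x1: { $sum: { $cond: [ "$x1", 1, 0 ] } },
      x2: { $sum: { $cond: [ "$x2", 1, 0 ] } },
      total: { $sum: 1 }
  } },
  { $project: {
      user: "$_id.user",
      // only a few hundred groups reach this stage, so these conversions are cheap
      y: { $year: { $toDate: "$_id.hour" } },
      m: { $month: { $toDate: "$_id.hour" } },
      d: { $dayOfMonth: { $toDate: "$_id.hour" } },
      h: { $hour: { $toDate: "$_id.hour" } },
      x1: 1,
      x2: 1,
      total: 1
  } }
])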
This will partition the dates into hours, without needing to calculate the year, month, or day. Then after grouping, use $project to convert that to year, month, day, and hour so those calculations are done hundreds of times instead of hundreds of thousands.
I don't think the $addFields and $project stages contribute a lot to the overall execution time; however, they are not needed, so you can skip them. I would try this one:
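Something along these lines (a sketch; the collection name "requests" is a placeholder, the fields are the ones from the question):
db.requests.aggregate([
  { $match: { requestedAtTimestamp: { $gte: 1730419200000 } } },
  { $group: {
      _id: {
        user: "$user",
        y: { $year: { $toDate: "$requestedAtTimestamp" } },
        m: { $month: { $toDate: "$requestedAtTimestamp" } },
        d: { $dayOfMonth: { $toDate: "$requestedAtTimestamp" } },
        h: { $hour: { $toDate: "$requestedAtTimestamp" } }
      },
      x1: { $sum: { $cond: [ "$x1", 1, 0 ] } },
      x2: { $sum: { $cond: [ "$x2", 1, 0 ] } },
      total: { $sum: 1 }
  } }
])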
In principle the $sort stage has no function, but due to the $group Performance Optimizations you should try it and check whether it makes any difference.