I have a large data set (hundreds of millions of documents) with documents like the one below, running on MongoDB 7.0. The actual documents are much larger, but these are the fields relevant to this question.
{
"groupId": "12345",
"actionPerformed": "someAction",
"date": "2023-08-17T18:16:58.000Z" // stored as ISO Dates
}
I want to group them by groupId and get the most recent action performed within each group. Is it more efficient for my aggregation to $sort on the date field and then use $first in the group stage, or to use the $top aggregation accumulator in the group stage to do the sorting for me? Both are properly covered by my index, I just can’t find any documentation around optimization on this topic.
2 Answers
1) $sort + $group with $first
This approach involves sorting the documents based on the date field in descending order and then using $group with $first to obtain the most recent action for each groupId.
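As a minimal sketch of that pipeline, assuming a collection named foo and the field names from the question (the output field names here are illustrative):
db.foo.aggregate([
  // Sort newest-first so $first picks the most recent document per group
  { $sort: { date: -1 } },
  { $group: {
      _id: "$groupId",
      mostRecentAction: { $first: "$actionPerformed" },
      mostRecentDate: { $first: "$date" }
  } }
])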
This approach should work and produce the desired result. The efficiency can depend on factors such as the size of the dataset, the distribution of the data, and the specific characteristics of your MongoDB instance.
Use the first approach of $sort on the date field and then use $first in the $group stage. But the key to making this work is that you need to make it a compound sort. Let’s explore!

Setup and Testing
For reference, the two pipelines that we are looking at are: a $sort + $group with $first pipeline, and a $group with $top pipeline.
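Sketches of both, assuming a collection named foo (output field names are illustrative):

// Pipeline 1: $sort + $group with $first (note the compound sort)
db.foo.aggregate([
  { $sort: { groupId: 1, date: -1 } },
  { $group: {
      _id: "$groupId",
      mostRecentAction: { $first: "$actionPerformed" },
      mostRecentDate: { $first: "$date" }
  } }
])

// Pipeline 2: $group with $top (single-field sortBy)
db.foo.aggregate([
  { $group: {
      _id: "$groupId",
      mostRecent: {
        $top: {
          sortBy: { date: -1 },
          output: { action: "$actionPerformed", date: "$date" }
        }
      }
  } }
])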
:Note that we are using a compound sort for the first pipeline but not the second. The motivation for using the compound sort with the first pipeline will become known shortly. At the time of writing, the behavior of the second pipeline described in this answer is the same regardless if the
sortBy
uses the single or compound sort.Now we will create two compound indexes (for completeness) to see how the database is able to leverage them when executing the operation. These are:
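(As a sketch: the first index matches the { groupId: 1, date: -1 } index discussed later in this answer, while the second is an assumed variant included for comparison.)
db.foo.createIndex({ groupId: 1, date: -1 })
db.foo.createIndex({ date: -1, groupId: 1 })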
The standard way to check the efficiency of an operation is to inspect the associated explain plan. We gather those by running db.foo.aggregate(<pipeline>).explain(). When we look at the output for the first pipeline, we can see the following nested inside of the winningPlan:
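An illustrative excerpt only; the exact shape of your plan will differ, but the important detail is the DISTINCT_SCAN stage over the { groupId: 1, date: -1 } index:
{
  "stage": "DISTINCT_SCAN",
  "keyPattern": { "groupId": 1, "date": -1 },
  "direction": "forward"
}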
By contrast, the output from the second pipeline shows the following in the winningPlan:
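Again an illustrative excerpt; the notable part is the COLLSCAN stage:
{
  "stage": "COLLSCAN",
  "direction": "forward"
}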
:"Is it more efficient?"
As we see above, the first pipeline is able to use an index to identify the document for each group, while the latter pipeline opts to do a full collection scan instead. In the broadest sense, index scans are more efficient than collection scans. By extension, the answer to your question about which approach is more efficient is the first pipeline.
But there is always nuance for questions like this. The important thing in this particular case is the cardinality of the data, specifically of the groupId field. If every document effectively had a unique value for groupId, then the COLLSCAN plan would actually be more efficient. This is because the other plan would require scanning the entire index plus accessing all of the documents in the collection, which is more work than ‘just’ scanning the full collection to begin with.

The more documents there are per groupId value, the larger the efficiency improvement of the first pipeline (using the index) over the second pipeline (doing the collection scan). For example, with 100 million documents spread across one million groupId values, the index-based plan only needs to touch roughly one index key and one document per group, rather than reading all 100 million documents.

Explanation
So what is happening here?
Indexes are ordered data structures. For the compound index of { groupId: 1, date: -1 }, the keys will be ordered by groupId first and then by date secondarily. By extension, the database knows that all of the keys associated with each groupId will be located sequentially in the index. Moreover, it also knows that once it reads the first key associated with a groupId, there will be nothing else of interest in the index for the query until the section that contains the next groupId begins.

This knowledge of the index structure is what allows the optimizer to construct the DISTINCT_SCAN plan that leverages the index. But it can only do so when the index can satisfy the requested sort, which is why we had to extend the sort to make it compound even though that does not logically impact the results. Hypothetically the second pipeline could do something similar, but as the explain plans show, this (potentially trickier) optimization is presently not implemented. So what we are doing with the first pipeline is helping guide the optimizer toward a more efficient plan by giving it a ‘simpler’ request to process.
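As a hypothetical illustration of that behavior, with made-up data, the { groupId: 1, date: -1 } index keys and the ones a DISTINCT_SCAN actually reads would look something like:
// Index keys in { groupId: 1, date: -1 } order:
// { groupId: "A", date: 2023-09-01 }  <- read: first (most recent) key for "A"
// { groupId: "A", date: 2023-08-17 }  <- skipped: seek ahead to the next groupId
// { groupId: "A", date: 2023-02-03 }  <- skipped
// { groupId: "B", date: 2023-07-22 }  <- read: first (most recent) key for "B"
// { groupId: "B", date: 2023-05-10 }  <- skipped
// ...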
Supporting Material
There are a couple of places in the official documentation that reference this optimization. One is here, which describes the approach overall.
Separately, the requirements for usage of this optimization are outlined under the "$group stage" portion of this section of the documentation, including that the $group stage only uses the $first or $last accumulator operator.

You can use this playground as a starting place to play around with this concept further.
Additional Consideration
In your description you mentioned "these are the fields related to this question." In there you included a field named "actionPerformed", but you never actually referenced it elsewhere in your question.

If you happen to be using that field in your aggregation logic as well, such as to filter out documents which do not match "someAction", then you may need to include that field in your index definition as well, as sketched below.
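A sketch of what that might look like (the exact filter and index definition here are assumptions based on the fields in the question):
// Hypothetical: filter on actionPerformed before sorting and grouping
db.foo.aggregate([
  { $match: { actionPerformed: "someAction" } },
  { $sort: { groupId: 1, date: -1 } },
  { $group: {
      _id: "$groupId",
      mostRecentDate: { $first: "$date" }
  } }
])

// An index whose prefix covers the equality filter plus the compound sort
db.foo.createIndex({ actionPerformed: 1, groupId: 1, date: -1 })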