I have a large data set (hundreds of millions of documents) with documents like the one below, running on MongoDB 7.0. The actual documents are much larger, but these are the fields relevant to this question.
{
"groupId": "12345",
"actionPerformed": "someAction",
"date": "2023-08-17T18:16:58.000Z" // stored as ISO Dates
}
I want to group them by groupId and get the most recent action performed within each group. Is it more efficient for my aggregation to $sort on the date field and then use $first in the group stage, or to use the $top aggregation accumulator in the group stage to do the sorting for me? Both are properly covered by my index, I just can’t find any documentation around optimization on this topic.
2 Answers
1) $sort + $group with $first
This approach involves sorting the documents based on the date field in descending order and then using $group with $first to obtain the most recent action for each groupId.
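As a minimal sketch of that pipeline, assuming a collection named foo and the field names from the question (the output field names here are illustrative):
db.foo.aggregate([
  // Sort newest-first so $first picks the most recent document per group
  { $sort: { date: -1 } },
  { $group: {
      _id: "$groupId",
      mostRecentAction: { $first: "$actionPerformed" },
      mostRecentDate: { $first: "$date" }
  } }
])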
This approach should work and produce the desired result. The efficiency can depend on factors such as the size of the dataset, the distribution of the data, and the specific characteristics of your MongoDB instance.
Use the first approach of $sort on the date field and then use $first in the $group stage. But the key to making this work is that you need to make it a compound sort. Let’s explore!

Setup and Testing
For reference, the two pipelines that we are looking at are: a $sort + $group with $first pipeline, and a $group with $top pipeline.
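Sketches of both, assuming a collection named foo (output field names are illustrative):

// Pipeline 1: $sort + $group with $first (note the compound sort)
db.foo.aggregate([
  { $sort: { groupId: 1, date: -1 } },
  { $group: {
      _id: "$groupId",
      mostRecentAction: { $first: "$actionPerformed" },
      mostRecentDate: { $first: "$date" }
  } }
])

// Pipeline 2: $group with $top (single-field sortBy)
db.foo.aggregate([
  { $group: {
      _id: "$groupId",
      mostRecent: {
        $top: {
          sortBy: { date: -1 },
          output: { action: "$actionPerformed", date: "$date" }
        }
      }
  } }
])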
:Note that we are using a compound sort for the first pipeline but not the second. The motivation for using the compound sort with the first pipeline will become known shortly. At the time of writing, the behavior of the second pipeline described in this answer is the same regardless if the
sortBy
uses the single or compound sort.Now we will create two compound indexes (for completeness) to see how the database is able to leverage them when executing the operation. These are:
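(As a sketch: the first index matches the { groupId: 1, date: -1 } index discussed later in this answer, while the second is an assumed variant included for comparison.)
db.foo.createIndex({ groupId: 1, date: -1 })
db.foo.createIndex({ date: -1, groupId: 1 })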
The standard way to check the efficiency of an operation is to inspect the associated explain plan. We gather those by running db.foo.aggregate(<pipeline>).explain(). When we look at the output for the first pipeline, we can see the following nested inside of the winningPlan:
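An illustrative excerpt only; the exact shape of your plan will differ, but the important detail is the DISTINCT_SCAN stage over the { groupId: 1, date: -1 } index:
{
  "stage": "DISTINCT_SCAN",
  "keyPattern": { "groupId": 1, "date": -1 },
  "direction": "forward"
}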
By contrast, the output from the second pipeline shows the following in the winningPlan:
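Again an illustrative excerpt; the notable part is the COLLSCAN stage:
{
  "stage": "COLLSCAN",
  "direction": "forward"
}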
:"Is it more efficient?"
As we see above, the first pipeline is able to use an index to identify the document for each group, while the latter pipeline opts to do a full collection scan instead. In the broadest sense, index scans are more efficient than collection scans. By extension, the answer to your question about which approach is more efficient is the first pipeline.
But there is always nuance for questions like this. The important thing in this particular case is the cardinality of the data, specifically of the groupId field. If every document effectively had a unique value for groupId, then the COLLSCAN plan would actually be more efficient. This is because the other plan would require scanning the entire index plus accessing all of the documents in the collection, which is more work than ‘just’ scanning the full collection to begin with.

The more documents there are per groupId value, the larger the efficiency improvement of the first pipeline (using the index) over the second pipeline (doing the collection scan). For example, with 100 million documents spread across one million groupId values, the index-based plan only needs to touch roughly one index key and one document per group, rather than reading all 100 million documents.

Explanation
So what is happening here?
Indexes are ordered data structures. For the compound index of { groupId: 1, date: -1 }, the keys will be ordered by groupId first and then by date secondarily. By extension, the database knows that all of the keys associated with each groupId will be located sequentially in the index. Moreover, it also knows that once it reads the first key associated with a groupId, there will be nothing else of interest in the index for the query until the section that contains the next groupId begins.

This knowledge of the index structure is what allows the optimizer to construct the DISTINCT_SCAN plan that leverages the index. But it can only do so when the index can satisfy the requested sort, which is why we had to extend the sort to make it compound even though that does not logically impact the results. Hypothetically the second pipeline could do something similar, but as the explain plans show, this (potentially trickier) optimization is presently not implemented. So what we are doing with the first pipeline is helping guide the optimizer toward a more efficient plan by giving it a ‘simpler’ request to process.
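As a hypothetical illustration of that behavior, with made-up data, the { groupId: 1, date: -1 } index keys and the ones a DISTINCT_SCAN actually reads would look something like:
// Index keys in { groupId: 1, date: -1 } order:
// { groupId: "A", date: 2023-09-01 }  <- read: first (most recent) key for "A"
// { groupId: "A", date: 2023-08-17 }  <- skipped: seek ahead to the next groupId
// { groupId: "A", date: 2023-02-03 }  <- skipped
// { groupId: "B", date: 2023-07-22 }  <- read: first (most recent) key for "B"
// { groupId: "B", date: 2023-05-10 }  <- skipped
// ...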
Supporting Material
There are a couple of places in the official documentation that reference this optimization. One is here, which describes the approach overall.
Separately, the requirements for usage of this optimization are outlined under the "$group stage" portion of this section of the documentation, including that the $group stage only uses the $first or $last accumulator operator.

You can use this playground as a starting place to play around with this concept further.
Additional Consideration
In your description you mentioned "these are the fields related to this question." In there you included a field named "actionPerformed", but you never actually referenced it elsewhere in your question.

If you happen to be using that field in your aggregation logic as well, such as to filter out documents which do not match "someAction", then you may need to include that field in your index definition as well, as sketched below.
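A sketch of what that might look like (the exact filter and index definition here are assumptions based on the fields in the question):
// Hypothetical: filter on actionPerformed before sorting and grouping
db.foo.aggregate([
  { $match: { actionPerformed: "someAction" } },
  { $sort: { groupId: 1, date: -1 } },
  { $group: {
      _id: "$groupId",
      mostRecentDate: { $first: "$date" }
  } }
])

// An index whose prefix covers the equality filter plus the compound sort
db.foo.createIndex({ actionPerformed: 1, groupId: 1, date: -1 })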