Let’s say I have a few million documents in a MongoDB collection that look like this:
[
{
"timestamp": "2024-01-01T00:00:00Z",
"schema": "1.0.0",
"value": 3,
},
{
"timestamp": "2024-01-01T01:00:00Z",
"schema": "1.2.0",
"value": -10,
},
...
]
Now, I want to do the following, using an aggregation pipeline:
- Group documents by day (this is no problem and can be solved like here: https://www.mongodb.com/community/forums/t/group-by-time-interval-and-offset-interval/202949)
- Within each bucket, find the absolute maximum for
value
(keeping the original sign), plus adding the original document’stimestamp
andschema
.
So, the desired output is something like:
[
// January bucket
{
"bucket": "2024-01-01T00:00:00Z",
"value": {
"timestamp": "2024-01-01T01:00:00Z",
"schema": "1.2.0",
"absMax": -10
}
}
]
Obviously, the default $max
accumulator does not work, as it has two problems:
- It can not keep the original sign when querying absolute maxes
- It does not include
timestamp
andschema
, but only outputs the numeric value
So, I tackled the problem trying two different ways:
- In my
$group
stage, I use$push
to push all raw documents into a$raw
document, which I then go through with$reduce
. I need the$raw
document to havetimestamp
andschema
always available. - In my
$group
stage, I use a custom accumulator function (see https://www.mongodb.com/docs/manual/reference/operator/aggregation/accumulator/) that reduces each document and keeps the originaltimestamp
andschema
next to the absolute maximum in its state.
Now, I encounter the following issues:
- Solution 1 runs into memory issues, as pushing millions of documents into RAM exceeds MongoDB’s hard limit of 100MB of RAM per aggregation step
- Solution 2 does look quite ugly in code and MongoDB advises against using custom accumulator functions written in JavaScript.
I updated my question with MongoDB playgrounds:
- Solution 1 (
$push
): https://mongoplayground.net/p/01e5Oa58VfV - Solution 2 (custom accumulator): https://mongoplayground.net/p/M5mUNR7jKgP
And for the sake of completeness using only $min
and $max
, but losing timestamp
and schema
: https://mongoplayground.net/p/UegNExWo2np
Solution 2 is about twice as fast as solution 1 on big data sets.
Am I overlooking something?
2
Answers
Now after some time and research, I want to follow up and post our solution here. We ended up using the built-in
$top
accumulator (docs):We used
sortBy
to sort by the absolute of the field's value, maintaining the original sign in theoutput
. For us, this gave us significant performance increase over the original ideas I posted.You can group easier with $dateTrunc. For min and max value you can use $bottom or $first
For the
minValue
use either$top
or change the sort order tosortBy: { absValue: -1 }
If you like to use
$first
and$last
then you need to sort the entire collection, i.e.But I think this will be slower.
The rest is just some cosmetic.