I have a bunch of event data which is identified by uid and timestamp. I have a number of aggregation pipelines which will calculate statistics for each user. These work if I run them one at a time, but I can't figure out how to run them within a $group.
So this code works:
db.Events.aggregate([
{"$match":{"uid":"S001"}},
{"$facet":{
"var1":[pipeline1...],
"var2":[pipeline2...]
}}
]);
However, I want to loop over the users, so I want something more like:
db.Events.aggregate([
{ "$group": {
"_id": {"uid":"$uid"},
"timestamp":{"$last":"$timestamp"},
"data": { "$facet":{
"var1":[pipeline1...],
"var2":[pipeline2...]
}}
}}
])
But I get an error: unknown group operator '$facet'.
What I’m looking for is output that looks like:
{
uid:"S001",
timestamp:"time 1",
data: { ...}
},
{
uid:"S002",
timestamp:"time 2",
data: { ...}
},
...
How can I run the $facet inside of the $group?
Update: Here is a more complete example.
First here is some input:
{
"uid": 1,
"timestamp": 1000000000,
"verb": "initialized",
"object": "game level",
"data": {}
},
{
"uid": 1,
"timestamp": 1000000015,
"verb": "initialized",
"object": "game level",
"data": {}
},
{
"uid": 1,
"timestamp": 1000000031,
"verb": "passed",
"object": "game level",
"data": {}
},
{
"uid": 2,
"timestamp": 1000000000,
"verb": "initialized",
"object": "game level",
"data": {}
},
{
"uid": 2,
"timestamp": 1000000024,
"verb": "passed",
"object": "game level",
"data": {}
}
Here is my target pipeline:
db.collection.aggregate([
{ "$group": {
"_id": {"uid":"$uid"},
"uid": {"$first":"$uid"},
"timestamp":{"$last":"$timestamp"},
"data": { "$facet":{
"NumberAttempts": [attempt_filter, attempt_map, attempt_sum],
"time": [leveltime_filter, leveltime_map, leveltime_reduce]
}}
}}
]);
where the filter stages are calls to $match, the map stages are calls to $project, and the reduce stages are calls to $group to accumulate over the matched events. The complete definitions for the stages in the pipeline can be found at:
https://mongoplayground.net/p/Cae0DkO8o0g
(Don't get too hung up on the exact definitions; my real example has several more pipelines.)
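For concreteness, one such (filter, map, reduce) triple might look like the following. These bodies are illustrative guesses only; the real definitions are at the mongoplayground link above:

```javascript
// Illustrative guesses at one (filter, map, reduce) triple; the real
// definitions live in the mongoplayground link in the question.
const attempt_filter = { $match: { verb: "initialized", object: "game level" } };
const attempt_map    = { $project: { uid: 1, timestamp: 1 } };
const attempt_sum    = { $group: { _id: "$uid", NumberAttempts: { $sum: 1 } } };

// Chained in order, these form one facet branch:
const attemptsPipeline = [attempt_filter, attempt_map, attempt_sum];
```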
My desired output would be something like:
{
"data": {
"NumberAttempts": 2,
"start_time": 1e+09,
"total_time": 31
},
"timestamp": 1.000000031e+09,
"uid": 1
},
{
"data": {
"NumberAttempts": 1,
"start_time": 1e+09,
"total_time": 24
},
"timestamp": 1.000000024e+09,
"uid": 2
}
Where I have one record for each uid which contains summary statistics. I can do this for one user at a time if I first match on the uid and then run the facet operation. [This is not quite working in the mongoplayground link. The problem is that I need a custom aggregator for total_time and I can't quite figure out how to get it to work in the playground.] What I want to do is iterate this code over users, which seems to me would call for the $group or $bucket operator.
I am not looking for a clever way to rewrite these particular pipelines using group accumulator functions. I would like my SMEs to be able to specify which events they are interested in, what fields in those events are relevant and select an accumulator, thus building a custom map-reduce chain.
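That kind of SME-specified chain could be assembled mechanically. A hypothetical sketch (buildFacet and every stage body below are invented for illustration, not part of the question's real pipelines):

```javascript
// Hypothetical helper: assemble named (filter, map, reduce) stage
// triples, as an SME might specify them, into a single $facet stage.
function buildFacet(specs) {
  const facet = {};
  for (const [name, { filter, map, reduce }] of Object.entries(specs)) {
    facet[name] = [filter, map, reduce];
  }
  return { $facet: facet };
}

// Example with two invented statistics:
const stage = buildFacet({
  NumberAttempts: {
    filter: { $match: { verb: "initialized" } },
    map: { $project: { uid: 1 } },
    reduce: { $group: { _id: "$uid", NumberAttempts: { $sum: 1 } } },
  },
  time: {
    filter: { $match: { object: "game level" } },
    map: { $project: { uid: 1, timestamp: 1 } },
    reduce: { $group: { _id: "$uid", last: { $last: "$timestamp" } } },
  },
});
```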
2 Answers
It seems like the most straightforward way is to use a loop in the language I'm using to call Mongo.
If I use JavaScript in the Mongo shell, I would get:
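The shell snippet itself appears to have been lost in this copy. A minimal sketch of what such a loop might look like, assuming the Events collection and uid field from the question (pipeline1/pipeline2 stand in for the real stage lists):

```javascript
// Hypothetical sketch: iterate over users client-side, running the
// per-user $facet pipeline once per uid. `db` is the shell's database
// handle; pipeline1/pipeline2 stand in for the real stage lists.
function perUserFacet(db, pipeline1, pipeline2) {
  const results = [];
  db.Events.distinct("uid").forEach(uid => {
    const [data] = db.Events.aggregate([
      { $match: { uid: uid } },
      { $facet: { var1: pipeline1, var2: pipeline2 } },
    ]).toArray();
    results.push({ uid: uid, data: data });
  });
  return results;
}
```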
In my case, it might work a bit better, as data from different users will be complete or incomplete at any given time, so I can run the pipeline for each user once that user's data is complete.
Not sure how much data we are talking about but if it is not millions of rows, you can always make an array and then easily iterate on it:
I just hacked up the functions to calc NumberAttempts and total_time. It is also possible to do this with one reduce loop, at the expense of the simplicity of working with scalars vs. objects in reduce:
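As a client-side illustration of that reduce approach (not the answer's original code), here is one pass over the question's sample events. The stat definitions are assumptions chosen to reproduce the desired output in the question: NumberAttempts counts "initialized" events, and total_time is the last timestamp minus the first.

```javascript
// Sample events shaped like the question's input (trimmed to the
// fields these stats use).
const events = [
  { uid: 1, timestamp: 1000000000, verb: "initialized" },
  { uid: 1, timestamp: 1000000015, verb: "initialized" },
  { uid: 1, timestamp: 1000000031, verb: "passed" },
  { uid: 2, timestamp: 1000000000, verb: "initialized" },
  { uid: 2, timestamp: 1000000024, verb: "passed" },
];

// One reduce pass: accumulate per-uid state in an object keyed by uid.
const byUid = events.reduce((acc, ev) => {
  const s = acc[ev.uid] ?? (acc[ev.uid] = {
    uid: ev.uid,
    start_time: ev.timestamp, // first event seen for this uid
    timestamp: ev.timestamp,  // last event seen (input is time-ordered)
    NumberAttempts: 0,
  });
  s.timestamp = ev.timestamp;
  if (ev.verb === "initialized") s.NumberAttempts += 1;
  return acc;
}, {});

// Shape the accumulator into the question's desired output records.
const results = Object.values(byUid).map(s => ({
  uid: s.uid,
  timestamp: s.timestamp,
  data: {
    NumberAttempts: s.NumberAttempts,
    start_time: s.start_time,
    total_time: s.timestamp - s.start_time,
  },
}));
// results: uid 1 -> NumberAttempts 2, total_time 31;
//          uid 2 -> NumberAttempts 1, total_time 24
```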
NOTE: If each uid only has a few entries, you might want to consider eliminating the grouping on the server side and letting the client handle it. It is the same number of records examined on the server with significantly less compute pressure, at the expense of flowing X times more data over the wire.

UPDATED: Below is a semi-solution using $facet. It lacks some of the specific logic around total_time, but I didn't quite understand the safe_sum_acc and timecode bits. The key thing is to merge the two facet outputs, then regroup on the _id:
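A sketch of that merge-then-regroup shape, assuming each facet branch does its own $group on uid. The branch bodies below are placeholders, not the real attempt/leveltime stages:

```javascript
// Each $facet branch groups by uid itself; the branch outputs are then
// concatenated, unwound, and regrouped on _id to stitch each user's
// partial documents back together.
const pipeline = [
  { $facet: {
      attempts: [
        // attempt_filter / attempt_map would go here
        { $group: { _id: "$uid", NumberAttempts: { $sum: 1 } } },
      ],
      time: [
        // leveltime_filter / leveltime_map would go here
        { $group: { _id: "$uid",
                    start_time: { $first: "$timestamp" },
                    timestamp: { $last: "$timestamp" } } },
      ],
  }},
  // Merge the two facet outputs into one stream of partial docs...
  { $project: { all: { $concatArrays: ["$attempts", "$time"] } } },
  { $unwind: "$all" },
  // ...then regroup on the shared _id (the uid), folding each user's
  // pieces into a single document.
  { $group: { _id: "$all._id", merged: { $mergeObjects: "$all" } } },
];
```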