skip to Main Content

I have millions of docs containing start and end times of something they’re measuring.

[
   {
    "start" : ISODate("2024-01-01T00:00:00"),
    "end" :   ISODate("2024-01-01T01:00:00")
   },
   {
    "start" : ISODate("2024-01-01T01:00:00"),
    "end" :   ISODate("2024-01-01T02:00:00")
   },
   {
    "start" : ISODate("2024-01-01T03:00:00"),
    "end" :   ISODate("2024-01-01T04:00:00")
   },
]

I would like a way of generating a result set containing only the non-contiguous blocks. For this example, that would look like:

[
   {
    "start" : ISODate("2024-01-01T00:00:00"),
    "end" :   ISODate("2024-01-01T02:00:00")
   },
   {
    "start" : ISODate("2024-01-01T03:00:00"),
    "end" :   ISODate("2024-01-01T04:00:00")
   },
]

I’m experienced with using aggregate and $group, but I can’t think of a way to acheive what I need.

Any suggestions would be much appreciated.

2

Answers


  1. You can achieve the expected behaviour through 2 $setWindowFields. For the first $setWindowFields, you can find the $max end field for each document in the document range of ["unbounded", -1] that is sorted by {start: 1}. i.e. For a current document, you will search in the document range that has a start smaller than the current document and get the max end. We call this value prevMax.

    Then, by comparing the prevMax with current document’s start. You can determine if it should be start of a group. We call this boolean isStart.

    After that, by using another $setWindowFields, you can conditionally find the grouping of all documents through isStart and in the document range of [current, unbounded]

    Finally, using the grouping field to find $max end inside the grouping.

    db.collection.aggregate([
      {
        "$setWindowFields": {
          "sortBy": {
            "start": 1
          },
          "output": {
            "prevMax": {
              "$max": "$end",
              "window": {
                "documents": [
                  "unbounded",
                  -1
                ]
              }
            }
          }
        }
      },
      {
        $set: {
          isStart: {
            $lt: [
              "$prevMax",
              "$start"
            ]
          }
        }
      },
      {
        "$setWindowFields": {
          "sortBy": {
            "start": 1
          },
          "output": {
            "grouping": {
              "$max": {
                "$cond": {
                  "if": {
                    "$eq": [
                      true,
                      "$isStart"
                    ]
                  },
                  "then": "$start",
                  "else": -1
                }
              },
              "window": {
                "documents": [
                  "unbounded",
                  "current"
                ]
              }
            }
          }
        }
      },
      {
        "$group": {
          _id: "$grouping",
          end: {
            $max: "$end"
          }
        }
      },
      {
        "$project": {
          _id: 0,
          start: "$_id",
          end: 1
        }
      }
    ])
    

    Mongo Playground

    Login or Signup to reply.
  2. Following @ray, if we can assume there are no overlaps in the documents, there is an option to look only at the previous document and the next document for each document, not scanning all previous document for each document (twice). This can also help with matching, which can reduce the number of documents we need to $set:

    db.collection.aggregate([
      {$setWindowFields: {
          sortBy: {start 1},
          output: {
            prevMax: {$push: "$end", window: {documents: [-1, 0]}},
            nextMin: {$push: "$start", window: {documents: [0, 1]}}
          }
      }},
      {$match: {$or: [
            {$expr: {$lt: [{$first: "$prevMax"}, "$start"]}},
            {$expr: {$gt: [{$last: "$nextMin"}, "$end"]}},
            {"prevMax.1": {$exists: false}},
            {"nextMin.1": {$exists: false}}
      ]}},
      {$set: {
          isStart: {$or: [
              {$lt: [{$first: "$prevMax"}, "$start"]},
              {$eq: [{$size: "$prevMax"}, 1]}
          ]}
      }},
      {$setWindowFields: {
          sortBy: {start: 1},
          output: {grouping: {
              $max: {$cond: [{$eq: [true, "$isStart"]}, "$start", -1}},
              window: {documents: [-1, 0]}
          }}
      }},
      {$group: {_id: "$grouping", end: {$max: "$end"}}},
      {$project: {_id: 0, start: "$_id", end: 1}}
    ])
    

    See how it works on the playground example

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search