skip to Main Content

I am using mongo db to store notifications, a cron job runs every midnight to pick old notification and delete them.

The deletion policy has a logic to retain the oldest notification of each combination of "sourceObjectIdentifier" and "notificationType". The aggregated output of query 1 is given to query 2 to form the deletion query.

query 1

db.persistentEvent.aggregate([
  {
    "$match": {
      "notificationClass": {"$eq": "category"}
    }
  },
  {
    "$group": {
      "_id": [
        {"sourceObjectIdentifier": "$sourceObjectIdentifier"},
        {"notificationType": "$notificationType"}
      ],
      "notificationId": {"$last": "$notificationId"},
      "notificationClass": {"$last": "$notificationClass"}
    }
  },
  {
    "$project": {
      "_id": 0,
      "notificationId": 1,
      "notificationClass": 1
    }
  },
  {
    "$sort": {
      "sourceObjectIdentifier": 1,
      "notificationType": 1,
      "creationTime": -1
    }
  }
])

query 2

{
  "deliveryTime": {
    "$gt": "$datefrom"
  },
  "creationTime": {
    "$lt": "dateto"
  },
  "notificationId": {
    "$nin": [query_1_output]
  }
}

This deletion logic works fine if the number of notifications in the collection is less but it starts to take a long time when the number of notifications are in range 200k or above.

Can someone please suggest how can I improve the query and can the query 1 and 2 be executed together?

2

Answers


  1. As written, query 1 will scan far more documents than is really necessary.
    Also, by the time the sort stage is encountered, the group stage and project stage will have removed all fields except notificationId and notificationClass. This means none of the fields in the sort object will exist, so that sort will accomplish nothing.

    To improve this query:

    • move the sort stage to before the group stage
    • create and index that includes the match field and the sort fields
    • if the 5 fields mention in this query do not constitute the vast majority of the size of each document, also add notificationId to the index.

    If the query executor is able to select and sort the documents based on the index, that will eliminate the need to load non-matching documents from storage, and will avoid an in-memory sort. Both of these will help performance.

    If the documents are significantly larger than the combination of notificationClass, notificationId, sourceObjectIdentifier, notificationType, and creationTime, creating an index containing those fields, following the ESR rule will permit query 1 to be satisfied using only the information included in the index, making it a covered query:

    {
       notificationClass: 1,
       sourceObjectIdentifier: 1,
       notificationType: 1,
       creationTime: -1,
       notificationId: 1
    }
    

    With the above index this pipeline should perform significantly better than the existing query as the collection grows:

    [
      {
        "$match": {
          "notificationClass": {"$eq": "category"}
        }
      },
      {
        "$sort": {
          "sourceObjectIdentifier": 1,
          "notificationType": 1,
          "creationTime": -1
        }
      },
      {
        "$group": {
          "_id": [
            {"sourceObjectIdentifier": "$sourceObjectIdentifier"},
            {"notificationType": "$notificationType"}
          ],
          "notificationId": {"$last": "$notificationId"},
          "notificationClass": {"$last": "$notificationClass"}
        }
      },
      {
        "$project": {
          "_id": 0,
          "notificationId": 1,
          "notificationClass": 1
        }
      }
    ]
    
    Login or Signup to reply.
  2. It is not clear how you define "newest notification".

    • The notification which was inserted most recently?
    • The notification with the greatest notificationId?
    • The notification with the newest creationTime?

    First ensure that you have an index on the notificationClass. A composed index on {sourceObjectIdentifier: 1, notificationType: 1} may also help.

    Use $setWindowFields to find the newest notifications (rank = 1), and delete all others. Below command considers "newest notification" as notification with the newest creationTime.

    var deleteDocs = [];
    db.persistentEvent.aggregate([
       { $match: { notificationClass: "category" } },
       {
          $setWindowFields: {
             partitionBy: {
                sourceObjectIdentifier: "sourceObjectIdentifier",
                notificationType: "$notificationType"
             },
             sortBy: { creationTime: -1 },
             output: {
                rank: { $rank: {} },
             }
          }
       },
       { $match: { rank: { $gt: 1 } } },
       { $project: { _id: 1 } }
    ]).toArray().forEach(doc => {
       deleteDocs.push(doc._id);
       if (deleteDocs.length > 10000) {
          db.collection.persistentEvent({ _id: { $in: deleteDocs } });
          deleteDocs = [];
       }
    })
    if (deleteDocs.length > 0)
       db.persistentEvent.deleteMany({ _id: { $in: deleteDocs } });
    
       
    

    Mongo Playground

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search