skip to Main Content

Our DB has hundreds of thousands of rows of data where a particular entity amasses points in a certain entry based on certain constraints.

Let us call the entities A1, A2A1000. Let the constraints be X1, X2, X3Xn. The constraints are "met" in the document by setting the value to 1 for that field, like say for one document, X1 field has value 1, X45 has 0, X49 has 1, etc.

Every document, chronicles, serially, the points which say the main entity in it (like A1, A2, etc.) accrued in that scenario under a given set of constraints – say A1 amassed 71 points under conditions of X1, X87 and X90 – this is one document.

Similarly supposed second row says A2 amassed 92 points under X21, X45 and X67.

Now next row has A1 again, this time amassing 9 points, but under criteria X1 and X90 only.

Next row – A1 again, under X1, X56, X71 and X90, say 45 points.

Next row has A4, under X1, X90, X67, amassing 345 points.

Next row has A4, under X1, X56, X67, amassing 12 points.

We have millions of rows like this – where the entity can amass any points under any combination of criteria, serially, each document representing one scenario.

Question – the queries run like “Who amassed 90 points the fastest under criteria X1 and X90”. In this case, from our data above, none of the documents from entity A2 fits the criteria, A1 does – all three documents. A4 does in only 1 of the 2 documents, its second document has no X90, only X1, and hence it does not meet the criteria.

However, the cumulative numbers show that it took A1 three documents (meaning three entries) to amass 90 – the first three (totalling 125, exceeding 90). While, A4, in the very first document, amasses 345, thus automatically crossing 90, and since it took 1 document, lesser than the 3 documents A1 took, we can safely say that A4 amassed 90 points the fastest.

Let us assume all documents have an insertion time entry which we can use to sort them during query, since this class of queries require them to be sorted on basis of insertion into the DB.

Now what kind of data modelling or indexing can enable us to calculate these kind of calculations super fast? We cannot precompute aggregates at every document level, since the number of combinations X1, X2X100 can have are virtually infinite and it is not scaleable to compute totals under each combination and store them in the document, while the query can come with any combination of X1X100 conditions.

2

Answers


  1. It’s difficult to recommend a data model without knowing the full range of queries that may be desired, but here’s a stab at the problem.

    Documents could be something like this:

    {
      "_id": 1,
      "entity": "A1",
      "points": 71,
      "constraints": ["X1", "X87", "X90"],
      "timestamp": ISODate("2023-08-01T10:00:00.0Z")
    }
    

    “Who amassed 90 points the fastest under criteria X1 and X90?” … could be an aggregation pipeline like this.

    db.collection.aggregate([
      {
        "$match": {
          "constraints": {
            // only interested in these constraints
            "$all": ["X1", "X90"]
          }
        }
      },
      { // include pointsSum field in surviving docs
        // include first timestamp to calculate elapsed time
        "$setWindowFields": {
          "partitionBy": "$entity",
          "sortBy": {"timestamp": 1},
          "output": {
            "pointsSum": {
              "$sum": "$points",
              "window": {
                "documents": ["unbounded", "current"]
              }
            },
            "firstTime": {
              "$first": "$timestamp"
            }
          }
        }
      },
      { // only care about docs where pointsSum >= 90
        "$match": {
          "pointsSum": {"$gte": 90}
        }
      },
      { // calculate elapsed time
        "$set": {
          "elapsedTime": {
            "$subtract": ["$timestamp", "$firstTime"]
          }
        }
      },
      {
        // rank the docs by elapsed time
        "$setWindowFields": {
          "sortBy": {"elapsedTime": 1},
          "output": {
            "rank": {"$rank": {}}
          }
        }
      },
      { // only care about the fastest (could be ties)
        "$match": {
          "rank": 1
        }
      },
      { // output desired fields
        "$project": {
          "entity": 1,
          "timestamp": 1,
          "elapsedTime": 1
        }
      }
    ])
    

    Try it on mongoplayground.net.

    Login or Signup to reply.
  2. I think one option is to use the natural data model (for example):

     {
        "_id": 1,
        "entity": "A1",
        "points": 71,
        "X1": 1,
        "X87": 1,
        "X90": 1,
        "timestamp": ISODate("2023-08-01T10:00:00.0Z")
      },
    

    With a query based on $setWindowFields:

    db.collection.aggregate([
      {$setWindowFields: {
          partitionBy: "$entity",
          sortBy: {timestamp: 1},
          output: {
            count: {
              $sum: 1,
              window: {documents: ["unbounded", "current"]}
            },
            amass: {
              $sum: {
                $cond: [
                  {$and: [{$eq: ["$X1", 1]}, {$eq: ["$X90", 1]}]},
                  "$points",
                  0
                ]
              },
              window: {documents: ["unbounded", "current"]}
            }
          }
        }
      },
      {$match: {amass: {$gte: 90}}},
      {$group: {_id: "$entity", count: {$first: "$count"}}}
    ])
    

    See how it works on the playground example

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search