Our DB has hundreds of thousands of rows of data where a particular entity amasses points in a certain entry based on certain constraints.
Let us call the entities A1
, A2
…A1000
. Let the constraints be X1
, X2
, X3
…Xn
. The constraints are "met" in the document by setting the value to 1 for that field, like say for one document, X1
field has value 1, X45
has 0, X49
has 1, etc.
Every document, chronicles, serially, the points which say the main entity in it (like A1
, A2
, etc.) accrued in that scenario under a given set of constraints – say A1
amassed 71 points under conditions of X1
, X87
and X90
– this is one document.
Similarly supposed second row says A2
amassed 92 points under X21
, X45
and X67
.
Now next row has A1
again, this time amassing 9 points, but under criteria X1
and X90
only.
Next row – A1
again, under X1
, X56
, X71
and X90
, say 45 points.
Next row has A4
, under X1
, X90
, X67
, amassing 345 points.
Next row has A4
, under X1
, X56
, X67
, amassing 12 points.
We have millions of rows like this – where the entity can amass any points under any combination of criteria, serially, each document representing one scenario.
Question – the queries run like “Who amassed 90 points the fastest under criteria X1
and X90
”. In this case, from our data above, none of the documents from entity A2
fits the criteria, A1
does – all three documents. A4
does in only 1 of the 2 documents, its second document has no X90
, only X1
, and hence it does not meet the criteria.
However, the cumulative numbers show that it took A1
three documents (meaning three entries) to amass 90 – the first three (totalling 125, exceeding 90). While, A4
, in the very first document, amasses 345, thus automatically crossing 90, and since it took 1 document, lesser than the 3 documents A1
took, we can safely say that A4
amassed 90 points the fastest.
Let us assume all documents have an insertion time entry which we can use to sort them during query, since this class of queries require them to be sorted on basis of insertion into the DB.
Now what kind of data modelling or indexing can enable us to calculate these kind of calculations super fast? We cannot precompute aggregates at every document level, since the number of combinations X1
, X2
…X100
can have are virtually infinite and it is not scaleable to compute totals under each combination and store them in the document, while the query can come with any combination of X1
…X100
conditions.
2
Answers
It’s difficult to recommend a data model without knowing the full range of queries that may be desired, but here’s a stab at the problem.
Documents could be something like this:
“Who amassed 90 points the fastest under criteria X1 and X90?” … could be an aggregation pipeline like this.
Try it on mongoplayground.net.
I think one option is to use the natural data model (for example):
With a query based on
$setWindowFields
:See how it works on the playground example