Here is the query:
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
I think I need an index, but I'm not sure how or where to add one.
2 Answers
The best thing to do is to limit the number of documents passed to each stage. Indexes are used by MongoDB in an aggregation only on the first stage, and only if it is a $match, using at most one index. So the best thing to do is to have a $match on an indexed field that is very restrictive.
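For example, a quick sketch of that pattern, assuming purely for illustration that positive had an indexed created_at field (neither the field nor the index is from the question):

// Hypothetical: a selective $match on an indexed field as the first stage.
await mongo.collection("positive").createIndex({ created_at: 1 });

const recent = await mongo
  .collection("positive")
  .aggregate([
    // With $match first and created_at indexed, this stage can use the index.
    { $match: { created_at: { $gt: prevSunday.toISOString() } } },
    { $limit: 50 },
  ])
  .toArray();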
Moreover, please note that $limit, $skip and $sample are not panaceas, because they still scan the entire collection. A way to efficiently limit the number of documents selected in the first stage is to use a "pagination". You can make it work like this:
Once every X requests:
- (re)compute the nbChunks _id boundaries of roughly equal chunks and store them under keys id:0 … id:nbChunks

Every request:
- read used and nbChunks
- fetch the boundaries id0 and id1 stored at id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
- start the pipeline with a $match with _id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }
- increment used; if used > X, then update the chunks

Further optimisation
If using redis, supplement every key with ${cluster.worker.id}: to avoid hot keys.

Notes
With the $match being the first stage and _id being an indexed field, this first stage is really fast and limits the scan to at most Y documents; the upper bound uses $lt rather than $lte, so consecutive chunks do not overlap. Hope it helps =) Make sure to ask more if needed =)
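A minimal sketch of that chunked pagination, with the boundaries and counter kept in process memory instead of redis (X, nbChunks, used, refreshChunks and chunkMatchStage are all illustrative names, not from the answer):

import { Collection, ObjectId } from "mongodb";

const X = 1000; // refresh the chunk boundaries every X requests (illustrative)
const nbChunks = 10; // number of _id chunks (illustrative)
let used = 0;
let boundaries: ObjectId[] = [];

// "Once every X requests": recompute nbChunks + 1 evenly spaced _id boundaries.
// The skip-based sampling is slow, but it is amortised over X requests.
async function refreshChunks(coll: Collection): Promise<void> {
  const total = await coll.estimatedDocumentCount();
  const step = Math.max(1, Math.floor(total / nbChunks));
  const next: ObjectId[] = [];
  for (let i = 0; i <= nbChunks; i++) {
    const doc = await coll
      .find({}, { projection: { _id: 1 } })
      .sort({ _id: 1 })
      .skip(Math.min(i * step, Math.max(total - 1, 0)))
      .limit(1)
      .next();
    if (doc) next.push(doc._id);
  }
  boundaries = next;
}

// "Every request": build the first $match stage from the current chunk,
// so the indexed _id range bounds how many documents are scanned.
function chunkMatchStage() {
  const i = used % nbChunks;
  used += 1;
  return { $match: { _id: { $gte: boundaries[i], $lt: boundaries[i + 1] } } };
}

Each request would then run something like mongo.collection("positive").aggregate([chunkMatchStage(), ...restOfPipeline]), and refreshChunks would be re-run whenever used exceeds X.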
Perhaps performance of this operation should be revisited after we confirm that it is satisfying the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.

To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
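For reference, a sketch of creating that index and capturing the explain output with the Node.js driver (pipeline stands for the aggregation array from the question):

// Create the suggested compound index on search_history.
await mongo
  .collection("search_history")
  .createIndex({ search_id: 1, created_at: 1, url: 1 });

// Run the same pipeline with the most verbose explain mode and share the output.
const plan = await mongo
  .collection("positive")
  .aggregate(pipeline)
  .explain("allPlansExecution");
console.log(JSON.stringify(plan, null, 2));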
Desired Logic

Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:

1. For every document in the positive collection, add a new field called history containing the distinct url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
2. Keep only the documents whose history field has at least one entry.
3. Group the results by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
4. Count the urls and return the top 50 results by word, sorted on that size in descending order.

Is this what you want? In particular, the following aspects may be worth confirming:
- Do you really need to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
- Is the handling of duplicate urls correct? It seems like you may need to use a $map when doing the $addToSet for the $group instead of using $reduce for the subsequent $project (see the sketch below).
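For illustration, here is one possible shape of that fix, assuming the intent is to count distinct urls per word. This sketch takes a slightly different route than the $map suggestion: it replaces the $reduce with an $unwind so that $addToSet sees scalar url values and de-duplicates them:

const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          { $match: { created_at: { $gt: prevSunday.toISOString() } } },
          { $group: { _id: "$url" } },
        ],
      },
    },
    // $unwind discards documents whose history array is empty (replacing the
    // { history: { $ne: [] } } match) and makes "$history._id" a scalar url.
    { $unwind: "$history" },
    {
      $group: {
        _id: "$word",
        // The values are now scalars, so $addToSet de-duplicates the urls.
        urls: { $addToSet: "$history._id" },
      },
    },
    { $project: { _id: 0, word: "$_id", count: { $size: "$urls" } } },
    { $sort: { count: -1 } },
    { $limit: 50 },
  ])
  .toArray();

Alternatively, keeping the original shape, swapping $concatArrays for $setUnion inside the $reduce would also de-duplicate the combined array before $size.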