
Here is the query:

    const tags = await mongo
      .collection("positive")
      .aggregate<{ word: string; count: number }>([
        {
          // Join search_history docs sharing this doc's search_id, keeping
          // only those created after last Sunday, one entry per distinct url.
          $lookup: {
            from: "search_history",
            localField: "search_id",
            foreignField: "search_id",
            as: "history",
            pipeline: [
              {
                $match: {
                  created_at: { $gt: prevSunday.toISOString() },
                },
              },
              {
                $group: {
                  _id: "$url",
                },
              },
            ],
          },
        },
        {
          // Keep only docs with at least one matching history entry.
          $match: {
            history: { $ne: [] },
          },
        },
        {
          // Collect the per-document url arrays for each word.
          $group: {
            _id: "$word",
            url: {
              $addToSet: "$history._id",
            },
          },
        },
        {
          // Flatten the array of arrays and count the urls.
          $project: {
            _id: 0,
            word: "$_id",
            count: {
              $size: {
                $reduce: {
                  input: "$url",
                  initialValue: [],
                  in: {
                    $concatArrays: ["$$value", "$$this"],
                  },
                },
              },
            },
          },
        },
        {
          // Top 50 words by count.
          $sort: {
            count: -1,
          },
        },
        {
          $limit: 50,
        },
      ])
      .toArray();

I think I need an index, but I'm not sure how or where to add one.

2 Answers


  1. The best thing to do is to limit the number of documents passed to each stage.
    Mongo uses an index in an aggregation only for the first stage, only when that stage is a $match, and at most one index.

    So the best thing to do is to have a match on an indexed field that is very restrictive.
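    A minimal sketch of that shape, assuming (hypothetically) that positive documents carry an indexed created_at field; the field name and date value are placeholders, not something given in the question:

```typescript
// Hypothetical: assumes `positive` docs have a created_at field and that
// an index on { created_at: 1 } exists. Only a leading $match stage is
// eligible to use such an index.
const prevSunday = new Date(Date.UTC(2024, 0, 7)); // placeholder value

const pipeline: Record<string, unknown>[] = [
  // Restrictive, index-eligible first stage:
  { $match: { created_at: { $gt: prevSunday.toISOString() } } },
  // ...followed by the $lookup / $group / $project stages from the question.
];
```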

    Moreover, please note that $limit, $skip and $sample are not panaceas: they still scan the entire collection.

    A way to efficiently limit the number of documents selected in the first stage is to use range-based pagination over _id. You can make it work like this:

    Once every X requests

    1. Count the number of docs in the collection
    2. Divide this into chunks of at most Y docs each
    3. Find the _ids of the docs at positions Y, 2Y, 3Y, etc. with skip and limit
    4. Cache the results in redis/memcache (or in a global variable if you really cannot do otherwise)
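    The setup steps above can be sketched like this; the pure helper computes the skip offsets, while the driver calls (commented out, since they need a live connection) fetch the boundary _ids. The chunk size Y and the caching layer are up to you:

```typescript
// Pure helper for steps 1-3: given the collection size and a chunk size Y,
// return the skip offsets whose documents' _ids delimit the chunks.
function chunkOffsets(total: number, Y: number): number[] {
  const nbChunks = Math.ceil(total / Y); // step 2
  const offsets: number[] = [];
  for (let i = 1; i < nbChunks; i++) offsets.push(i * Y); // step 3
  return offsets;
}

// Driver-side sketch (names follow the question's `mongo` handle):
// const total = await mongo.collection("positive").countDocuments(); // step 1
// for (const [i, off] of chunkOffsets(total, Y).entries()) {
//   const [doc] = await mongo.collection("positive")
//     .find({}, { projection: { _id: 1 } })
//     .sort({ _id: 1 }).skip(off).limit(1).toArray();
//   // step 4: cache doc._id in redis under a key like id:${i + 1}
// }
```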

    Every request

    1. Get the current chunk to scan by reading the redis keys used and nbChunks
    2. Get the two _ids cached in redis that delimit the next aggregation, under id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
    3. Aggregate using a $match with _id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }
    4. Increment used; if used > X, then update the chunks
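    A pure sketch of the per-request bookkeeping (the redis reads/writes are elided; `ids` stands for the cached boundary _ids):

```typescript
// Steps 1-2: pick which cached boundary pair delimits this request's scan.
function currentChunk(used: number, nbChunks: number): { lo: number; hi: number } {
  const lo = used % nbChunks; // wraps around so every chunk gets scanned
  return { lo, hi: lo + 1 };
}

// Step 3, using the result (ids[] being the cached boundary _ids):
// { $match: { _id: { $gte: ids[lo], $lt: ids[hi] } } }
// (omit the $lt bound when scanning the last chunk)
// Step 4: increment `used` in redis; rebuild the chunks once used > X.
```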

    Further optimisation

    If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.

    Notes

    1. Step 3 of the chunk setup can be a really long and intensive process, so run it only when necessary, say every X ~ 1k requests.
    2. If you are scanning the last chunk, do not include the $lt bound.
    3. Once this process is implemented, your job is to find the sweet spot for X and Y that suits your needs: Y must be large enough to retrieve enough documents per request without taking too long, and X must keep the chunks roughly equal as the collection grows.
    4. This process takes a while to implement, but once it is in place, time complexity is ~O(Y) rather than ~O(N). Indeed, with the $match as the first stage and _id being an indexed field, that stage is really fast and limits the scan to at most Y documents.

    Hope it helps =) Make sure to ask more if needed =)

  2. Perhaps performance of this operation should be revisited only after we confirm that the aggregation satisfies the desired application logic and that the approach itself is reasonable.

    When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.

    To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
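    For reference, the suggested index expressed with the Node driver; the collection name comes from the question, and the createIndex call is commented out since it needs a live connection:

```typescript
// Compound index supporting the $lookup's equality match on search_id and
// the range filter on created_at, and including url so the inner pipeline
// may be covered by the index.
const historyLookupIndex = { search_id: 1, created_at: 1, url: 1 };

// await mongo.collection("search_history").createIndex(historyLookupIndex);
```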

    Desired Logic

    Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:

    • For every single document in the positive collection, add a new field called history.
    • This new field is a list of url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
    • The aggregation then filters to only keep documents where the new history field has at least one entry.
    • The next stage then groups the results together by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
    • The final 3 stages of the aggregation calculate the number of urls per word and return the top 50 results sorted by that count in descending order.

    Is this what you want? In particular the following aspects may be worth confirming:

    • Is it your intention to process every document in the positive collection? This may be the case, but it’s impossible to tell without any schema/use-case context.
    • Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet for the $group instead of using $reduce for the subsequent $project.
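    The counting concern in the last bullet can be mimicked in plain TypeScript; this only imitates the operators' behaviour on in-memory arrays, it is not how the server executes them:

```typescript
// Each inner array mimics one document's `history._id` list of urls.
const perDoc = [["a", "b"], ["b", "c"], ["a", "b"]];

// What $addToSet + $reduce/$concatArrays does: de-duplicate whole arrays,
// then flatten, so "b" is still counted twice.
const setOfArrays = Array.from(new Set(perDoc.map(a => JSON.stringify(a))))
  .map(s => JSON.parse(s) as string[]);
const flattenedCount = setOfArrays.flat().length; // counts "a","b","b","c"

// De-duplicating the individual urls instead gives a distinct-url count:
const uniqueCount = new Set(perDoc.flat()).size; // counts "a","b","c"
```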