
I have all combinations of compound indexes for this collection. The aggregation query I used is:


db.products.aggregate( [
    {
      $facet: {
        "categorizedByColor": [
          { 
            $match: { 
              size: { $in : [50,60,70] },
              brand: { $in : ["Raymond","Allen Solly","Van Heusen"] } 
            } 
          },
          {
            $bucket: {
              groupBy: "$color",
              default: "Other",
              output: {
                "count": { $sum: 1 }
              }
            }
          }
        ],
        "categorizedBySize": [
          { 
            $match: { 
              color: { $in : ["Red","Green","Blue"] },
              brand: { $in : ["Raymond","Allen Solly","Van Heusen"] } 
            } 
          },
          {
            $bucket: {
              groupBy: "$size",
              default: "Other",
              output: {
                "count": { $sum: 1 }
              }
            }
          }
        ],
        "categorizedByBrand": [
          { 
            $match: { 
              color: { $in : ["Red","Green","Blue"] },
              size: { $in : [50,60,70] }
            } 
          },
          {
            $bucket: {
              groupBy: "$brand",
              default: "Other",
              output: {
                "count": { $sum: 1 }
              }
            }
          }
        ],
        "productResults": [
          { 
            $match: {
              color: { $in : ["Red","Green","Blue"] },
              size: { $in : [50,60,70] },
              brand: { $in : ["Raymond","Allen Solly","Van Heusen"] }
            } 
          }
        ]
      }
    }
  ]);

This query took around 6 seconds to return the results. Is there an alternative approach that would let MongoDB use the indexes?

Note: The actual aggregation query has more than 14 facet pipelines; for readability I have shown only 4 of them here.

2 Answers


  1. Facet and indexes

    The $facet stage (and the sub-pipelines inside it) cannot use indexes, so an aggregation that starts with $facet performs a COLLSCAN (full collection scan).
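
    One way to see this (a rough sketch; the exact explain output shape depends on the server version, and the trivial counts sub-pipeline is only a placeholder) is to look at the winning plan of an aggregation that starts with $facet:

        // With $facet as the first stage, the winning plan reports a COLLSCAN
        // rather than an IXSCAN, even though indexes exist on the collection.
        db.products.explain("executionStats").aggregate([
          { $facet: { counts: [ { $count: "n" } ] } }
        ])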

    Because of that, you should do your filtering (and sorting) much earlier in the pipeline, so that the "common data" for all of the $facet sub-pipelines is selected while the indexes can still be used.
    So, in your case, the filter:

            $match: {
              color: { $in : ["Red","Green","Blue"] },
              size: { $in : [50,60,70] },
              brand: { $in : ["Raymond","Allen Solly","Van Heusen"] }
            } 
    

    should be used as the first stage of the pipeline, followed by $facet.
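
    As a minimal sketch of that restructuring (the sub-pipelines here use $sortByCount only to keep the example short and runnable):

        db.products.aggregate([
          {
            // One shared $match as the very first stage, so the compound
            // indexes can be used before any document reaches $facet.
            $match: {
              color: { $in: ["Red", "Green", "Blue"] },
              size:  { $in: [50, 60, 70] },
              brand: { $in: ["Raymond", "Allen Solly", "Van Heusen"] }
            }
          },
          {
            // The sub-pipelines now only aggregate the pre-filtered documents.
            $facet: {
              categorizedByColor: [ { $sortByCount: "$color" } ],
              categorizedBySize:  [ { $sortByCount: "$size" } ],
              categorizedByBrand: [ { $sortByCount: "$brand" } ],
              productResults:     [ { $match: {} } ]
            }
          }
        ])

    Note that after this restructuring every sub-pipeline only sees documents that match all three filters.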

    Hope I was clear enough. 🙂

  2. Sometimes 14 queries can do the job and sometimes not.
    If the $facet is the first step in the aggregation pipeline, 14 separate queries are the more efficient option, but if this $facet follows a complex pipeline that creates or filters these documents, there are alternatives to the $facet's $match stages. Sometimes one needs a snapshot of the db, which 14 queries cannot give, since the db may change in between.

    Since we don't know what happens earlier in this pipeline, and the question asks for alternatives that allow the indexes to be used to make the rest of the query faster, I can offer one option as an example. It is hard to tell whether it will be faster than the other options given the data we have here, but it does allow the indexes to be used, which is the main idea of the question:

    1. The first stage follows both @Takis's and @Gibbs's smart suggestion: one up-front $match that can use the indexes.
    2. The second stage then makes each $facet's $match trivial, by marking in advance which document belongs to which $facet sub-pipeline.
    db.collection.aggregate([
      {
        // Stage 1: a single $match that can use the indexes and keeps every
        // document needed by at least one of the $facet sub-pipelines.
        $match: {
          $or: [
            {size:  {$in: [50, 60, 70]}},
            {color: {$in: ["Red", "Green", "Blue"]}},
            {brand: {$in: ["Raymond", "Allen Solly", "Van Heusen"]}}
          ]
        }
      },
      {
        // Stage 2: mark which $facet sub-pipeline(s) each document belongs to.
        $addFields: {
          categorizedByColor: {
            $cond: [{$and: [{$in: ["$size", [50, 60, 70]]},
                            {$in: ["$brand", ["Raymond", "Allen Solly", "Van Heusen"]]}]},
                    true, false]
          },
          categorizedBySize: {
            $cond: [{$and: [{$in: ["$color", ["Red", "Green", "Blue"]]},
                            {$in: ["$brand", ["Raymond", "Allen Solly", "Van Heusen"]]}]},
                    true, false]
          },
          categorizedByBrand: {
            $cond: [{$and: [{$in: ["$color", ["Red", "Green", "Blue"]]},
                            {$in: ["$size", [50, 60, 70]]}]},
                    true, false]
          },
          productResults: {
            $and: [{$in: ["$color", ["Red", "Green", "Blue"]]},
                   {$in: ["$size", [50, 60, 70]]},
                   {$in: ["$brand", ["Raymond", "Allen Solly", "Van Heusen"]]}]
          }
        }
      },
      {
        $facet: {
          "categorizedByColor": [
            {$match: {categorizedByColor: true}},
            {
              $bucket: {
                groupBy: "$color",
                default: "Other",
                output: {"count": {$sum: 1}}
              }
            }
          ],
          "categorizedBySize": [
            {$match: {categorizedBySize: true}},
            {
              $bucket: {
                groupBy: "$size",
                default: "Other",
                output: {"count": {$sum: 1}}
              }
            }
          ],
          "categorizedByBrand": [
            {$match: {categorizedByBrand: true}},
            {
              $bucket: {
                groupBy: "$brand",
                default: "Other",
                output: {"count": {$sum: 1}}
              }
            }
          ],
          "productResults": [{$match: {productResults: true}}]
        }
      }
    ])
    

    Playground example

    Going a step further, there is even a way to get these results in one query without the $facet step at all, by using $group with $push and $cond instead. This iterates over the documents once, instead of 14 times, but may produce a large result document (with duplicated data for each categorization). The main idea of such a solution can be seen on this MongoDB playground. It is important to say that these methods are not necessarily better or worse than the others. The "right" solution depends on your specific case and data, which we can't see here. You asked for alternative approaches that allow the indexes to be used, so I'm pointing out some directions.
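
    As a rough illustration of that idea (this is only a sketch that reuses the marker fields from the $addFields stage above, not the exact playground pipeline, and the values pushed per category are illustrative):

    db.collection.aggregate([
      // ... the same $match and $addFields stages as in the pipeline above ...
      {
        $group: {
          _id: null,
          // Conditionally collect each document into the array(s) it belongs to;
          // documents that do not belong get a null placeholder for now.
          categorizedByColor: {$push: {$cond: ["$categorizedByColor", "$color", null]}},
          categorizedBySize:  {$push: {$cond: ["$categorizedBySize", "$size", null]}},
          categorizedByBrand: {$push: {$cond: ["$categorizedByBrand", "$brand", null]}},
          productResults:     {$push: {$cond: ["$productResults", "$$ROOT", null]}}
        }
      },
      {
        // Drop the null placeholders from each array.
        $project: {
          categorizedByColor: {$filter: {input: "$categorizedByColor", cond: {$ne: ["$$this", null]}}},
          categorizedBySize:  {$filter: {input: "$categorizedBySize", cond: {$ne: ["$$this", null]}}},
          categorizedByBrand: {$filter: {input: "$categorizedByBrand", cond: {$ne: ["$$this", null]}}},
          productResults:     {$filter: {input: "$productResults", cond: {$ne: ["$$this", null]}}}
        }
      }
    ])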
