
Let’s say I have a product catalog index like below, where I have a list of products that have an array of individual sku child objects. I want to be able to perform a search that returns the matching product documents, but also indicate the relevancy of the child sku elements (or sort them, or something).

{
  "productId": "1",
  "name": "Cool Shirt",
  "type": "t-shirt",
  "skus": [
    {
      "skuNumber": "1-a",
      "color": "green",
      "image": "..."
    },
    {
      "skuNumber": "1-b",
      "color": "red",
      "image": "..."
    }
  ]
},
{
    ...additional documents
}

A search for red t-shirt should return this document, but I’d like to know that the second sku (color:red) was more relevant than the first sku – maybe by having a relevancy score applied to these child objects, or having Azure sort them accordingly. The goal is to be able to present a search result to a user as a product tile that highlights the most relevant child sku – in this case by displaying this "Cool Shirt" product with the red shirt sku’s image.

Real world example of this in practice:

Search https://www.amazon.com/s?k=Hanes+Unisex+T-Shirt+red and the top result is the red "sku" of the product; search https://www.amazon.com/s?k=Hanes+Unisex+T-Shirt+green and you’ll see the green "sku".

Are there any techniques to accomplish this with Azure Cognitive Search?

The investigation my team has done so far has not yielded good results. We’re migrating from a Solr search implementation where this is accomplished a bit differently – by indexing the individual skus and then grouping them by a parent id. Newer versions of Solr suggest this approach https://solr.apache.org/guide/6_6/collapse-and-expand-results.html. My understanding is that Azure search does not support these capabilities.

Our workaround

The most promising option we’ve come up with is to have two indexes. One of the products (same as above) and another of just the skus, like so:

{
  "productId": "1",
  "skuNumber": "1-a",
  "color": "green",
  "image": "..."
},
{
  "productId": "1",
  "skuNumber": "1-b",
  "color": "red",
  "image": "..."
}

We’d first perform a search to get a list of relevant products, and then follow up with an identical search against the sku index, filtered to skus whose parent product id came back in the first result: red t-shirt with $filter productId eq '1' ...etc for all product ids returned by the first search. The relevancy score of this second search would then allow us to rank the child skus as I am describing. But this seems far from an ideal solution. Are there any other options?

Notes

Please note:

  • I’m willing to restructure our Index(s) in any way feasible
  • There will be dozens of additional fields at the sku level beyond just "color"
  • We don’t want less/non-relevant skus to be completely filtered out; for red t-shirt we still want to display a product tile that indicates there’s a green version too, for instance
  • Relevancy of skus would need to work for filtering and faceting, in addition to text search. E.g. red t-shirt, filter=inStock, facet=price[$5-$10] would need to surface the sku that most closely matches these criteria
  • We’ll be using traditional paging of results (as opposed to infinite-scroll)

2 Answers


  1. Chosen as BEST ANSWER

    Dan Gøran Lunde's answer is worth careful consideration, especially if implementing an "infinite scroll" type search result. However, if one needs to implement traditional pagination, I don't find the solution satisfactory. Frankly, what this really means is Azure Cognitive Search isn't a satisfactory platform for search if one needs grouping/collapsing.

    In any case, I'm stuck building a solution for this with Azure search, so I wanted to share my planned approach. This isn't production battle-tested, but it is so far working in development.

    Approach

    We have two different indexes. First, the product index, which contains the set of grouped skus that comprise each product, like so:

    {
      "productId": "1",
      "name": "Cool Shirt",
      "skus": [
        {
          "productId": "1",
          "skuNumber": "1-a",
          "color": "green",
          "image": "...",
          ...all other sku data
        },
        {
          "productId": "1",
          "skuNumber": "1-b",
          "color": "red",
          "image": "...",
          ...all other sku data
        }
      ]
    }, {product2...}, {product3...}, etc
    

    Then there's a sku index, which is a flattened list of all skus:

    {
      "productId": "1",
      "skuNumber": "1-a",
      "color": "green",
      "image": "...",
      ...all other sku data
    },
    {
      "productId": "1",
      "skuNumber": "1-b",
      "color": "red",
      "image": "...",
      ...all other sku data
    },
    {
      "productId": "2",
      "skuNumber": "2-x"
      ...etc
    }, etc
    

    The Sku objects would be identical across both indexes, loaded at the same time, etc.

    Performing a Search

    To perform a search, a query is issued to the first (product) index. All filters/facets/text queries are performed on the Skus collection. If any sku meets the criteria, then the entire product is returned. These are the products presented to the user, so result counts & pagination for this index match exactly how pagination is executed in the UI.

    What we don't know from this first query is which sku among each product is the most relevant. All we know is at least one sku for each product met the search criteria. So, next we perform a functionally identical search on the second (sku) index, with an added filter to only match skus with a productId from the first result. Take the result of this, and grab the top sku within each productId and we've found the most relevant sku for each product. Combine the result of the first query with this info and we've got a result of products and the primary sku within each that we want to display.
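    The merge step described above can be sketched as follows. This is only an illustration, not the code we run: the function name, the `primarySku` key, and the fallback to the first sku in the product's own list are assumptions, and `sku_hits` is assumed to arrive in descending relevance order from the second query.

```python
from typing import Any


def pick_primary_skus(
    products: list[dict[str, Any]],
    sku_hits: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Attach the best-ranked sku to each product from the first query.

    sku_hits must be ordered by descending relevance (assumption: this is
    the order in which the second query returns them).
    """
    best: dict[str, dict[str, Any]] = {}
    for hit in sku_hits:
        # setdefault keeps the first (most relevant) hit per productId
        best.setdefault(hit["productId"], hit)

    merged = []
    for product in products:
        primary = best.get(product["productId"])
        if primary is None and product.get("skus"):
            # Assumed fallback: the sku query missed this product,
            # so show the first sku rather than nothing.
            primary = product["skus"][0]
        merged.append({**product, "primarySku": primary})
    return merged


products = [
    {
        "productId": "1",
        "name": "Cool Shirt",
        "skus": [
            {"productId": "1", "skuNumber": "1-a", "color": "green"},
            {"productId": "1", "skuNumber": "1-b", "color": "red"},
        ],
    }
]
# Simulated result of the second query for "red t-shirt"
sku_hits = [
    {"productId": "1", "skuNumber": "1-b", "color": "red"},
    {"productId": "1", "skuNumber": "1-a", "color": "green"},
]
tiles = pick_primary_skus(products, sku_hits)
```

    The product order from the first query is preserved, so pagination is unaffected; only the highlighted sku per tile comes from the second query.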

    Pitfalls

    Aside from having to execute two queries for each search, I see the following pitfalls:

    1. Consistency issues between 2 different indexes. I'm confident our processes to index the data will ensure integrity between both indexes. Could Azure's infrastructure (different replica sets, for example) introduce unexpected inconsistencies? I don't have the expertise to quite understand that. Worst case, the second query would fail to identify the correct most relevant sku. All that would mean is that a product result might not be able to highlight the best matching sku. I can live with that.

    2. Query syntax is different for each index. For the first query, everything would have to be scoped to the Sku collection level, but for the second query, everything would be top-level field queries. Thus, we'd have to ensure we generate different query parameters depending on which index is being queried.
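    To illustrate the difference, here is a rough sketch of generating both filter strings from one set of criteria. The helper names and the `inStock` field are assumptions for illustration; the `skus/any(s: ...)` lambda form is the OData collection filter syntax Azure Cognitive Search uses for complex collections. Keeping all criteria inside a single `any` means one sku has to satisfy all of them.

```python
def odata_literal(value) -> str:
    """Render a Python value as an OData literal (strings, bools, numbers only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Single quotes inside OData string literals are escaped by doubling them
    return "'" + str(value).replace("'", "''") + "'"


def sku_index_filter(criteria: dict) -> str:
    """Filter for the flat sku index: fields are top-level."""
    return " and ".join(
        f"{field} eq {odata_literal(value)}" for field, value in criteria.items()
    )


def product_index_filter(criteria: dict, collection: str = "skus") -> str:
    """Filter for the product index: fields are scoped to the sku collection."""
    if not criteria:
        return ""
    inner = " and ".join(
        f"s/{field} eq {odata_literal(value)}" for field, value in criteria.items()
    )
    return f"{collection}/any(s: {inner})"


criteria = {"color": "red", "inStock": True}  # hypothetical field names
flat = sku_index_filter(criteria)
nested = product_index_filter(criteria)
```

    Here `flat` is `color eq 'red' and inStock eq true` and `nested` is `skus/any(s: s/color eq 'red' and s/inStock eq true)`.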

    3. Performance? This is laughable if we're already resigned to perform 2 queries for every search, but there's a theoretical performance hit I'd imagine when searching the first index. There, we're searching on fields within a collection (i.e. Skus/color) instead of top-level fields on the document (as would be the case in Dan's solution where you perform the queries on a single Skus index). Initial testing with our data sets indicates this has a negligible impact, so I don't personally consider this a problem for my use-case.

    I would appreciate any additional feedback if you have any concerns with this approach. For now, this seems to be the most viable solution to the problem for us.


  2. Showing multiple product variants in search results is a typical e-commerce requirement. We have solved this with Azure Search, without using collapsing or grouping. The search engine we migrated from supported collapsing, making it easy to boost the most relevant SKU to the top while presenting a tail of related SKUs.

    See this related post: How to get only one item from each category in azure cognitive search?

    I’ll try to explain in more detail how to solve this use case with Azure Search. The constraints you list are great pointers. It’s good to know that you still have the option to restructure your index to solve this use case.

    SUGGESTED SOLUTION #1 (INFINITE SCROLL)

    • Store each SKU as a separate item in the index, without child items.
    • Tag each item with an ID for grouping
    • The grouping ID should be refinable
    • You are not limiting the grouping to color or any specific property. The grouping ID is an independent property for grouping products.

    Submit your query as normal. Including any free text queries, boosting, filtering, or sorting options you want. This will work as expected. Make sure you include your grouping property as a refiner.

    Then traverse your results going through the items one by one. Keep the first item for each group. Skip any subsequent items from a group you have already seen.

    Now you can choose if you want to only present the head of each group. E.g. you only present the red t-shirt from your example. The grouping refiner will contain the exact SKU count for your query. You can also produce a link that filters by the item’s group ID to list all variants.

    • This solution ensures you only show the most relevant SKU. I.e. you have filtered by red variants by having the word red in your query.
    • This would also work if you had applied a filter to only show shirts in size XL. The red t-shirts unavailable in size:XL would then disappear.
    • If you also want black t-shirts to appear in your free text query for red t-shirts, you would need to process your items before indexing to contain a description of the available variants. Use a searchable text property like "these items also come in other variants like black, blue, green, …"
    {
        "value": [
            {
                "id": "1",
                "sku": "9001234",
                "title": "Hayne's Unisex T-Shirt",
                "group": "HAY2022",
                "color": "green",
                "variants": "available in green, black, red and blue"
            },
            {
                "id": "2",
                "sku": "9005678",
                "title": "Hayne's Unisex T-Shirt",
                "group": "HAY2022",
                "color": "red",
                "variants": "available in green, black, red and blue"
            },
            {
                "id": "3",
                "sku": "8001234",
                "title": "Levi's T-Shirt",
                "group": "LEV2022",
                "color": "red",
                "variants": "available in black and red"
            }
        ]
    }
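    The "keep the first item for each group" traversal can be sketched like this. It is a minimal illustration rather than our production code: the function name is made up, and the hit order below is an assumed relevance ranking for a query like red t-shirt, using the sample items above.

```python
def collapse_by_group(items: list[dict], group_field: str = "group") -> list[dict]:
    """Keep only the first (highest-ranked) item of each group.

    items must already be in the order returned by the search service.
    """
    seen: set[str] = set()
    heads: list[dict] = []
    for item in items:
        key = item[group_field]
        if key in seen:
            continue  # a better-ranked item from this group was already kept
        seen.add(key)
        heads.append(item)
    return heads


# Assumed ranking for "red t-shirt": the red items score above the green one
hits = [
    {"id": "2", "sku": "9005678", "group": "HAY2022", "color": "red"},
    {"id": "3", "sku": "8001234", "group": "LEV2022", "color": "red"},
    {"id": "1", "sku": "9001234", "group": "HAY2022", "color": "green"},
]
heads = collapse_by_group(hits)
```

    With this input, `heads` contains the red Hayne's item and the red Levi's item; the green Hayne's item is skipped because its group was already seen.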

    It’s worth noting that you may have to request a larger number of results than you actually present. For example, if your goal is to present 10 items on a page you may have a scenario where the first item has 20 variants. You would then only present/keep the head entry.

    Therefore, you have to request a larger result set. It will have a slight impact on your performance, but we have found it to be negligible for end users. We have used this solution in production for a few years now, and it works well. It resolves all the points you have mentioned.
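    One way to handle the over-fetching is to keep requesting batches until enough group heads are collected. The sketch below assumes a hypothetical `fetch(skip, top)` callable that wraps your search request (mapping to the $skip and $top parameters); the batch size and names are illustrative only.

```python
from typing import Callable


def collect_heads(
    fetch: Callable[[int, int], list[dict]],
    page_size: int,
    batch_size: int = 50,
    group_field: str = "group",
) -> list[dict]:
    """Fetch ranked results in batches until page_size group heads are found."""
    seen: set[str] = set()
    heads: list[dict] = []
    skip = 0
    while len(heads) < page_size:
        batch = fetch(skip, batch_size)
        if not batch:
            break  # no more results in the index for this query
        for item in batch:
            if item[group_field] not in seen:
                seen.add(item[group_field])
                heads.append(item)
                if len(heads) == page_size:
                    break
        skip += len(batch)
    return heads


# In-memory stand-in for the search service: group A has 20 variants
ranked = [{"id": str(i), "group": "A"} for i in range(20)]
ranked += [{"id": "20", "group": "B"}, {"id": "21", "group": "C"}]


def fake_fetch(skip: int, top: int) -> list[dict]:
    return ranked[skip:skip + top]


page = collect_heads(fake_fetch, page_size=2, batch_size=10)
```

    In this example the second batch contributes nothing because it consists entirely of group A variants, which is the situation described above.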

    SUGGESTED SOLUTION #2

    Updated with the new constraints to not use infinite scroll. Your Amazon examples for red or green t-shirts only show the corresponding colors. This would indicate that each SKU is stored as an individual item in the index, containing only information about the SKU without information about the variants.

    In your case, you also want the variants not matching the original query to be included. When the end user query is ‘red t-shirt’, you want to show red t-shirts as the top results (if there are any matches). However, you also want to include green t-shirts, if there are any variants containing the token ‘green’.

    • Store each SKU as a separate item in the index, without child items.
    • Each item should only have keywords relevant for that SKU. I.e. red t-shirts do not have a searchable token containing green if there is a green version.
    • Tag each item with an ID for grouping
    • The grouping ID should be refinable
    • You are not limiting the grouping to color or any specific property. The grouping ID is an independent property for grouping products.

    Query: Generate a query with the free text input from the end user. Apply any filtering, boosting, or sorting rules to the query.

    To present results you have two options. Both require two queries.

    1. Present results in order. Traverse the presented results and collect the grouping ID from each result. Submit a secondary query without the end user free text, using a $filter with search.in(). E.g. search=*&$filter=search.in(groupid, 'groupA,groupC,groupX', ','). Then either append the results from the secondary query as separate tiles, or render them as variants for your existing tiles.

    2. Submit the first query in your backend only. Then collect the group IDs returned by the group ID refiner and submit a secondary query as an OR-query that combines your original query with a filter on those group IDs. E.g. (original query) OR (group ID filter). This will give you a result containing both your red t-shirts at the top AND the variants from the matching groups with other colors further down.
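    For option 1, building the secondary filter and hanging the variants on the existing tiles might look roughly like this. The function names and the `otherSkus` key are assumptions, the `groupid` field name follows the filter example above, and the filter builder assumes group IDs contain no commas or single quotes.

```python
def group_filter(group_ids: list[str], field: str = "groupid") -> str:
    """Build a search.in() filter for the secondary query.

    Assumes the IDs contain no commas or single quotes.
    """
    unique = list(dict.fromkeys(group_ids))  # de-duplicate, keep order
    return f"search.in({field}, '{','.join(unique)}', ',')"


def attach_variants(
    tiles: list[dict],
    variants: list[dict],
    group_field: str = "groupid",
    key_field: str = "id",
) -> list[dict]:
    """Add the other skus of each tile's group under an 'otherSkus' key."""
    by_group: dict[str, list[dict]] = {}
    for variant in variants:
        by_group.setdefault(variant[group_field], []).append(variant)
    return [
        {
            **tile,
            "otherSkus": [
                v for v in by_group.get(tile[group_field], [])
                if v[key_field] != tile[key_field]  # skip the tile's own sku
            ],
        }
        for tile in tiles
    ]


# Tiles from the primary query (already one per group)
tiles = [
    {"id": "2", "groupid": "HAY2022", "color": "red"},
    {"id": "3", "groupid": "LEV2022", "color": "red"},
]
secondary_filter = group_filter([t["groupid"] for t in tiles])

# Simulated response of the secondary query (search=* with the filter above)
variants = [
    {"id": "1", "groupid": "HAY2022", "color": "green"},
    {"id": "2", "groupid": "HAY2022", "color": "red"},
    {"id": "3", "groupid": "LEV2022", "color": "red"},
]
enriched = attach_variants(tiles, variants)
```

    The primary query decides which tiles appear and in what order, so traditional paging still works; the secondary query only decorates the tiles on the current page.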

    AZURE USER VOICE

    The optimal solution would be to have collapsing support in Azure Search. You could vote for collapsing in the Azure Search user voice as mentioned in the related SO post. The Azure Search user voice entry for collapsing was moved and hasn’t been updated in 7 years it seems:

    https://feedback.azure.com/d365community/idea/0c5a17be-0225-ec11-b6e6-000d3a4f07b8
