
I’m building an aggregation pipeline in MongoDB and I’m encountering some unexpected behaviour.

The pipeline is as follows:

[{
   "$search":{
      "index":"vector_index",
      "knnBeta":{
         "vector":[
            -0.30345699191093445,
            0.6833441853523254,
            1.2565147876739502,
            -0.6364057064056396
         ],
         "path":"embedding",
         "k":10,
         "filter":{
            "compound":{
               "filter":[
                  {
                     "text":{
                        "path":"my.field.name",
                        "query":[
                           "value1",
                           "value2",
                           "value3",
                           "value4"
                        ]
                      }
                   },
                   {
                      "text":{
                         "path":"my.field.name2",
                         "query":"something_else"
                      }
                   }
               ]
            }
         }
      }
   }
},
    {
   "$project":{
      "score":{
         "$meta":"searchScore"
      },
      "embedding":0
   }
}

]

The pipeline should perform a vector search (using vector_index, the embedding path, and the given vector), and that part seems to work correctly. It also applies a filter, which should limit the vector search to documents having my.field.name equal to value1 or value2 or value3 or value4, and my.field.name2 equal to something_else.

Instead, only the second filter works, or at least it seems to (the value of the second filter is a single letter).

I tried using the must clause in place of the filter clause inside compound, but the outcome remains the same.

I also tried removing the second filter (the one without the list) and I still get unfiltered results.

Am I doing something wrong? How can I do this correctly?

2 Answers


  1. Chosen as BEST ANSWER

    Ok, I think I have found the reason for this behaviour and how to solve it.

    By default, MongoDB Atlas Search uses the Standard Analyzer as the search analyser for fields that are not vectors. In JSON:

        {
          "mappings": {
            "fields": {
              "title": {
                "type": "string",
                "analyzer": "lucene.standard"
              }
            }
          }
        } 
    

    The standard analyser

    divides text into terms based on word boundaries

    As a consequence, if the search term contains a space, the analyser splits it on the spaces and the query matches ANY of the resulting words.

    To avoid this behaviour it is necessary to use the Keyword Analyser, which instead treats the whole string as a single term.
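    A rough Python sketch of the difference (the helper names here are made up; the real analyzers run inside Lucene on the Atlas side):

    ```python
    import re

    def standard_analyze(text: str) -> list[str]:
        # Rough approximation: lucene.standard splits on word boundaries
        # and lowercases each token.
        return [t.lower() for t in re.findall(r"\w+", text)]

    def keyword_analyze(text: str) -> list[str]:
        # lucene.keyword emits the entire input as a single token.
        return [text]

    # "value 2" becomes two independent terms under the standard analyzer,
    # so a text query can match documents containing just "value" or just "2".
    print(standard_analyze("value 2"))  # ['value', '2']
    print(keyword_analyze("value 2"))   # ['value 2']
    ```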

    In the end, the Index definition should look like this:

    {
      "mappings": {
        "dynamic": true,
        "fields": {
          "embedding": {
            "dimensions": 768,
            "similarity": "cosine",
            "type": "knnVector"
          },
          "my.field.name": {
            "analyzer": "lucene.keyword",
            "type": "string"
          }
        }
      }
    }
    

    In particular, the first part (the embedding entry) is the definition of the vector search field, while

    "my.field.name": { "analyzer": "lucene.keyword", "type": "string" }

    specifies that we want to use the keyword analyser.
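    For reference, the definition can also be kept in application code as plain JSON. The sketch below only builds and validates the document; the commented-out call shows how I believe it could be submitted with pymongo 4.5+ (the collection name is hypothetical):

    ```python
    import json

    # Same index definition as above, held as a Python dict.
    index_definition = {
        "mappings": {
            "dynamic": True,
            "fields": {
                "embedding": {
                    "dimensions": 768,
                    "similarity": "cosine",
                    "type": "knnVector",
                },
                "my.field.name": {
                    "analyzer": "lucene.keyword",
                    "type": "string",
                },
            },
        }
    }

    # Sanity check: the definition must be plain JSON before sending it to Atlas.
    assert json.loads(json.dumps(index_definition)) == index_definition

    # With pymongo 4.5+ against an Atlas cluster (not run here):
    # db.my_collection.create_search_index(
    #     {"name": "vector_index", "definition": index_definition}
    # )
    ```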


  2. The OP points to using a keyword Analyzer

    That would address the root cause of the issue: the text analyzer that was initially being used.

    MongoDB Atlas Search uses different types of analyzers to process the text data. The analyzers define how the text data should be indexed and searched. It affects how MongoDB handles the splitting of text into terms (or tokens) and how it matches search queries against those terms.

    The default analyzer is the Standard Analyzer, which splits text into terms based on word boundaries, often whitespace and punctuation. For instance, a search query with the term "value 2" would have been split into two separate terms: "value" and "2". This means it would match documents containing any of these separate terms, not necessarily the exact phrase "value 2".

    The solution involved switching to the Keyword Analyzer. Unlike the standard analyzer, the keyword analyzer treats the entire text as a single term. This allows it to match the exact phrase in the search query, hence solving the issue you were encountering with spaces in search queries.

    That means updating the index mapping to specify how the my.field.name field should be indexed, as shown in this fragment:

    "my.field.name": {
        "analyzer": "lucene.keyword",
        "type": "string"
    }
    
    • analyzer: specifies the "lucene.keyword" analyzer, so the entire string is treated as a single term.
    • type: defines the field as a string type.

    Original answer:

    I see the compound syntax as:

    {
      $search: {
        "index": <index name>, // optional, defaults to "default"
        "compound": {
          <must | mustNot | should | filter>: [ { <clauses> } ],
          "score": <options>
        }
      }
    }
    

    The filter field within the compound clause takes an array of operator documents, each a complete clause such as a text operator.

    To require that every condition matches, you can use the must clause, which is an array of clauses that must all match:

    {
      "$search": {
        "index": "vector_index",
        "knnBeta": {
          "vector": [
            -0.30345699191093445,
            0.6833441853523254,
            1.2565147876739502,
            -0.6364057064056396
          ],
          "path": "embedding",
          "k": 10,
          "filter": {
            "compound": {
              "must": [
                {
                  "text": {
                    "path": "my.field.name",
                    "query": [
                      "value1",
                      "value2",
                      "value3",
                      "value4"
                    ]
                  }
                },
                {
                  "text": {
                    "path": "my.field.name2",
                    "query": "something_else"
                  }
                }
              ]
            }
          }
        }
      }
    },
    {
      "$project": {
        "score": {
          "$meta": "searchScore"
        },
        "embedding": 0
      }
    }
    

    Each individual filter is a separate document in the must array, specifying the path and the query conditions for that path.
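    The structure is easy to get right programmatically; here is a minimal sketch with a hypothetical helper that turns (path, query) pairs into must clauses:

    ```python
    def knn_with_filters(vector, k, clauses, index="vector_index", path="embedding"):
        # Each (field, query) pair becomes its own document in the must array.
        must = [{"text": {"path": field, "query": query}} for field, query in clauses]
        return {
            "$search": {
                "index": index,
                "knnBeta": {
                    "vector": vector,
                    "path": path,
                    "k": k,
                    "filter": {"compound": {"must": must}},
                },
            }
        }

    stage = knn_with_filters(
        [-0.303, 0.683, 1.256, -0.636],  # truncated example vector
        10,
        [
            ("my.field.name", ["value1", "value2", "value3", "value4"]),
            ("my.field.name2", "something_else"),
        ],
    )
    print(len(stage["$search"]["knnBeta"]["filter"]["compound"]["must"]))  # 2
    ```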


    I tried using must as well, but unfortunately I observed no change.

    The issue might be an incorrect use of text filters on non-text fields, or some mismatch in the data.

    For testing, run separate pipelines, each with a single filter, to confirm whether each one works as intended on its own. That will help identify whether a specific filter is the problem.
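    For example, a quick sketch (with a truncated example vector) that emits one single-filter pipeline per condition:

    ```python
    base_knn = {
        "vector": [-0.303, 0.683, 1.256, -0.636],  # truncated example vector
        "path": "embedding",
        "k": 10,
    }

    filters = [
        {"text": {"path": "my.field.name", "query": ["value1", "value2"]}},
        {"text": {"path": "my.field.name2", "query": "something_else"}},
    ]

    # Build one pipeline per filter so each condition can be checked in isolation.
    pipelines = [
        [{"$search": {
            "index": "vector_index",
            "knnBeta": {**base_knn, "filter": {"compound": {"must": [f]}}},
        }}]
        for f in filters
    ]
    print(len(pipelines))  # 2
    ```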

    Double-check the field paths my.field.name and my.field.name2 to ensure they are correct and that they correspond to fields in your MongoDB documents.

    And make sure the vector_index is correctly configured and includes the fields you are trying to filter on.

    Also, for text searches, it is sometimes better to use the phrase operator, which matches the whole phrase rather than individual terms. Try replacing the text clause with a phrase clause to see if it helps:

    {
       "phrase": {
          "path": "my.field.name2",
          "query": "something_else"
       }
    }
    

    All this assumes you have documents that satisfy both conditions; sometimes there are simply no documents that meet all the specified conditions.


    By doing some tests, it seems that this error happens only when there are spaces in the string I am searching for (e.g. value 2 instead of value2). Is there any way to fix this?

    It might be related to how the text search handles tokenization: the analyzer splits the field content on whitespace and certain other delimiters and indexes the resulting tokens, so a query string containing a space is treated as multiple terms.

    To work around this, you can move the filtering out of $search entirely and add a $match stage after it, using exact-match operators such as $in and plain equality instead of a text search for strings with spaces.

    [
      {
        "$search": {
          "index": "vector_index",
          "knnBeta": {
            "vector": [
              -0.30345699191093445,
              0.6833441853523254,
              1.2565147876739502,
              -0.6364057064056396
            ],
            "path": "embedding",
            "k": 10
          }
        }
      },
      {
        "$match": {
          "$and": [
            {"my.field.name": {"$in": ["value1", "value 2", "value3", "value4"]}},
            {"my.field.name2": "something_else"}
          ]
        }
      },
      {
        "$project": {
          "score": {
            "$meta": "searchScore"
          },
          "embedding": 0
        }
      }
    ]
    

    • The $search stage carries out the vector search without any filters.
    • The $match stage introduced after $search filters the results: it uses the $in operator for my.field.name to match any of the values in the array, and a simple equality match for my.field.name2.
    • The $project stage remains the same, projecting the search score and excluding the embedding field.
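    One trade-off with this workaround is worth noting: $search returns at most k candidates first, and $match discards non-matching ones afterwards, so fewer than k documents can survive. A plain-Python sketch of that post-filtering step (the documents are invented):

    ```python
    # Pretend these are the k=10 nearest neighbours returned by $search.
    candidates = [
        {"my": {"field": {"name": "value 2", "name2": "something_else"}}},
        {"my": {"field": {"name": "other", "name2": "something_else"}}},
        {"my": {"field": {"name": "value1", "name2": "nope"}}},
    ]

    wanted = {"value1", "value 2", "value3", "value4"}

    # Equivalent of the $match stage: $in on my.field.name plus
    # equality on my.field.name2, applied after the vector search.
    matched = [
        doc for doc in candidates
        if doc["my"]["field"]["name"] in wanted
        and doc["my"]["field"]["name2"] == "something_else"
    ]
    print(len(matched))  # 1
    ```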
