I’m building an aggregation pipeline in MongoDB and I’m encountering some unexpected behaviour.
The pipeline is as follows:
[{
"$search":{
"index":"vector_index",
"knnBeta":{
"vector":[
-0.30345699191093445,
0.6833441853523254,
1.2565147876739502,
-0.6364057064056396
],
"path":"embedding",
"k":10,
"filter":{
"compound":{
"filter":[
{
"text":{
"path":"my.field.name",
"query":[
"value1",
"value2",
"value3",
"value4"
]
}
},
{
"text":{
"path":"my.field.name2",
"query":"something_else"
}
}
]
}
}
}
}
},
{
"$project":{
"score":{
"$meta":"searchScore"
},
"embedding":0
}
}
]
The pipeline should run a vector search (index vector_index, path embedding, and the query vector above), and that part seems to work correctly. It also has a filter, which should limit the vector search to documents where my.field.name equals value1 or value2 or ... and my.field.name2 equals something_else.
Instead, only the second filter seems to work (the value of the second filter is a single letter).
I also tried using the must clause in place of filter inside the compound clause, but the outcome remains the same.
I also tried removing the second filter (the one without the list), and I still get unfiltered results.
Am I doing something wrong? How can I do this correctly?
2 Answers
OK, I think I have found the reason for this behaviour and how to solve it.
By default, MongoDB Atlas Search uses the Standard Analyzer (lucene.standard) as the search analyzer for fields that are not vectors.
As a consequence, if the search term contains a space, it is split on spaces and the search matches ANY of the resulting words.
To avoid this behaviour it is necessary to use the Keyword Analyzer, which instead uses the whole string as a single search term.
In the end, the index definition should contain two parts: the definition of the (custom) vector search, and a mapping for the filtered field that specifies that we want to use the keyword analyzer.
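The index definition itself did not survive in this answer, so the following is only a sketch of what such a definition might look like, not the author's exact one. The `dimensions` value (4, matching the short vector in the question), the `similarity` setting, `dynamic: true`, and the mapping for `my.field.name2` are all assumptions.

```javascript
// Hypothetical Atlas Search index definition, written as a JS object so it can
// be pasted into the Atlas JSON editor after JSON.stringify. Values are assumed.
const indexDefinition = {
  mappings: {
    dynamic: true, // assumption: index the remaining fields dynamically
    fields: {
      // Vector field queried by knnBeta; "dimensions" must match the embedding size.
      embedding: { type: "knnVector", dimensions: 4, similarity: "cosine" },
      // my.field.name is nested, so each level of the path is mapped as a "document".
      my: {
        type: "document",
        fields: {
          field: {
            type: "document",
            fields: {
              // Keyword analyzer: the whole string is indexed as a single term.
              name: { type: "string", analyzer: "lucene.keyword" },
              name2: { type: "string", analyzer: "lucene.keyword" },
            },
          },
        },
      },
    },
  },
};

console.log(JSON.stringify(indexDefinition, null, 2));
```

After changing the analyzer, the index has to be rebuilt before the filter behaves differently.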
The OP points to using a Keyword Analyzer.
That would address the root cause of the issue: the text analyzer that was initially being used.
MongoDB Atlas Search uses different types of analyzers to process the text data. The analyzers define how the text data should be indexed and searched. It affects how MongoDB handles the splitting of text into terms (or tokens) and how it matches search queries against those terms.
The default analyzer is the Standard Analyzer, which splits text into terms based on word boundaries, often whitespace and punctuation. For instance, a search query with the term "value 2" would have been split into two separate terms: "value" and "2". This means it would match documents containing any of these separate terms, not necessarily the exact phrase "value 2".
The solution involved switching to the Keyword Analyzer. Unlike the standard analyzer, the keyword analyzer treats the entire text as a single term. This allows it to match the exact phrase in the search query, hence solving the issue you were encountering with spaces in search queries.
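To make the difference concrete, here is a toy simulation of the two analyzers. It is a rough approximation written for illustration only: the real Lucene analyzers follow Unicode segmentation rules, and the function names below are hypothetical, not part of any MongoDB API.

```javascript
// Rough stand-in for lucene.standard: lowercase, then split on anything
// that is not a letter or a digit.
function standardTokens(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Rough stand-in for lucene.keyword: the whole string is one term.
function keywordTokens(text) {
  return [text];
}

// A text query matches when ANY query token equals ANY indexed token.
function matches(analyze, indexedValue, query) {
  const indexed = new Set(analyze(indexedValue));
  return analyze(query).some((token) => indexed.has(token));
}

console.log(standardTokens("value 2"));                     // [ 'value', '2' ]
console.log(matches(standardTokens, "value 7", "value 2")); // true: both share "value"
console.log(matches(keywordTokens, "value 7", "value 2"));  // false: whole strings differ
console.log(matches(keywordTokens, "value 2", "value 2"));  // true: exact match
```

Under the standard analyzer a document holding "value 7" passes a filter for "value 2" because the two share the token "value", which is consistent with the filter appearing to do nothing.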
That means updating the index mapping to specify how the my.field.name field should be indexed: the field is mapped with the "lucene.keyword" analyzer, meaning the entire string will be treated as a single term.
Original answer:
Looking at the compound syntax: the filter field within the compound operator should be an array of filter clauses, but in your case it includes a text filter and a field directly.
To use multiple filters, you will want to use the must clause, which is an array of clauses that all must match. Each individual filter is a separate document in the must array, specifying the path and the query conditions for that path.
The issue might also be the incorrect use of text filters with non-text fields, or some data mismatch.
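The `must` form described above can be sketched as follows. This is a reconstruction under assumptions, since the original snippet is missing: it reuses the index name, paths, and values from the question, and closes each clause as its own document.

```javascript
// Sketch of the question's pipeline with each filter as a separate document
// inside compound.must (names and values copied from the question).
const mustPipeline = [
  {
    $search: {
      index: "vector_index",
      knnBeta: {
        vector: [-0.30345699191093445, 0.6833441853523254, 1.2565147876739502, -0.6364057064056396],
        path: "embedding",
        k: 10,
        filter: {
          compound: {
            must: [
              // Clause 1: my.field.name matches any of the listed values.
              { text: { path: "my.field.name", query: ["value1", "value2", "value3", "value4"] } },
              // Clause 2: my.field.name2 matches a single value.
              { text: { path: "my.field.name2", query: "something_else" } },
            ],
          },
        },
      },
    },
  },
  // Return the search score and drop the large embedding array.
  { $project: { score: { $meta: "searchScore" }, embedding: 0 } },
];

console.log(JSON.stringify(mustPipeline, null, 2));
```

In mongosh this would be run with something like `db.getCollection("my_collection").aggregate(mustPipeline)`, where the collection name is a placeholder.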
For testing, run separate pipelines with individual filters to confirm if they are working as intended independently. That will help to identify if there is an issue with a specific filter.
Double-check the field paths my.field.name and my.field.name2 to ensure they are correct and that they correspond to fields in your MongoDB documents.
And make sure the vector_index is correctly configured and includes the fields you are trying to filter on.
Also, for text searches it is sometimes better to use the phrase operator, which matches the exact phrase. Try modifying your query with the phrase operator to see if it works.
All this assumes you have documents that satisfy both conditions. Sometimes there are simply no documents that meet all the specified conditions(!).
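One way to run these checks is to build the same pipeline around a single filter clause at a time, including a `phrase` variant. The helper name `pipelineWithFilter`, the labels, and the example values are hypothetical; this is a sketch for testing, not a fix in itself.

```javascript
// Hypothetical helper: wrap one filter clause in the question's knnBeta pipeline.
function pipelineWithFilter(clause, queryVector) {
  return [
    {
      $search: {
        index: "vector_index",
        knnBeta: { vector: queryVector, path: "embedding", k: 10, filter: clause },
      },
    },
    { $project: { score: { $meta: "searchScore" }, embedding: 0 } },
  ];
}

const exampleVector = [-0.30345699191093445, 0.6833441853523254, 1.2565147876739502, -0.6364057064056396];

// One candidate clause per test run, so each filter is checked in isolation.
const candidates = {
  nameText: { text: { path: "my.field.name", query: ["value1", "value2"] } },
  name2Text: { text: { path: "my.field.name2", query: "something_else" } },
  // phrase requires the query terms to appear together, in order.
  namePhrase: { phrase: { path: "my.field.name", query: "value 2" } },
};

const pipelines = Object.fromEntries(
  Object.entries(candidates).map(([label, clause]) => [label, pipelineWithFilter(clause, exampleVector)])
);

console.log(Object.keys(pipelines)); // [ 'nameText', 'name2Text', 'namePhrase' ]
```

Running each pipeline separately and comparing the returned values of my.field.name and my.field.name2 shows which clause, if any, is not restricting the results.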
By doing some tests it seems that this error happens only when there are spaces in the string I am searching for (so like value 2 instead of value2). Is there any way to fix this?
It seems like it might be related to how MongoDB full-text search handles tokenization. When a text index is created, MongoDB tokenizes the content of the fields based on whitespace and some other delimiters, and creates an index on these tokens.
To work around this issue, you can use a $regex match in a $match stage instead of a text search for querying strings with spaces. That can be added after your $search stage to further filter the results based on the regex pattern.
The $search stage is used to carry out the vector search without any filters.
A $match stage is introduced after the $search stage to filter the results based on the regex pattern and other conditions. That stage uses the $in operator for my.field.name to match any of the values in the array and a simple equality match for my.field.name2.
The $project stage remains the same, projecting the search score and excluding the embedding field.
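The workaround described above might look like the sketch below. The original snippet is missing, so this follows the description: an unfiltered `$search`, then a `$match` using `$in` and plain equality. The values containing spaces are made-up examples, and a `$regex` condition could be substituted where pattern matching is needed.

```javascript
// Sketch of the post-filter workaround: vector search first, exact matching after.
const postFilterPipeline = [
  {
    // Vector search with no filter clause.
    $search: {
      index: "vector_index",
      knnBeta: {
        vector: [-0.30345699191093445, 0.6833441853523254, 1.2565147876739502, -0.6364057064056396],
        path: "embedding",
        k: 10,
      },
    },
  },
  {
    // Exact string comparison, so values containing spaces are not tokenized.
    $match: {
      "my.field.name": { $in: ["value 1", "value 2", "value 3", "value 4"] },
      "my.field.name2": "something_else",
    },
  },
  { $project: { score: { $meta: "searchScore" }, embedding: 0 } },
];

console.log(JSON.stringify(postFilterPipeline, null, 2));
```

Because $match runs after the k nearest neighbours have already been selected, this pipeline can return fewer than k documents (possibly none), whereas a filter inside knnBeta restricts the candidates before the nearest neighbours are chosen.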