Searching for entries in a keyword field stored in an Azure Cognitive Search index does not return the expected results when I search for them within a long text. Multi-word tokens such as ‘microsoft azure’ are not returned as a match when searching with the text "This text contains microsoft azure".
I am working with Azure Cognitive Search through the Python SDK. Say I’m building a search index where each document has a "name" field. The name (which can consist of multiple words) only makes sense if the entire name is tokenized as one token, so I use the "keyword_v2" tokenizer when building the analyzer for this field.
from azure.search.documents.indexes.models import (
    CustomAnalyzer,
    SearchableField,
    SearchFieldDataType,
    SimpleField,
)

# Define the custom analyzer for the name field: the whole value becomes one lowercased token
name_analyzer = CustomAnalyzer(
    name="name_analyzer", tokenizer_name="keyword_v2", token_filters=["lowercase"]
)

# Specify the index schema
fields = [
    SimpleField(name="key", type=SearchFieldDataType.String, key=True),
    SearchableField(name="name", type=SearchFieldDataType.String, analyzer_name="name_analyzer", searchable=True),
]
This works as expected when I test the analyzer using the REST API. As an example, I have the following indexed entries in the name field: [‘microsoft azure’, ‘amazon aws’, ‘google cloud’]. The custom analyzer I set up correctly tokenizes each entry as one token and not as multiple tokens (e.g. ‘microsoft’ and ‘azure’).
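For reference, a check like the one described above could be sketched as follows. This only builds the request for the Analyze Text REST endpoint and reads tokens from a response-shaped dictionary; nothing is sent. The service name, index name, and API version are placeholders (assumptions, not values from the question), the helper names are hypothetical, and the sample response is hand-written to show the expected shape rather than captured from a real service.

```python
import json


def build_analyze_request(service, index, api_version, text, analyzer):
    """Return (url, body) for an Analyze Text call; no request is sent here."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/analyze?api-version={api_version}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer})
    return url, body


def tokens_from_response(response):
    """Extract the token strings from a parsed Analyze Text response."""
    return [item["token"] for item in response.get("tokens", [])]


# Placeholder service, index, and API version: replace with your own values.
url, body = build_analyze_request(
    "my-service", "my-index", "<api-version>", "Microsoft Azure", "name_analyzer"
)

# Hand-written example of the response shape for a keyword tokenizer plus
# lowercase filter: a single token covering the whole input.
sample_response = {
    "tokens": [
        {"token": "microsoft azure", "startOffset": 0, "endOffset": 15, "position": 0}
    ]
}
print(tokens_from_response(sample_response))
```

If the analyzer were splitting on whitespace instead, the same call would list ‘microsoft’ and ‘azure’ as two separate tokens.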
The problem occurs when I search for the stored names in a text.
text_example = "This is a text containing microsoft azure."
# search_client is an existing azure.search.documents.SearchClient for this index
results = search_client.search(
    search_text=text_example, include_total_count=True, select=["name"],
    search_fields=["name"], highlight_fields="name", query_type="full",
)
print("Total Documents Matching Query:", results.get_count())
for result in results:
    print(result)
I expect that searching with text_example returns a hit on ‘microsoft azure’, but it doesn’t. It returns nothing. I suspect that because the same custom analyzer is used as both the index analyzer and the search analyzer, the entire text_example is tokenized as one token, which is not in the index, so nothing is returned.
Can I resolve this problem of searching for multi-word tokens in a long text in an efficient way using Azure Cognitive Search?
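The suspicion above can be explored with a small stand-alone simulation. This is a toy model of a keyword tokenizer with a lowercase filter, not the service's analyzer, and the helper names are hypothetical. It covers two possibilities, the whole query analyzed as one unit and each whitespace-separated word analyzed on its own, since the question does not establish which one applies.

```python
def keyword_lowercase(text):
    """Mimic a keyword tokenizer plus lowercase filter: one token per input."""
    return [text.lower()]


# Terms stored in the index: each name is kept as a single token.
names = ["microsoft azure", "amazon aws", "google cloud"]
index_terms = {tok for name in names for tok in keyword_lowercase(name)}


def matches(tokens):
    """Return the indexed terms that equal one of the query tokens."""
    return sorted(index_terms.intersection(tokens))


text_example = "This is a text containing microsoft azure."

# Case 1: the whole query text is analyzed as one unit.
whole = keyword_lowercase(text_example)

# Case 2: the query is split on whitespace first, then each word is analyzed.
per_word = [tok for word in text_example.split() for tok in keyword_lowercase(word)]

# Case 3: only the name itself is passed to the analyzer.
exact = keyword_lowercase("Microsoft Azure")

print(matches(whole))     # []
print(matches(per_word))  # []
print(matches(exact))     # ['microsoft azure']
```

In this model, neither way of analyzing the long text produces the token ‘microsoft azure’, so a keyword-analyzed field only matches when the query token is exactly the stored name.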
2 Answers
For the index analyzer, it's fine to use keyword_v2 if you are filtering on that field. But to satisfy your use case, I would rather suggest using a standard analyzer for both index and search, and then using a phrase search that requires the term.
So your text_example would be: This is a text containing "microsoft azure".
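A minimal sketch of that suggestion, assuming the name field uses a standard analyzer at both index and query time. The helper as_phrase is hypothetical; it wraps a term in double quotes, which marks a phrase in the full Lucene query syntax, and escapes any embedded backslashes and double quotes. The commented-out call reuses the search_client from the question and is not executed here.

```python
def as_phrase(term):
    """Wrap a term in double quotes so it is parsed as a phrase query."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


search_text = as_phrase("microsoft azure")
print(search_text)  # "microsoft azure"

# With an existing SearchClient (not created here), the query might look like:
# results = search_client.search(
#     search_text=search_text,
#     search_fields=["name"],
#     query_type="full",
# )
```

Note that this requires knowing which name to quote before issuing the query, which may or may not fit the use case in the question.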
To resolve this issue, you can try using a different analyzer for the search query.
You can create a custom analyzer that uses the standard_v2 tokenizer for the search query and apply it to the search text. Below is the updated analyzer and schema code snippet:
With the above schema I created an index and uploaded sample data:
With the above setup, I was able to get the required results.
Search Query Code:
Result: