I have created an Azure Search index from the DataFrame below.
Scenario 1: search_client.search('stand-up', top=3) returns all 3 rows from the index,
but
Scenario 2: search_client.search('What do comics do?', top=3) returns only 1 result. (Images at the end of the question.)
My question: why does the search method not return all 3 rows in Scenario 2, even though I specify top=3? Is there a threshold on @search.score that a row must meet in order to be returned? If so, can this threshold be controlled through a parameter of the .search method? I have already been through the method's source code and don't see any such parameter.
[Image: return for Scenario 1]
[Image: return for Scenario 2]
Below is the full code to reproduce this issue
AZURE_SEARCH_SERVICE = 'to be filled as str'
AZURE_SEARCH_KEY = 'to be filled as str'
from azure.search.documents.indexes import SearchIndexClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes.models import *
from azure.search.documents import SearchClient
import pandas as pd
from uuid import uuid4
from azure.search.documents.models import QueryType, Vector
def create_search_index(index_name: str) -> None:
    # Client for managing indexes on the search service
    index_client = SearchIndexClient(
        endpoint=f"https://{AZURE_SEARCH_SERVICE}.search.windows.net/",
        credential=AzureKeyCredential(AZURE_SEARCH_KEY))
    index = SearchIndex(
        name=index_name,
        fields=[
            SimpleField(name="uuid", type="Edm.String", key=True),
            SimpleField(name="Numb_Str", type="Edm.String", filterable=True, facetable=True),
            # Only "Sent" is full-text searchable
            SearchableField(name="Sent", type="Edm.String", analyzer_name="en.microsoft"),
            SimpleField(name="Topic", type="Edm.String", filterable=True, facetable=True),
        ],
        semantic_settings=SemanticSettings(
            configurations=[SemanticConfiguration(
                name='default',
                prioritized_fields=PrioritizedFields(
                    title_field=None, prioritized_content_fields=[SemanticField(field_name='Sent')]))])
    )
    print(f"Creating {index_name} search index")
    index_client.create_index(index)
def upload_to_created_index(index_name: str, df: pd.DataFrame) -> None:
    search_client = SearchClient(
        endpoint=f"https://{AZURE_SEARCH_SERVICE}.search.windows.net/",
        index_name=index_name,
        credential=AzureKeyCredential(AZURE_SEARCH_KEY))
    # One dict per DataFrame row, keyed by column name
    sections = df.to_dict("records")
    search_client.upload_documents(documents=sections)
#create df for uploading to search index
data = [{'uuid':str(uuid4()),'Numb_Str':'10','Sent':'Stand-up comedy is a comedic performance to a live audience in which the performer addresses the audience directly from the stage','Topic':'Standup'},
{'uuid':str(uuid4()),'Numb_Str':'20','Sent':'A stand-up defines their craft through the development of the routine or set','Topic':'Standup'},
{'uuid':str(uuid4()),'Numb_Str':'30', 'Sent':'Experienced stand-up comics with a popular following may produce a special.','Topic':'Standup'}]
df = pd.DataFrame(data)
pd.set_option('display.max_colwidth', None)
#create empty search index
create_search_index("test-simple2")
#upload df to created search index
upload_to_created_index('test-simple2',df)
#query the search index
search_client = SearchClient(
endpoint=f"https://{AZURE_SEARCH_SERVICE}.search.windows.net",
index_name='test-simple2',
credential=AzureKeyCredential(AZURE_SEARCH_KEY))
query_results = search_client.search('What do comics do?',top=3)
query_results = list(query_results)
#get query results in a df
df_results = pd.DataFrame(query_results)
df_results
If I change the .search method's arguments to make it do a semantic search, I still get 1 result. I do it with the below:
query_results = search_client.search('What do comics do?',
                                     top=3,
                                     query_type=QueryType.SEMANTIC,
                                     query_language='en-us',
                                     semantic_configuration_name="default")
2 Answers
This is expected behavior, because you are running a full-text search scored with the BM25 algorithm. When you query the term "stand-up", it occurs in each of the three documents, so all three are retrieved from the search service. When you query "What do comics do?", only the third document is returned: it is the only document in the index that contains a query term ("comics") and can therefore receive a search score, regardless of the value you pass for the top query parameter. While the $top query parameter sets the maximum number of search results to retrieve, the search service returns either that many results or fewer when no other documents in the index match the query. You may find the vector search and semantic search experiences interesting for your intended use case.
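To make the matching behavior concrete, here is a minimal local sketch. It is not the service's implementation: the tokenizer below is a toy assumption (lowercase, split on non-alphanumeric characters), not the en.microsoft analyzer, and matching_docs is a hypothetical helper. It only illustrates that a keyword query can return fewer than top documents when few documents share a term with the query.

```python
import re

# The three "Sent" values from the question's DataFrame.
DOCS = [
    "Stand-up comedy is a comedic performance to a live audience in which "
    "the performer addresses the audience directly from the stage",
    "A stand-up defines their craft through the development of the routine or set",
    "Experienced stand-up comics with a popular following may produce a special.",
]


def tokenize(text):
    """Toy analyzer (assumption): lowercase, split on non-alphanumerics."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_docs(query, docs=DOCS, top=3):
    """Return indexes of docs sharing at least one term with the query, capped at `top`."""
    query_terms = tokenize(query)
    hits = [i for i, doc in enumerate(docs) if query_terms & tokenize(doc)]
    return hits[:top]


print(matching_docs("stand-up"))            # [0, 1, 2]
print(matching_docs("What do comics do?"))  # [2]
```

In this sketch top only caps the result list; it never pads it with documents that did not match.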
Semantic search will only re-rank the results of a BM25 query so it’s not going to help in this specific case, but I imagine vector search would give you what you’re looking for. In the Azure Cognitive Search service, vector search = similarity search. Vector search is in public preview and automatically available on most search services. The hard part is that you now have to vectorize content and queries, but your test data is small and you could plug it into the existing samples for proof-of-concept testing. This repo has the code samples for REST, C#, Python, and JavaScript: https://github.com/Azure/cognitive-search-vector-pr
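As a rough illustration of why similarity search behaves differently, the sketch below ranks the three sentences by cosine similarity to the query. Everything here is an assumption for demonstration: embed is a toy stand-in that counts character trigrams, whereas real vector search uses a learned embedding model and the service's vector index, and vector_search is a hypothetical helper, not an SDK call. The point is only that every document gets a similarity score, so top=3 yields three results.

```python
import math
from collections import Counter

# The three "Sent" values from the question's DataFrame.
DOCS = [
    "Stand-up comedy is a comedic performance to a live audience in which "
    "the performer addresses the audience directly from the stage",
    "A stand-up defines their craft through the development of the routine or set",
    "Experienced stand-up comics with a popular following may produce a special.",
]


def embed(text, n=3):
    """Toy stand-in for an embedding model (assumption): character trigram counts."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_search(query, docs=DOCS, top=3):
    """Score every doc against the query and return the `top` (score, index) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), i) for i, d in enumerate(docs)]
    scored.sort(reverse=True)
    return scored[:top]


results = vector_search("What do comics do?")
print(len(results))  # 3
```

Unlike the keyword case, no document is excluded for lacking a shared term; low-similarity documents simply rank lower.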