I want to create an Azure Cognitive Search index to query a course catalogue using vectors. I have a pandas DataFrame called courses_pd with two columns, 'content' and 'embeddings'; the latter holds the embeddings I created using model = SentenceTransformer('all-MiniLM-L6-v2') and then model.encode(x).
Below is the Python code snippet that creates the index in ACS and uploads the documents from an Azure Databricks notebook.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters
)
# Azure Cognitive Search setup
service_endpoint = "https://yourserviceendpoint.search.windows.net"
admin_key = "ABC"
index_name = "courses-index"
# Wrap admin_key in AzureKeyCredential
credential = AzureKeyCredential(admin_key)
# Create the index client with AzureKeyCredential
index_client = SearchIndexClient(endpoint=service_endpoint, credential=credential)
# Define the index schema
fields = [
    SimpleField(name="id", type="Edm.String", key=True),
    SimpleField(name="content", type="Edm.String"),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=384,
        vector_search_profile_name="myHnswProfile"
    )
    # SearchField(name="embedding", type='Collection(Edm.Single)', searchable=True)
]
# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw"
        )
    ]
)
# Create the index
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)
# Send the index creation request
index_client.create_index(index)
print(f"Index '{index_name}' created successfully.")
And then I upload the documents using the code below:
from azure.search.documents import SearchClient
# Generate embeddings and upload data
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
documents = []
for i, row in courses_pd.iterrows():
    document = {
        "id": str(i),
        "content": row["content"],
        "embedding": row["embeddings"]  # Ensure embeddings are a list of floats
    }
    documents.append(document)
# Upload documents to the index
search_client.upload_documents(documents=documents)
print(f"Uploaded {len(documents)} documents to Azure Cognitive Search.")
Now, when I query the search_client, I get multiple errors. Whether I search using a raw string or after calling model.encode(str), it returns an <iterator object azure.core.paging.ItemPaged at 0x7fcf9f086220>, but the search fails according to the log.
from azure.search.documents.models import VectorQuery
# Generate embedding for the query
query = "machine learning"
query_embedding = model.encode(query).tolist() # Convert to list of floats
# Create a VectorQuery
vector_query = VectorQuery(
    vector=query_embedding,
    k=3,  # Number of nearest neighbors
    fields="embedding"  # Name of the field where embeddings are stored
)
# Perform the search
results = search_client.search(
    vector_queries=[vector_query],
    select=["id", "content"]
)
# Print the results
for result in results:
    print(f"ID: {result['id']}, Content: {result['content']}")
The error then says:
vector is not a known attribute of class <class 'azure.search.documents._generated.models._models_py3.VectorQuery'> and will be ignored
k is not a known attribute of class <class 'azure.search.documents._generated.models._models_py3.VectorQuery'> and will be ignored
HttpResponseError: (InvalidRequestParameter) The vector query's 'kind' parameter is not set.
I then tried providing kind='vector' as a parameter to VectorQuery, but it still says kind is not set!
The documents are being uploaded and the index is created, as I can see in the portal.
I must be doing something wrong, either in the way I have set up the search index or in the way I am querying it. The documentation and the GitHub codebase do not cover this, so I am seeking help from the community. I am new to this.
2 Answers
I had a look at the code. I think you almost got the index created correctly, but the key step missing is vectorizing the data and storing the embeddings for semantic search. The steps are a little long, so it is worth looking at the example here; it is a simple version of ingesting documents and setting up an index.
https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/basic-vector-workflow/azure-search-vector-python-sample.ipynb
There are many more advanced examples here:
https://github.com/Azure/azure-search-vector-samples/tree/main/demo-python
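As a rough sketch of the ingestion step (this is not taken from the linked notebook), each row can be turned into an upload-ready document by converting its embedding to a plain list of floats and checking its length against the index dimension. The helper name build_documents and the sample rows are hypothetical; 384 matches the vector_search_dimensions value in the question.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

EMBEDDING_DIM = 384  # matches vector_search_dimensions in the index schema


def build_documents(
    rows: Iterable[Tuple[str, Sequence[float]]],
    dim: int = EMBEDDING_DIM,
) -> List[Dict[str, Any]]:
    """Turn (content, embedding) pairs into dicts shaped like the index schema."""
    documents = []
    for i, (content, embedding) in enumerate(rows):
        # float() also converts NumPy scalars such as float32 to Python floats.
        vector = [float(x) for x in embedding]
        if len(vector) != dim:
            raise ValueError(f"row {i}: expected {dim} dimensions, got {len(vector)}")
        documents.append({"id": str(i), "content": content, "embedding": vector})
    return documents


# Hypothetical sample rows standing in for courses_pd
sample_rows = [("Intro to ML", [0.0] * 384), ("Databases", [0.1] * 384)]
docs = build_documents(sample_rows)
```

With the DataFrame from the question, this could be called as build_documents(zip(courses_pd["content"], courses_pd["embeddings"])), and the result passed to search_client.upload_documents(documents=docs).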
At a very high level:
It was an incorrect implementation of VectorQuery in Azure Cognitive Search. The VectorQuery object requires specific fields like kind, vector, k, and fields, but the error suggests these parameters are not recognized. This type of issue occurs when using an outdated SDK version that does not fully support vector search. Also check that your index schema correctly defines the vector field.
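Firstly, use the latest version of the azure-search-documents SDK. Vector query classes are imported directly from azure.search.documents.models, and the query must carry kind='vector'. Note that in recent releases VectorQuery itself is a base class; the VectorizedQuery subclass sets kind='vector' for you and accepts vector, k_nearest_neighbors, and fields. Update the vector search query: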
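A minimal sketch of what the updated query could look like, assuming azure-search-documents 11.4.0 or later, where VectorizedQuery is available. The helper names to_query_vector and vector_search are hypothetical; search_client and model are the objects from the question.

```python
from typing import List, Sequence


def to_query_vector(embedding: Sequence[float], dim: int = 384) -> List[float]:
    """Convert an embedding (list or NumPy array) to a plain list of floats."""
    vector = [float(x) for x in embedding]
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    return vector


def vector_search(search_client, embedding: Sequence[float], k: int = 3):
    """Run a vector-only query against the 'embedding' field."""
    # Imported lazily so to_query_vector stays usable without the SDK installed.
    from azure.search.documents.models import VectorizedQuery

    query = VectorizedQuery(
        vector=to_query_vector(embedding),
        k_nearest_neighbors=k,  # used in place of the unrecognized 'k'
        fields="embedding",
    )
    # search_text=None asks for a pure vector search with no keyword part.
    return search_client.search(
        search_text=None,
        vector_queries=[query],
        select=["id", "content"],
    )


# Usage, with model and search_client from the question:
# for result in vector_search(search_client, model.encode("machine learning")):
#     print(f"ID: {result['id']}, Content: {result['content']}")
```

search() returns a lazy ItemPaged iterator, so a rejected request only raises once the results are iterated. That would be consistent with the question, where the call returns an iterator object and the failure shows up afterwards.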
Result: