
I want to create an Azure Cognitive Search index to query a course catalogue using vectors. I have a pandas DataFrame called courses_pd with two columns, 'content' and 'embeddings'; the embeddings were created using model = SentenceTransformer('all-MiniLM-L6-v2') and then model.encode(x).

Below is the Python code snippet that creates the index in ACS and uploads the documents from an Azure Databricks notebook.

from azure.core.credentials import AzureKeyCredential  # needed for AzureKeyCredential below
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters
)


# Azure Cognitive Search setup
service_endpoint = "https://yourserviceendpoint.search.windows.net"
admin_key = "ABC"
index_name = "courses-index"

# Wrap admin_key in AzureKeyCredential
credential = AzureKeyCredential(admin_key)

# Create the index client with AzureKeyCredential
index_client = SearchIndexClient(endpoint=service_endpoint, credential=credential)

# Define the index schema
fields = [
    SimpleField(name="id", type="Edm.String", key=True),
    SimpleField(name="content", type="Edm.String"),
    SearchField(
        name="embedding", 
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True, 
        vector_search_dimensions=384, 
        vector_search_profile_name="myHnswProfile"
        )
    # SearchField(name="embedding", type='Collection(Edm.Single)', searchable=True)
]

# Configure the vector search configuration  
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw"
        )
    ]
)

# Create the index
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)

# Send the index creation request
index_client.create_index(index)

print(f"Index '{index_name}' created successfully.")

I then upload the documents using the code below:

from azure.search.documents import SearchClient

# Generate embeddings and upload data
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)

documents = []
for i, row in courses_pd.iterrows():
    document = {
        "id": str(i),
        "content": row["content"],
        "embedding": row["embeddings"]  # Ensure embeddings are a list of floats
    }
    documents.append(document)

# Upload documents to the index
search_client.upload_documents(documents=documents)
print(f"Uploaded {len(documents)} documents to Azure Cognitive Search.")
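The comment in the upload loop asks for embeddings as a list of floats. A minimal sketch of one way to enforce that before uploading is shown here; the helper names to_float_list and build_documents and the EXPECTED_DIM constant are hypothetical, not part of the Azure SDK, and the dimension check assumes the 384 configured in the index above.

```python
from typing import Any, Dict, Iterable, List

# Assumption: 384 matches vector_search_dimensions in the index definition.
EXPECTED_DIM = 384


def to_float_list(vector: Any, expected_dim: int = EXPECTED_DIM) -> List[float]:
    """Convert an embedding (list, tuple, or NumPy-like array) to plain Python floats."""
    # NumPy-like arrays expose tolist(); plain sequences are copied with list().
    values = vector.tolist() if hasattr(vector, "tolist") else list(vector)
    floats = [float(v) for v in values]
    if len(floats) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(floats)}")
    return floats


def build_documents(
    rows: Iterable[Dict[str, Any]], expected_dim: int = EXPECTED_DIM
) -> List[Dict[str, Any]]:
    """Build upload-ready documents from rows with 'content' and 'embeddings' keys."""
    return [
        {
            "id": str(i),  # positional id; the original loop uses the DataFrame index
            "content": row["content"],
            "embedding": to_float_list(row["embeddings"], expected_dim),
        }
        for i, row in enumerate(rows)
    ]
```

With the DataFrame from the question this could be called as build_documents(courses_pd.to_dict("records")), and the result passed to search_client.upload_documents as before. Failing early on a wrong dimension is usually easier to debug than a rejected upload.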

Now, when I query the search_client, I get multiple errors. Whether I search using a raw string or after calling model.encode(str), it returns an <iterator object azure.core.paging.ItemPaged at 0x7fcf9f086220>, but the search fails in the log.

from azure.search.documents.models import VectorQuery

# Generate embedding for the query
query = "machine learning"
query_embedding = model.encode(query).tolist()  # Convert to list of floats

# Create a VectorQuery
vector_query = VectorQuery(
    vector=query_embedding,
    k=3,  # Number of nearest neighbors
    fields="embedding"  # Name of the field where embeddings are stored
)

# Perform the search
results = search_client.search(
    vector_queries=[vector_query],
    select=["id", "content"]
)

# Print the results
for result in results:
    print(f"ID: {result['id']}, Content: {result['content']}")

The error then says:

vector is not a known attribute of class <class 'azure.search.documents._generated.models._models_py3.VectorQuery'> and will be ignored
k is not a known attribute of class <class 'azure.search.documents._generated.models._models_py3.VectorQuery'> and will be ignored
HttpResponseError: (InvalidRequestParameter) The vector query's 'kind' parameter is not set.

I then tried passing kind='vector' as a parameter to VectorQuery, but it still says kind is not set!

The documents are uploaded and the index is created, as I can see in the portal.

I must be doing something wrong, either in the way I have set up the search index or in the way I am querying it. The documentation and the GitHub codebase do not cover this, so I am seeking help from the community. I am new to this.

2 Answers


  1. I had a look at the code. I think you created the index almost correctly, but the key missing step is vectorizing the data and storing the embeddings for semantic search. The steps are a little long, so it is worth looking at the example here; it is a simple version of ingesting documents and setting up an index.

    https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/basic-vector-workflow/azure-search-vector-python-sample.ipynb

    There are many more advanced examples here:
    https://github.com/Azure/azure-search-vector-samples/tree/main/demo-python

    At a very high level:

    • a: create an index with text field and vector fields
    • b1: prepare docs with corresponding embeddings, insert docs (with embedding into AI Search)
    • b2: or create an indexer to carry out embedding generation
    • c: query the data
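The steps above can be tried locally before involving the service. The following is only a toy stand-in: it keeps documents in a Python list and ranks them by exact cosine similarity, whereas the HNSW configuration in the index performs an approximate nearest-neighbour search. The three-dimensional embeddings and the names cosine_similarity, top_k, and toy_index are made up for illustration.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float], documents: List[Dict[str, Any]], k: int = 3
) -> List[Dict[str, Any]]:
    """Return the k documents whose 'embedding' is most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, doc["embedding"]), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# a + b1: an "index" of documents that already carry their embeddings (toy values)
toy_index = [
    {"id": "0", "content": "Intro to machine learning", "embedding": [0.9, 0.1, 0.0]},
    {"id": "1", "content": "Medieval history", "embedding": [0.0, 0.2, 0.9]},
    {"id": "2", "content": "Deep learning basics", "embedding": [0.8, 0.3, 0.1]},
]

# c: query with a vector from the same embedding space
query_vector = [1.0, 0.0, 0.0]
hits = top_k(query_vector, toy_index, k=2)
for hit in hits:
    print(f"ID: {hit['id']}, Content: {hit['content']}")
```

The important point it illustrates is that the query vector must come from the same model, with the same dimensionality, as the stored embeddings.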
  2. The problem is the way VectorQuery is used with Azure Cognitive Search. A vector query needs kind, the vector itself, k, and fields, but the error shows that the parameters you passed are not recognized by the class.

    This kind of issue typically appears with an outdated SDK version that does not fully support vector search. Also check that your index schema defines the vector field correctly.

    First, use the latest version of the azure-search-documents SDK.

    • VectorQuery in azure.search.documents.models is only a base class: it does not accept vector or k and does not set kind. In azure-search-documents 11.4.0 and later, use VectorizedQuery instead; it sets kind='vector' for you and takes vector, k_nearest_neighbors, and fields.

    Update the vector search query:

    from azure.search.documents.models import VectorizedQuery
    
    # Generate embedding for the query
    query = "machine learning"
    query_embedding = model.encode(query).tolist()  # Convert embedding to a list of floats
    
    # VectorizedQuery sets kind="vector" on the request for you
    vector_query = VectorizedQuery(
        vector=query_embedding,  # Provide the embedding vector
        k_nearest_neighbors=3,  # Number of nearest neighbors to retrieve
        fields="embedding"  # Name of the vector field (comma-separated if several)
    )
    
    # Perform the search (pure vector search, so no search_text)
    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["id", "content"]  # Fields to retrieve in the results
    )
    
    # Print the results
    for result in results:
        print(f"ID: {result['id']}, Content: {result['content']}")
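The 'kind' in the error message is a property of the request body that the SDK sends to the service. As a rough sketch of what that body looks like, the helper below builds it by hand; build_search_body is a hypothetical name, and the key names assume the REST API shape from api-version 2023-11-01 onwards, so check the REST reference for the version you target.

```python
import json
from typing import Any, Dict, Sequence


def build_search_body(
    embedding: Sequence[float],
    field: str = "embedding",
    k: int = 3,
    select: Sequence[str] = ("id", "content"),
) -> Dict[str, Any]:
    """Build a JSON-serializable body for a pure vector search request."""
    return {
        "select": ",".join(select),  # comma-separated field list
        "vectorQueries": [
            {
                "kind": "vector",  # the discriminator the service complained about
                "vector": [float(x) for x in embedding],
                "fields": field,  # comma-separated string of vector field names
                "k": k,  # number of nearest neighbours to return
            }
        ],
    }


# Toy three-dimensional vector; a real query would use the 384-dimensional embedding.
body = build_search_body([0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

If a request reaches the service without the "kind" entry, it is rejected with the InvalidRequestParameter error shown in the question.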
    
