
I am trying to push an index (with embeddings) to Azure Cognitive Search. The following code pushes the index to Cognitive Search:

    # Upload some documents to the index
    with open('index.json', 'r') as file:  
        documents = json.load(file)  
    search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
    result = search_client.upload_documents(documents, timeout = 50)  
    print(f"Uploaded {len(documents)} documents") 

The code works whenever ‘index.json’ is small (I have tried it and it successfully pushes the data to Azure Cognitive Search), but it does not work whenever ‘index.json’ is large. Right now I am working with an ‘index.json’ of 69 MB.

I receive the following error when running the code:

ServiceRequestError                       Traceback (most recent call last)
Cell In[21], line 5
      3     documents = json.load(file)  
      4 search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
----> 5 result = search_client.upload_documents(documents, timeout = 50)  
      6 print(f"Uploaded {len(documents)} documents") 

File /usr/local/lib/python3.11/site-packages/azure/search/documents/_search_client.py:543, in SearchClient.upload_documents(self, documents, **kwargs)
    540 batch.add_upload_actions(documents)
    542 kwargs["headers"] = self._merge_client_headers(kwargs.get("headers"))
--> 543 results = self.index_documents(batch, **kwargs)
    544 return cast(List[IndexingResult], results)

File /usr/local/lib/python3.11/site-packages/azure/core/tracing/decorator.py:78, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
     76 span_impl_type = settings.tracing_implementation()
     77 if span_impl_type is None:
---> 78     return func(*args, **kwargs)
     80 # Merge span is parameter is set, but only if no explicit parent are passed
     81 if merge_span and not passed_in_parent:

File /usr/local/lib/python3.11/site-packages/azure/search/documents/_search_client.py:641, in SearchClient.index_documents(self, batch, **kwargs)
    631 @distributed_trace
    632 def index_documents(self, batch: IndexDocumentsBatch, **kwargs: Any) -> List[IndexingResult]:
    633     """Specify a document operations to perform as a batch.
...
--> 381     raise error
    382 if _is_rest(request):
    383     from azure.core.rest._requests_basic import RestRequestsTransportResponse

ServiceRequestError: EOF occurred in violation of protocol (_ssl.c:2427)

Does anyone know how to fix this error so that the code pushes the data to Azure Cognitive Search?

2 Answers


  1. The issue you’re facing is likely due to the large size of the ‘index.json’ file. Azure Cognitive Search has some limitations on the number of documents and the maximum document size in a single request. See https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity#document-size-limits-per-api-call

    Can you try batching the documents into smaller requests?
    You can use a SearchIndexingBufferedSender instance, which will handle batching the operations efficiently.

    Sample:

    import json

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchIndexingBufferedSender

    # Load documents from index.json
    with open('index.json', 'r') as file:
        documents = json.load(file)

    def upload_documents():
        # The buffered sender batches and flushes the upload actions automatically
        with SearchIndexingBufferedSender(service_endpoint, index_name, AzureKeyCredential(key)) as batch_client:
            # Add upload actions for all documents
            batch_client.upload_documents(documents=documents)

    if __name__ == "__main__":
        upload_documents()
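
    If you also want visibility into which actions succeed or fail while the sender flushes batches in the background, it accepts optional callbacks. A minimal sketch, assuming the on_progress and on_error keyword arguments are available in your installed version of azure-search-documents (documents, service_endpoint, index_name and key are the same as in the sample above):

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchIndexingBufferedSender

    # Callbacks are invoked per indexing action; here they only log.
    with SearchIndexingBufferedSender(
        service_endpoint,
        index_name,
        AzureKeyCredential(key),
        on_progress=lambda action: print("Action indexed"),
        on_error=lambda action: print(f"Action failed: {action}"),
    ) as batch_client:
        batch_client.upload_documents(documents=documents)
    # Exiting the 'with' block flushes any remaining buffered actions.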
    
  2. Based on the information, I have reproduced the scenario.
    I have tested with multiple JSON file sizes, and it seems the maximum allowed limits are just under 64 MB and 32,000 documents (indexing actions) per request.

    One possible solution is to split your data into smaller chunks before uploading.

    Here’s a modified version of the upload code that splits the data into chunks of 10000 documents each:

    import json

    # search_client is created the same way as in the question
    with open('data2.json', 'r') as f:
        documents = json.load(f)

    # Split the data into chunks of 10,000 documents
    chunks = [documents[i:i + 10000] for i in range(0, len(documents), 10000)]

    # Upload the data chunk by chunk
    for chunk in chunks:
        result = search_client.upload_documents(chunk)
        print(f"Uploaded {len(chunk)} documents")
    


    You can adjust the chunk size in the above code according to your document and file sizes for optimal batches, as sketched below.
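
    If individual documents vary a lot in size (for example, because of embedding vectors), you can also bound each batch by its approximate serialized payload size rather than only by document count. A minimal sketch, assuming search_client is already created as in the question; the thresholds are illustrative, not official figures, so adjust them to stay under the limits you observe:

    import json

    MAX_DOCS_PER_BATCH = 32000              # action limit per request observed above
    MAX_BYTES_PER_BATCH = 50 * 1024 * 1024  # illustrative payload cap; adjust as needed

    def iter_batches(docs):
        """Yield lists of documents bounded by count and approximate JSON size."""
        batch, batch_bytes = [], 0
        for doc in docs:
            doc_bytes = len(json.dumps(doc).encode('utf-8'))  # rough per-document size
            if batch and (len(batch) >= MAX_DOCS_PER_BATCH
                          or batch_bytes + doc_bytes > MAX_BYTES_PER_BATCH):
                yield batch
                batch, batch_bytes = [], 0
            batch.append(doc)
            batch_bytes += doc_bytes
        if batch:
            yield batch

    with open('data2.json', 'r') as f:
        documents = json.load(f)

    for batch in iter_batches(documents):
        result = search_client.upload_documents(batch)
        print(f"Uploaded {len(batch)} documents")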
