I am trying to push an index (with embeddings) to Azure Cognitive Search. The following code is what pushes the index to Cognitive Search:
# Upload some documents to the index
with open('index.json', 'r') as file:
    documents = json.load(file)
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents, timeout = 50)
print(f"Uploaded {len(documents)} documents")
The code works when ‘index.json’ is small (I have tried it, and it successfully pushes the data to Azure Cognitive Search), but it fails when ‘index.json’ is large. Right now I am working with an ‘index.json’ of 69 MB.
I receive the following error when running the code:
ServiceRequestError Traceback (most recent call last)
Cell In[21], line 5
3 documents = json.load(file)
4 search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
----> 5 result = search_client.upload_documents(documents, timeout = 50)
6 print(f"Uploaded {len(documents)} documents")
File /usr/local/lib/python3.11/site-packages/azure/search/documents/_search_client.py:543, in SearchClient.upload_documents(self, documents, **kwargs)
540 batch.add_upload_actions(documents)
542 kwargs["headers"] = self._merge_client_headers(kwargs.get("headers"))
--> 543 results = self.index_documents(batch, **kwargs)
544 return cast(List[IndexingResult], results)
File /usr/local/lib/python3.11/site-packages/azure/core/tracing/decorator.py:78, in distributed_trace..decorator..wrapper_use_tracer(*args, **kwargs)
76 span_impl_type = settings.tracing_implementation()
77 if span_impl_type is None:
---> 78 return func(*args, **kwargs)
80 # Merge span is parameter is set, but only if no explicit parent are passed
81 if merge_span and not passed_in_parent:
File /usr/local/lib/python3.11/site-packages/azure/search/documents/_search_client.py:641, in SearchClient.index_documents(self, batch, **kwargs)
631 @distributed_trace
632 def index_documents(self, batch: IndexDocumentsBatch, **kwargs: Any) -> List[IndexingResult]:
633 """Specify a document operations to perform as a batch.
...
--> 381 raise error
382 if _is_rest(request):
383 from azure.core.rest._requests_basic import RestRequestsTransportResponse
ServiceRequestError: EOF occurred in violation of protocol (_ssl.c:2427)
Does anyone know how to fix this error so that the code pushes the data to Azure Cognitive Search?
2 Answers
The issue you’re facing is likely due to the large size of the ‘index.json’ file. Azure Cognitive Search has some limitations on the number of documents and the maximum document size in a single request. See https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity#document-size-limits-per-api-call
Can you try uploading the documents in smaller batches?
You can use a SearchIndexingBufferedSender instance. The SearchIndexingBufferedSender will handle batching the operations efficiently.
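As a rough sketch of the buffered-sender approach (not an official sample), something like the following should work. It assumes the same `service_endpoint`, `index_name` and `credential` values as in the question; the helper names `load_documents`, `make_sender` and `upload_buffered` are made up for illustration.

```python
import json


def load_documents(path):
    """Read the list of documents from a JSON file such as 'index.json'."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)


def make_sender(service_endpoint, index_name, credential):
    """Build a SearchIndexingBufferedSender for the given index.

    The SDK import is done lazily so that upload_buffered() below can be
    tried out with a stand-in object when the Azure SDK is not installed.
    """
    from azure.search.documents import SearchIndexingBufferedSender

    return SearchIndexingBufferedSender(
        endpoint=service_endpoint,
        index_name=index_name,
        credential=credential,
    )


def upload_buffered(documents, sender):
    """Queue every document on the sender and return how many were queued.

    `sender` is assumed to behave like SearchIndexingBufferedSender: a context
    manager with an upload_documents() method. Leaving the `with` block closes
    the sender, which sends whatever is still waiting in its buffer.
    """
    with sender:
        sender.upload_documents(documents=documents)
    return len(documents)
```

With the variables from the question this would be called as `upload_buffered(load_documents('index.json'), make_sender(service_endpoint, index_name, credential))`. Note that the return value only counts queued documents; it does not confirm that each one was indexed.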
Based on the information, I have reproduced the scenario.
I have tested with multiple JSON file sizes, and it seems the maximum allowed limit is just under 64 MB and 32,000 documents (indexing actions per request).
One possible solution is to split your data into smaller chunks before uploading.
Here’s a modified version of the upload code that splits the data into chunks of 10000 documents each:
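A minimal sketch of that chunked upload could look like this. It assumes `service_endpoint`, `index_name` and `credential` are defined as in the question; `chunked`, `upload_in_chunks` and `main` are illustrative names, not part of the Azure SDK.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_chunks(search_client, documents, chunk_size=10000):
    """Upload `documents` one chunk at a time.

    `search_client` is assumed to be a SearchClient (or anything exposing the
    same upload_documents() method). Returns the number of documents the
    service reported as successfully indexed.
    """
    succeeded = 0
    for number, chunk in enumerate(chunked(documents, chunk_size), start=1):
        results = search_client.upload_documents(documents=chunk)
        succeeded += sum(1 for result in results if result.succeeded)
        print(f"Chunk {number}: sent {len(chunk)} documents")
    return succeeded


def main(service_endpoint, index_name, credential, path="index.json"):
    """Load the JSON file and push it to the index in chunks."""
    # Imported here so the helpers above can be tried without the Azure SDK.
    from azure.search.documents import SearchClient

    with open(path, "r", encoding="utf-8") as file:
        documents = json.load(file)

    search_client = SearchClient(
        endpoint=service_endpoint,
        index_name=index_name,
        credential=credential,
    )
    uploaded = upload_in_chunks(search_client, documents, chunk_size=10000)
    print(f"Uploaded {uploaded} of {len(documents)} documents")
```

If a single chunk is still too large for one request (for example, when each document carries a long embedding vector), lowering `chunk_size` is the first thing to try.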
You can adjust the chunk size to suit your document and file sizes.