I am encountering an issue with atomic updates in my Solr core while using a batch-processing Python script. The script sends data in batches of 10,000 documents to the Solr core for atomic updates, but I run into a problem when one of the documents in a batch is removed from the core during the script's execution. This leads to a 400 error response, and none of the 10,000 documents in that batch get updated. I want to find a solution that prevents a single missing document from affecting the entire batch of atomic updates. It's worth mentioning that sending updates individually for each document is not feasible due to performance concerns.
Here’s how I’m currently sending the data:
import requests

headers = {"Content-type": "application/json"}
response = requests.post(solr_core_url, data=json_data, headers=headers)
where json_data contains the update data for all 10,000 documents. While this approach works most of the time, it is susceptible to failures when documents expire and are removed by another process while the script is running.
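For context, json_data is assembled roughly along these lines; the field name price and the sample IDs below are just placeholders for my real fields:

import json

# Illustrative only: each atomic update is a partial document that uses a
# "set" operation, and the whole batch is serialized as one JSON array.
updates = [("93961393634", 19.99), ("93961393635", 24.50)]  # (id, new value) pairs
docs = [{"id": doc_id, "price": {"set": value}} for doc_id, value in updates]
json_data = json.dumps(docs)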
The error message I encounter is as follows:
{"responseHeader":{"status":400,"QTime":20},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"[doc=93961393634] missing required field: datatype","code":400}}
My questions are:
How can I modify the script or my approach to handle situations where a single missing document causes the entire batch of atomic updates to fail with a 400 error?
Are there any techniques or strategies to ensure that the atomic updates are resilient to the possibility of some documents expiring during the script’s execution?
Additional info: I'm using Solr version 6.5.0.
2 Answers
The error
[doc=93961393634] missing required field: datatype
sounds like a field is missing in one of the documents. So one approach would be to make sure this field exists on the document before you include it in your json_data.
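If the underlying cause is that the target document has already been deleted from the core, one way to do that pre-check is to ask Solr which ids still exist just before you build json_data. Here is a rough sketch, not a drop-in solution: it assumes your pending updates sit in a list of dicts called docs (each carrying an id), that solr_core_url from your snippet ends with /update so the core URL can be recovered from it, and it uses Solr's real-time get handler (/get).

import requests

def existing_ids(core_url, ids, step=200):
    # Ask the real-time get handler which of these ids are still in the core,
    # checking in chunks so the request URL stays a reasonable length.
    found = set()
    for i in range(0, len(ids), step):
        resp = requests.get(core_url + "/get",
                            params={"ids": ",".join(ids[i:i + step]), "fl": "id", "wt": "json"})
        resp.raise_for_status()
        found.update(str(d["id"]) for d in resp.json()["response"]["docs"])
    return found

core_url = solr_core_url.rsplit("/update", 1)[0]   # assumes solr_core_url ends with /update
present = existing_ids(core_url, [str(d["id"]) for d in docs])
docs = [d for d in docs if str(d["id"]) in present]

There is still a small window between the check and the update, so this reduces the failures rather than eliminating them.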
Another approach would be to make the batches smaller, e.g. going from 10,000 down to, say, 500 records per batch. It might be a bit slower than submitting all 10,000 documents at once, but it shouldn't be too bad, and it will let you spot which batch has the bad records more easily.
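Something along these lines, reusing the requests call from your question (send_in_batches is just an illustrative helper, not part of any client library):

import json
import requests

headers = {"Content-type": "application/json"}

def send_in_batches(solr_core_url, docs, size=500):
    # Split the full list into slices of `size` and post each one separately,
    # so a bad document only fails its own slice.
    for i in range(0, len(docs), size):
        batch = docs[i:i + size]
        r = requests.post(solr_core_url, data=json.dumps(batch), headers=headers)
        if r.status_code != 200:
            print("batch starting at index", i, "failed:", r.text)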
Do you need Atomic Update in the first place? A common scenario for Atomic Update is having multiple indexing processes, each only having knowledge of parts of a document (e.g. for a shop, one process updating core product information and another only updating the price field(s)). Do you have something similar? If not, or if you experience these consistency/race issues frequently, I would look at other ways to update.
A strategy to handle failures due to missing documents (if the source of your atomic updates lags behind some higher-priority update from another process) would be to take the ID of the failed document from the Solr response, locate it in your batch, and split the batch at that position, i.e. send one new, smaller batch with the documents before that position and another with all documents after it. This can be applied repeatedly/recursively.
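A minimal sketch of that splitting idea, assuming each update dict has an id field, that solr_core_url is the update handler from your question, and that the offending id can be pulled out of the "[doc=...]" fragment of Solr's error message (as in the error you posted):

import json
import re
import requests

HEADERS = {"Content-type": "application/json"}

def post_with_splitting(solr_core_url, docs):
    # Post the whole batch; if Solr rejects it because of a single document,
    # drop that document and retry the halves before and after it.
    if not docs:
        return
    r = requests.post(solr_core_url, data=json.dumps(docs), headers=HEADERS)
    if r.status_code == 200:
        return
    msg = r.json().get("error", {}).get("msg", "")
    m = re.search(r"\[doc=([^\]]+)\]", msg)
    if not m:
        raise RuntimeError("batch failed for an unrelated reason: " + r.text)
    bad_id = m.group(1)
    idx = next((i for i, d in enumerate(docs) if str(d.get("id")) == bad_id), None)
    if idx is None:
        raise RuntimeError("could not find failed doc %s in the batch" % bad_id)
    post_with_splitting(solr_core_url, docs[:idx])      # documents before the bad one
    post_with_splitting(solr_core_url, docs[idx + 1:])  # documents after it

Each expired document then only costs a couple of extra requests, instead of losing the whole batch.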