I have a large JSON file (about 11,600 records) that I am trying to parse with ijson. However, the for loop breaks because of one faulty JSON record. Is there a way to continue the iteration by skipping that record and moving on, using ijson or any other Python library? Here's the code snippet:
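import traceback

import ijson

try:
    # Stream each element of the top-level "rows" array.
    for row in ijson.items(json_file, 'rows.item'):
        data = row
        try:
            super_df = dependency_latest_version(data, version)
        except Exception as e:
            print(e)
except ijson.common.IncompleteJSONError:
    # Raised by the parser itself, so it ends the whole loop.
    traceback.print_exc()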
This generates the following error:
for row in ijson.items(json_file, 'rows.item'):
ijson.common.IncompleteJSONError: parse error: after array element, I expect ',' or ']'
dmeFilename":"README.md"}}":{"integrity":"sha512-204Fg2wwe1Q
(right here) ------^
I tried iterating through the json file line by line and then using json.loads(line) but it didn’t help since the entire json file was being read as a single line. Are there any other alternatives? Thank you.
2 Answers
With ijson: no. While it might be obvious to you that there is a next element to jump to after a faulty element in an array, this is not obvious to the underlying parser: if there was a parsing error, you don't know what the fix is to begin with. And because there's no fix available, the parser can't make any further assumptions about what follows, so there is no longer any concept of a next element, or a new object, or anything else, because you are in an invalid state. The only thing the parser can do is give up.
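To illustrate why there is no "next element" after a parse error, here is a minimal sketch. It deliberately uses the standard library's json.JSONDecoder.raw_decode instead of ijson, only so that it runs without extra dependencies, and it needs the whole document in memory. The function name decode_until_error and the sample text are hypothetical. It decodes the elements of a top-level array one by one and stops at the first point where the text is no longer valid, returning whatever was parsed up to there.

```python
import json

_WS = " \t\r\n"


def _skip_ws(text, pos):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos] in _WS:
        pos += 1
    return pos


def decode_until_error(text):
    """Decode the elements of a top-level JSON array until the first error.

    Returns (items, error); error is None if the whole array was valid.
    """
    decoder = json.JSONDecoder()
    items = []
    pos = _skip_ws(text, 0)
    if pos >= len(text) or text[pos] != "[":
        return items, json.JSONDecodeError("expected '['", text, pos)
    pos = _skip_ws(text, pos + 1)
    if pos < len(text) and text[pos] == "]":
        return items, None  # empty array
    while True:
        try:
            value, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError as exc:
            return items, exc
        items.append(value)
        pos = _skip_ws(text, pos)
        if pos < len(text) and text[pos] == ",":
            pos = _skip_ws(text, pos + 1)
        elif pos < len(text) and text[pos] == "]":
            return items, None
        else:
            # Same situation as the yajl message: after an array element,
            # only ',' or ']' makes sense. Anything else is unrecoverable.
            return items, json.JSONDecodeError("expected ',' or ']'", text, pos)


# Hypothetical sample: the second record is followed by stray characters.
sample = '[{"id": 1}, {"id": 2}}": {"x": 3}}, {"id": 3}]'
items, error = decode_until_error(sample)
print(items)  # the records decoded before the fault
print(error)  # where and why decoding stopped
```

Note that the last record in the sample is never reached: once the text stops being valid JSON, the decoder has no reliable way to tell where that record starts.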
In your case, the error message you get (emitted by the yajl library, which ijson uses internally) says that it's expecting a ',' or a ']' to respectively continue or end a list of elements that hasn't been correctly continued or ended. Note that this list is not necessarily the list you're iterating over; it's hard to tell from the original question, as no example JSON document is shown. As such, you can't expect the parser to handle the error automatically and "skip to the next element" because, as mentioned before, the parser can't guess what the rest of the stream is supposed to be.

While I maintain ijson and don't know the particulars of other libraries, I'm pretty sure they'll all have similar issues. Your best bet is to fix the document where it's being generated. Alternatively, if you know exactly where your issue is, you can "patch" the document before passing it to ijson; see https://github.com/ICRAR/ijson/issues/33#issuecomment-698266199 and https://github.com/ICRAR/ijson/issues/25#issuecomment-610214101 for ideas and examples. Note that, strictly speaking, at the moment you are not handling a JSON document; you are handling a document that looks very much like a JSON document.

I faced a similar situation once. In the end I had to fix the document before parsing, as @Rodrigo mentioned. It is pointless to try to fix the library, since the library is doing exactly what it is supposed to do: it should not be able to parse the document unless it is a proper JSON document.

The way I fixed it in my case was to look for a pattern in the places where the supposed JSON document had improper formatting, and to write a script to fix those. After this preprocessing it becomes a proper JSON document, and at that point it can be parsed with ijson or any similar library.
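A preprocessing script of this kind could look roughly like the sketch below. The two defects it repairs (a trailing comma before a closing bracket, and a missing comma between adjacent objects) are hypothetical examples, not the defect from the question, and the names FIXES, preprocess and load_patched are made up for illustration. It also assumes the whole document fits in memory and that the patterns never occur inside string values, since a plain regular expression cannot tell the difference.

```python
import json
import re

# Hypothetical defects, for illustration only. Replace these with the
# patterns actually found in your own document.
FIXES = [
    # Trailing comma before a closing bracket or brace: [1, 2,] -> [1, 2]
    (re.compile(r",\s*([\]}])"), r"\1"),
    # Missing comma between adjacent objects: }{ -> },{
    (re.compile(r"\}\s*\{"), "},{"),
]


def preprocess(text, fixes=FIXES):
    """Apply each (pattern, replacement) fix in order and return the result."""
    for pattern, replacement in fixes:
        text = pattern.sub(replacement, text)
    return text


def load_patched(text, fixes=FIXES):
    """Patch the text, then parse it.

    Raises json.JSONDecodeError if the patched text is still not valid JSON,
    which means there is a defect the fixes do not cover yet.
    """
    return json.loads(preprocess(text, fixes))


broken = '{"rows": [{"id": 1}{"id": 2}, {"id": 3},]}'
print(load_patched(broken)["rows"])
```

In practice the output of preprocess would be written to a new file, which can then be parsed with ijson exactly as in the question.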