
I have a large JSON file (about 11,600 records) and I am trying to parse it using ijson. However, the for loop breaks because of one faulty JSON record. Is there a way to continue the iteration by skipping that record and moving on using ijson or any other Python library? Here’s the code snippet.

import traceback

import ijson

# json_file, version and dependency_latest_version are defined elsewhere
try:
    for row in ijson.items(json_file, 'rows.item'):
        data = row
        try:
            super_df = dependency_latest_version(data, version)
        except Exception as e:
            print(e)
except ijson.common.IncompleteJSONError:
    traceback.print_exc()

This generates the following error:

for row in ijson.items(json_file, 'rows.item'):
ijson.common.IncompleteJSONError: parse error: after array element, I expect ',' or ']'
          dmeFilename":"README.md"}}":{"integrity":"sha512-204Fg2wwe1Q
                     (right here) ------^

I tried iterating through the JSON file line by line and calling json.loads(line) on each line, but that didn’t help because the entire file is read as a single line. Are there any other alternatives? Thank you.

2 Answers


  1. Is there a way to continue the iteration by skipping that record and moving on using ijson […]?

    With ijson: no.

    While it might be obvious to you that there is a next element to jump to after a faulty element in an array, that is not obvious to the underlying parser: if there was a parsing error, it doesn’t know what the fix is to begin with. And because no fix is available, the parser can’t make any further assumptions about what follows, so there is no longer any concept of a next element, a new object, or anything else: the parser is in an invalid state. The only thing it can do is give up.

    In your case, the error message (emitted by the yajl library, which ijson uses internally) says that the parser expects a , or a ] to respectively continue or end a list of elements that hasn’t been correctly continued or ended. Note that this list is not necessarily the list you’re iterating over; it is hard to tell from the question, as no example JSON document is shown. As such, you can’t expect the parser to handle the error automatically and "skip to the next element" because, as mentioned before, the parser can’t guess what the rest of the stream is supposed to be.

    Are there any other alternatives?

    I maintain ijson and don’t know the particulars of other libraries, but I’m pretty sure they’ll all have similar issues. Your best bet is to fix the document where it’s being generated. Alternatively, if you know exactly where the issue is, you can "patch" the document before passing it to ijson; see https://github.com/ICRAR/ijson/issues/33#issuecomment-698266199 and https://github.com/ICRAR/ijson/issues/25#issuecomment-610214101 for ideas and examples. Note that, strictly speaking, at the moment you are not handling a JSON document; you are handling a document that looks very much like one.
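As a rough illustration of the "patch before parsing" idea, here is a minimal sketch of a file-like wrapper that replaces one known-bad byte sequence while the data is being read. It is not taken from the linked issues: the class name PatchingReader, its parameters, and the sample document are all hypothetical, and it assumes you have already worked out by inspection exactly which bytes are wrong and what they should be.

```python
import io
import json


class PatchingReader:
    """File-like wrapper that swaps a known-bad byte sequence for a fixed one.

    Hypothetical helper: `bad` and `good` must come from inspecting your own
    document; nothing here detects errors automatically.
    """

    def __init__(self, raw, bad: bytes, good: bytes, chunk_size: int = 64 * 1024):
        if not bad:
            raise ValueError("bad must be a non-empty byte sequence")
        self._raw = raw
        self._bad = bad
        self._good = good
        self._chunk_size = chunk_size
        self._pending = b""  # unpatched tail that may hold the start of a match
        self._out = b""      # patched bytes ready to be handed to the caller
        self._eof = False

    def _fill(self) -> None:
        chunk = self._raw.read(self._chunk_size)
        if not chunk:
            # End of input: the held-back tail is too short to contain a match.
            self._out += self._pending
            self._pending = b""
            self._eof = True
            return
        *done, rest = (self._pending + chunk).split(self._bad)
        # Every piece in `done` was followed by a complete match.
        for piece in done:
            self._out += piece + self._good
        # Hold back len(bad) - 1 bytes in case a match straddles two chunks.
        cut = max(len(rest) - (len(self._bad) - 1), 0)
        self._out += rest[:cut]
        self._pending = rest[cut:]

    def read(self, size: int = -1) -> bytes:
        if size is None:
            size = -1
        while not self._eof and (size < 0 or len(self._out) < size):
            self._fill()
        if size < 0:
            result, self._out = self._out, b""
        else:
            result, self._out = self._out[:size], self._out[size:]
        return result


# Made-up sample: a stray "}" after the first record breaks the array.
broken = b'{"rows": [{"a": 1}}, {"a": 2}]}'
reader = PatchingReader(io.BytesIO(broken), bad=b"}},", good=b"},", chunk_size=4)
print(json.loads(reader.read()))
```

The standard json module is used here only to keep the sketch free of third-party imports. Since the wrapper exposes read(), it should be possible to hand it to ijson in place of the original file, along the lines of ijson.items(PatchingReader(f, bad, good), 'rows.item'), but treat that as an assumption to check against your ijson version. A blanket replacement like the one above would also damage any valid occurrence of the same bytes, so it only makes sense when the bad sequence is specific enough.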

  2. I faced a similar situation once, and in the end I had to fix the document before parsing, as @Rodrigo mentioned. It is pointless to try to fix the library: it is doing exactly what it is supposed to do, and it should not be able to parse the document unless it is a proper JSON document.

    In my case, I looked for patterns where the supposed JSON document was improperly formatted and wrote a script to fix them. After this preprocessing, the document becomes proper JSON and can be parsed with ijson or any similar library.
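A preprocessing script of that kind might look roughly like the sketch below. It is not the script from this answer: the function names, the sample text, and the regular expression are made up for illustration, and it assumes the file is small enough to load into memory as a string. The first helper uses the position reported by json.JSONDecodeError to show the text around the first syntax error, which helps in spotting the pattern; the second applies whatever substitutions you settle on.

```python
import json
import re
from typing import Iterable, Optional, Tuple


def error_context(text: str, width: int = 40) -> Optional[str]:
    """Return the text around the first JSON syntax error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - width, 0)
        return text[start:err.pos + width]
    return None


def apply_fixes(text: str, rules: Iterable[Tuple[str, str]]) -> str:
    """Apply each (regex pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Made-up sample: a stray "}" after the first record breaks the array.
broken = '{"rows": [{"a": 1}}, {"a": 2}]}'
print(error_context(broken))

# Hypothetical rule, valid for this sample only; a real document with nested
# objects would need a much more specific pattern.
fixed = apply_fixes(broken, [(r"\}\},", "},")])
print(error_context(fixed))
```

Re-running the check after each round of fixes shows whether another malformed spot remains, since the parser only ever reports the first error it hits.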
