
I am using the following code to parse multiline JSON objects, separated by commas, from a web-scraped string stored in a .json file:

import json

def stream_read_json(fn):
    start_pos = 0
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str, encoding = 'utf-8')
                start_pos += e.pos
                yield obj

The first object is parsed correctly; the following ones are not.
While testing different values for f.seek(start_pos), I noticed that the index reported by except json.JSONDecodeError as e: is inconsistent. Why does this index differ from the number of characters my IDE shows when I select the text up to the character where the JSON object ends in the file?

How can I ensure the objects will be parsed correctly?

I tried to find the right f.seek(start_pos) value for the second JSON object at the debug prompt, but it differs greatly from the e.pos reported by the error.

A sample JSON is here:

{
  "user": {
    "id": 1,
    "profile": {
      "name": "Alice",
      "age": 30
    }
  },
  "product": {
    "sku": "A1234",
    "details": {
      "name": "Laptop",
      "price": 999.99
    }
  }
},
{
  "user": {
    "id": 2,
    "profile": {
      "name": "Bob",
      "age": 22
    }
  },
  "product": {
    "sku": "A123w",
    "details": {
      "name": "Laptop",
      "price": 9.99
    }
  }
}

2 Answers


  1. This is definitely not how it should be done, but I’ll suggest a workaround for your particular situation.

    json.load(f) raises JSONDecodeError: Extra data: line

    The problem is that your "json" is not valid JSON: the objects are not enclosed in the brackets [] of a list, so the file is a bare sequence of objects that repeat the same keys. As a workaround, you can wrap the file contents in brackets before parsing:

    import json
    
    with open("test.json", "r") as file:
        str_data: str = file.read()
        data: list[dict] = json.loads(f"[{str_data}]")
    
    for item in data:
        ...
    
  2. Yes, this can be done, but I don’t recommend it.

    Your problems are:

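    1. You pass encoding='utf-8' to json.loads(). That keyword was removed in Python 3.9, so the call raises a TypeError on current versions.
    2. You ignore the comma that separates your JSON objects, so every read after the first starts at the comma instead of at the next object.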

    I’ve tested this code:

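    import json

    def stream_read_json(fn):
        start_pos = 0  # offset where the next object starts
        with open(fn, 'r', encoding='utf-8') as f:
            while True:
                try:
                    # Succeeds only when a single object remains.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # e.pos is relative to the text json.load() just read,
                    # i.e. counted from start_pos, not from the file start.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    yield obj
                    f.read(1)  # skip the comma between objects
                    start_pos += e.pos + 1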
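
To make the role of e.pos and the skipped comma concrete, here is a minimal sketch on an in-memory string rather than a file. The sample text and the names pos, first and rest are illustrative assumptions, not part of the question's data. For an "Extra data" error, e.pos is the index, within the string that was just parsed, of the first character after the first complete object:

```python
import json

# Two comma-separated objects, not wrapped in a list (illustrative sample).
text = '{"a": 1},\n{"a": 2}'

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # Index of the first character after the first complete object,
    # which here is the separating comma.
    pos = e.pos

# Everything before pos is the first object.
first = json.loads(text[:pos])

# Skipping one extra character (the comma) leaves the second object.
rest = text[pos + 1:]
second = json.loads(rest)
```

Because the index is counted from the start of the string that was parsed, it has to be added to the position where that read began, plus one for the comma, which is what the loop above does.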
    

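    I’ve also tested this code, which passes only values returned by f.tell() to f.seek(), as the documentation recommends. In text mode, seek() offsets are opaque values tied to byte positions, while e.pos counts characters, so the two can diverge when the file contains multi-byte characters or translated line endings.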

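    import json

    def stream_read_json(fn):
        with open(fn, 'r', encoding='utf-8') as f:
            while True:
                try:
                    # Remember where this read starts, in seek()-compatible form.
                    start_pos = f.tell()
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # Go back and read only the characters of the first object.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    yield obj
                    f.read(1)  # skip the comma between objects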
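
As an alternative to seeking back and forth in the file, the standard library's json.JSONDecoder.raw_decode() can parse one object at a time from a string and report where it stopped. This is a different technique from the two versions above, sketched under the assumption that the whole file fits in memory; the function name iter_json_objects is hypothetical:

```python
import json

def iter_json_objects(text):
    """Yield each JSON value from a string of comma-separated values."""
    decoder = json.JSONDecoder()
    idx = 0
    n = len(text)
    while idx < n:
        # raw_decode() does not skip leading whitespace, so skip it here,
        # together with the commas that separate the objects.
        while idx < n and (text[idx].isspace() or text[idx] == ','):
            idx += 1
        if idx >= n:
            break
        # Returns the parsed value and the index just past its end.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

sample = '{"id": 1},\n{"id": 2}\n'
objects = list(iter_json_objects(sample))
```

With a file, one could read its contents once with f.read() and pass the resulting string to the function. All indices stay in character units, so no seek() offsets are involved.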
    