Inconsistency of character indexes while trying to parse multiple JSON objects in a file

MartinHorst
July 25, 2024
109 views
1 vote
2 Answers

I am using the following code to parse multiline JSON objects, separated by commas, from a web-scraped string stored in a .json file:

```
import json

def stream_read_json(fn):
    start_pos = 0
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str, encoding = 'utf-8')
                start_pos += e.pos
                yield obj
```

The first object is parsed correctly; the following ones are not.
While testing various values for f.seek(start_pos), I noticed that they are inconsistent with the index reported by the JSONDecodeError (e.pos). Why does this index differ from the number of characters my IDE shows when I select the text up to the character where the JSON object ends in the file?

How can I ensure the objects will be parsed correctly?

I tried to find the right f.seek(start_pos) value for the second JSON object at the debug prompt, but it differs greatly from the e.pos reported by the error.

A sample JSON is here:

```
{
  "user": {
    "id": 1,
    "profile": {
      "name": "Alice",
      "age": 30
    }
  },
  "product": {
    "sku": "A1234",
    "details": {
      "name": "Laptop",
      "price": 999.99
    }
  }
},
{
  "user": {
    "id": 2,
    "profile": {
      "name": "Bob",
      "age": 22
    }
  },
  "product": {
    "sku": "A123w",
    "details": {
      "name": "Laptop",
      "price": 9.99
    }
  }
}
```

Answers

- VictorEgiazarian
- July 25, 2024 at 11:05 pm
- 0 votes
This is definitely not how it should be done, but I’ll suggest a workaround for your specific situation.

json.load(f) raises JSONDecodeError: Extra data: line

The problem is that your "JSON" is not really valid JSON, because the brackets [] for a list of objects are missing and a lot of keys are duplicated. As a workaround, you can do the following:
```
import json

with open("test.json", "r", encoding="utf-8") as file:
    str_data: str = file.read()
    # Wrap the comma-separated objects in brackets to form a valid JSON array.
    data: list[dict] = json.loads(f"[{str_data}]")

for item in data:
    ...
```
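To see the error and the workaround without touching a file, here is a minimal sketch on an in-memory string. The sample text `raw` and the name `error_pos` are made up for illustration; the point is that the position reported with "Extra data" is a character index into the text that was parsed.

```python
import json

# Two objects separated by a comma, with no enclosing brackets (not valid JSON).
raw = '{"id": 1},\n{"id": 2}'

try:
    json.loads(raw)
except json.JSONDecodeError as e:
    # e.pos is a character index into `raw`; here it lands on the comma.
    error_pos = e.pos
    print(e.msg, error_pos, repr(raw[error_pos]))

# Wrapping the text in brackets turns it into a valid JSON array.
data = json.loads(f"[{raw}]")
print(len(data))
```

Note that this only works if the text has no trailing comma after the last object.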
Login or Signup to reply.

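Yes, this can be done, but I don’t recommend it.

Your problems are:

- You pass encoding='utf-8' to json.loads(), which no longer accepts that argument (it was removed in Python 3.9).
- You ignore the comma that separates your JSON objects, so the next read begins at the comma instead of at the next object.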

I’ve tested this code:

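```
import json

def stream_read_json(fn):
    start_pos = 0
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                # Succeeds only when a single object is left in the file.
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                # e.pos is the index of the separating comma in the text just read.
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)  # no encoding argument
                yield obj
                f.read(1)  # skip the comma
                start_pos += e.pos + 1
```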
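One caveat with the version above: e.pos counts characters of the decoded text, while a text-mode f.seek() expects a position obtained from f.tell(). The two only happen to agree for plain ASCII content. Below is a rough sketch of how they drift apart; the temporary file, the sample string, and the names `char_pos` and `file_pos` are assumptions made for the demonstration.

```python
import json
import os
import tempfile

# Hypothetical sample: the first object contains one non-ASCII character (U+00EB).
raw = '{"name": "Zo\u00eb"},\n{"name": "Bob"}'

with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".json", delete=False
) as tmp:
    tmp.write(raw)
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    try:
        json.load(f)
    except json.JSONDecodeError as e:
        char_pos = e.pos   # index into the decoded text (characters)
    f.seek(0)
    f.read(char_pos)       # read exactly that many characters
    file_pos = f.tell()    # file position; with UTF-8 in CPython, a byte offset

os.remove(path)
print(char_pos, file_pos)
```

The non-ASCII character takes two bytes in UTF-8, so the file position ends up one larger than the character index. Windows line endings can cause a similar mismatch.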

I’ve also tested this code, which only passes values returned by f.tell() to f.seek(), as recommended in the documentation.

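```
import json

def stream_read_json(fn):
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                # Remember where this object starts, as reported by the file itself.
                start_pos = f.tell()
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)  # text up to the separating comma
                obj = json.loads(json_str)
                yield obj
                f.read(1)  # skip the comma
```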
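If the file fits in memory, a different technique (not the seek-based approach tested above) avoids file positions altogether: json.JSONDecoder.raw_decode() parses one value from a string and returns the index just past it, so the loop can step over commas and whitespace itself. This is only a sketch; `iter_json_objects` is a hypothetical name, and skipping commas between values is an assumption about how the file is laid out.

```python
import json
from typing import Any, Iterator


def iter_json_objects(text: str) -> Iterator[Any]:
    """Yield each JSON value from text holding comma-separated values."""
    decoder = json.JSONDecoder()
    idx = 0
    while True:
        # Skip whitespace and separating commas before the next value.
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text):
            return
        # raw_decode returns the parsed value and the index just past it.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj


sample = '{"id": 1},\n{"id": 2},\n'
print(list(iter_json_objects(sample)))
```

To use it on a file, read the whole content first, for example with open(fn, encoding='utf-8') and f.read(), and pass the resulting string to the function.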
