
Inconsistency of character indexes while trying to parse multiple JSON in a file

MartinHorst
July 25, 2024
120 views
1 vote
2 Answers

I am using the following code to parse multiline JSON objects, separated by commas, from a web-scraped string stored in a .json file:

import json

def stream_read_json(fn):
    start_pos = 0
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str, encoding = 'utf-8')
                start_pos += e.pos
                yield obj

The first object is parsed correctly; the next ones are not.
While testing different values for f.seek(start_pos), I noticed that they are inconsistent with the index reported by the json.JSONDecodeError (e.pos). Why does this index differ from the character count my IDE shows when I select the text from the start of the file up to the character where the JSON object ends?

How can I ensure the objects will be parsed correctly?

At the debug prompt I tried to find the f.seek(start_pos) value that works for the second JSON object, but it differs greatly from the e.pos reported by the error.

A sample JSON is here:

{
  "user": {
    "id": 1,
    "profile": {
      "name": "Alice",
      "age": 30
    }
  },
  "product": {
    "sku": "A1234",
    "details": {
      "name": "Laptop",
      "price": 999.99
    }
  }
},
{
  "user": {
    "id": 2,
    "profile": {
      "name": "Bob",
      "age": 22
    }
  },
  "product": {
    "sku": "A123w",
    "details": {
      "name": "Laptop",
      "price": 9.99
    }
  }
}

Answers

- VictorEgiazarian
- July 25, 2024 at 11:05 pm
- 0 votes
This is definitely not how it should be done, but I'll suggest a workaround for your specific situation.

json.load(f) raises JSONDecodeError: Extra data: line

The problem is that your "JSON" is not really valid JSON: the square brackets [] that would make it a list of objects are missing, and there are a lot of duplicated keys. As a workaround, you can do the following:
```
import json

with open("test.json", "r") as file:
    str_data: str = file.read()
    data: list[dict] = json.loads(f"[{str_data}]")

for item in data:
    ...
```

Yes, this can be done, but I don't recommend it.

Your problems are:

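You passed encoding='utf-8' to json.loads(). That keyword argument was removed in Python 3.9, so the call raises a TypeError there.
You ignore the comma that separates your JSON objects, so the next read starts at the comma instead of at the next object.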
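
To make the second point concrete, here is a minimal sketch on an in-memory string rather than a file. The two-object sample and the names text, comma_index, first and second are illustrative assumptions, not taken from the question. It shows that e.pos lands on the separating comma, so parsing has to resume one character later.

```python
import json

# Two objects separated by a comma, mirroring the layout in the question.
text = '{"id": 1},\n{"id": 2}'

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # "Extra data" is reported at the first character after the first
    # complete object (and any whitespace following it): the comma.
    comma_index = e.pos
    first = json.loads(text[:comma_index])

# Resuming at comma_index would hand the parser a leading comma,
# so the next object has to be read from one character later.
second = json.loads(text[comma_index + 1:])
```

If the comma is not skipped, the next json.load() call fails right at its first character, which is probably why only the first object came out correctly.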

I’ve tested this code:

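import json

def stream_read_json(fn):
    start_pos = 0  # character offset of the object currently being parsed
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                # Succeeds only when the rest of the file is a single object.
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                # e.pos is where the extra data (the comma) begins,
                # counted from where json.load() started reading.
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)  # no encoding argument
                yield obj
                f.read(1)  # skip the separating comma
                start_pos += e.pos + 1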
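
One caveat about adding up e.pos values by hand: e.pos is an index into the decoded string, while a file position is, loosely speaking, a byte position. The two agree only for pure-ASCII content with single-character line endings. The sketch below uses an assumed sample string with one non-ASCII character (char_index and byte_offset are illustrative names) to show the two counts drifting apart.

```python
import json

# "é" is one character in the str but two bytes once encoded as UTF-8.
text = '{"name": "Zoé"},\n{"name": "Bob"}'

try:
    json.loads(text)
except json.JSONDecodeError as e:
    char_index = e.pos  # index into the decoded string

# Number of UTF-8 bytes that precede the same spot in the encoded data.
byte_offset = len(text[:char_index].encode("utf-8"))
```

Line endings can cause the same kind of drift: in text mode on Windows, "\r\n" on disk is read as a single "\n", so each line break counts once in e.pos but occupies two bytes in the file. An editor's selection count may follow either convention, which is one plausible reason the numbers in the question did not match.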

I've also tested this version, which passes only values returned by f.tell() to f.seek(), as the documentation recommends:

import json

def stream_read_json(fn):
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                # Remember where this object starts, as reported by the file.
                start_pos = f.tell()
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                yield obj
                f.read(1)  # skip the separating comma
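
If the file fits in memory, seeking can be avoided altogether. The sketch below is an alternative technique, not the approach used above: it relies on json.JSONDecoder.raw_decode(), which parses one value starting at a given index and returns the index where that value ends. The function name iter_json_objects and the sample string are assumptions made for illustration.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from comma-separated text."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode() does not skip leading whitespace, so step over
        # whitespace and separating commas before each value.
        while pos < end and (text[pos].isspace() or text[pos] == ","):
            pos += 1
        if pos >= end:
            return
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

sample = '{"id": 1},\n{"id": 2}\n'
objects = list(iter_json_objects(sample))
```

For a file, the text would come from something like open(fn, encoding='utf-8').read(). Because every index refers to the same in-memory string, the character-versus-file-position question does not come up.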
