I am using the following code to parse multiline JSON objects, separated by commas, from a web-scraped string stored in a .json file:
import json

def stream_read_json(fn):
    start_pos = 0
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str, encoding = 'utf-8')
                start_pos += e.pos
                yield obj
The first object is parsed correctly; the next ones are not.
While testing random values for f.seek(start_pos), I see an inconsistency with the index reported by except json.JSONDecodeError as e. Why is this index different from the number of characters shown when I select, in the IDE, the text up to the character where the JSON object ends in the file?
How can I ensure the objects will be parsed correctly?
I tried to find the right f.seek(start_pos) value for the second JSON object at the debug prompt, but it differs greatly from the e.pos reported by the error.
A sample JSON is here:
{
    "user": {
        "id": 1,
        "profile": {
            "name": "Alice",
            "age": 30
        }
    },
    "product": {
        "sku": "A1234",
        "details": {
            "name": "Laptop",
            "price": 999.99
        }
    }
},
{
    "user": {
        "id": 2,
        "profile": {
            "name": "Bob",
            "age": 22
        }
    },
    "product": {
        "sku": "A123w",
        "details": {
            "name": "Laptop",
            "price": 9.99
        }
    }
}
2 Answers
This is definitely not how it should be done, but I’ll suggest a workaround for your particular situation.
The problem is that your "JSON" is not really JSON: the brackets [] that should enclose the list of objects are missing, and there are a lot of duplicated keys. But as a workaround you can do the following:

Yes, this can be done, but I’m not recommending this.
Your problems are:
Passing encoding='utf-8' to json.loads()
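For context on the point above: json.loads() operates on an already decoded str, and the encoding keyword was removed in Python 3.9, so the encoding belongs only in open(). The following is a rough sketch, not the code from this answer, of one way to avoid both that argument and any file offsets. It swaps in a different technique: read the text once and walk it with json.JSONDecoder.raw_decode, which returns the parsed value together with the index where it ended. The function name iter_json_objects is hypothetical, and the sketch assumes the whole file fits in memory.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value found in comma-separated text.

    Hypothetical helper: assumes values are separated only by
    whitespace and commas, as in the sample from the question.
    """
    decoder = json.JSONDecoder()
    idx = 0
    n = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so skip it here,
        # together with the commas that separate the objects.
        while idx < n and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= n:
            return
        # Returns the decoded value and the index just past it.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
```

It could be used as list(iter_json_objects(open(fn, encoding='utf-8').read())). All indices here are character positions within one str, so they never have to be reconciled with positions in the file.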
I’ve tested this code:
I’ve also tested this code, which only uses values recovered from f.tell() as the parameter to f.seek(), as recommended in the documentation.
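A minimal sketch of that idea follows. It is not the listing from this answer, and it assumes the objects are separated only by commas and whitespace. In text mode, f.seek() expects an opaque value previously returned by f.tell(), whereas e.pos is a character index into the string that json.load() read. Multi-byte UTF-8 characters and translated \r\n line endings make the two disagree, which would explain the mismatch described in the question. The sketch therefore seeks only to tell() values and uses e.pos only as a character count for f.read(). The helper name seek_next_value is hypothetical.

```python
import json

SEPARATORS = " \t\r\n,"

def seek_next_value(f):
    """Skip whitespace and commas.

    Returns a tell() value pointing at the next significant character,
    or None at end of file. Hypothetical helper for this sketch.
    """
    while True:
        pos = f.tell()      # opaque position, only meaningful to f.seek()
        ch = f.read(1)
        if ch == "":
            return None
        if ch not in SEPARATORS:
            f.seek(pos)     # step back so the character is read again
            return pos

def stream_read_json(fn):
    with open(fn, "r", encoding="utf-8") as f:
        while True:
            start = seek_next_value(f)
            if start is None:
                return
            try:
                # Succeeds only when a single object remains in the file.
                obj = json.load(f)
            except json.JSONDecodeError as e:
                # e.pos counts characters from `start`, and f.read(n)
                # also counts characters, so the two are consistent.
                f.seek(start)
                obj = json.loads(f.read(e.pos))
            yield obj
```

Each attempt re-reads the rest of the file, so the cost grows roughly quadratically with the number of objects. That is acceptable for small files, but for large ones it is simpler to wrap the content in [] or to decode from a single in-memory string.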