I have a script that collects JSON data from Twitter’s API and parses it with jq every minute.
The results are appended to a single file that ends up looking like the following:
[
{"text": "Tweet 01",
"id": "001"
},
{"text": "Tweet 02",
"id": "002"
},
{"text": "Tweet 03",
"id": "003"
}
]
[
{"text": "Tweet 04",
"id": "004"
},
{"text": "Tweet 05",
"id": "005"
},
{"text": "Tweet 06",
"id": "006"
},
{"text": "Tweet 07",
"id": "007"
},
{"text": "Tweet 08",
"id": "008"
}
]
[
{"text": "Tweet 09",
"id": "009"
},
{"text": "Tweet 10",
"id": "010"
}
]
I’ve previously had a single list of JSON data per file, and Pandas can easily work with one list in a file. But how can I efficiently iterate over these multiple lists, which are NOT comma-separated and NOT necessarily the same length?
My ultimate goal is to aggregate ALL the JSON data from this one file and convert it to a CSV file, where each column is a key in the JSON data. It should end up looking like:
text, id
Tweet 01, 001
Tweet 02, 002
Tweet 03, 003
Tweet 04, 004
Tweet 05, 005
Tweet 06, 006
Tweet 07, 007
Tweet 08, 008
Tweet 09, 009
Tweet 10, 010
If I try reading the file as-is, the following occurs:
>>> import pandas as pd
>>> df = pd.read_json("sample.json")
>>> df.head()
Traceback (most recent call last):
File "lists.py", line 3, in <module>
df = pd.read_json("sample.json")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 214, in wrapper
return func(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 608, in read_json
result = json_reader.read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 731, in read
obj = self._get_object_parser(self.data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
obj = FrameParser(json, **kwargs).parse()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
self._parse_no_numpy()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None
ValueError: Trailing data
3 Answers
You can clean the file up line by line:

- .readlines() reads the file in as a list of strings, one per line
- str.strip() removes the newline from each line
- str.join() combines the items of that list into a single string
- str.replace() replaces each '][' with ','
- ast.literal_eval() converts the cleaned string back to a list

As @Trenton McKinney said, you need to clean the data. So you could use f.read() to get the whole file as a string, then str.replace() to replace each ']\n[' with ',', because the ']\n[' between the lists is what causes the "Trailing data" error, and finally you could try pd.read_json on the cleaned string.
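A minimal sketch of the readlines/ast.literal_eval recipe; the inline string here stands in for the "sample.json" file you would open on disk:

```python
import ast
import io

# Stand-in for the collected file; use open("sample.json") in practice.
raw = """[
{"text": "Tweet 01",
"id": "001"
}
]
[
{"text": "Tweet 02",
"id": "002"
}
]
"""

# Read each line, strip the newlines, and join everything into one string,
# so the lists become back-to-back: ...}][{...
with io.StringIO(raw) as f:
    joined = "".join(line.strip() for line in f.readlines())

# Replace the '][' boundaries so the whole thing is one list,
# then let ast.literal_eval turn it back into Python objects.
records = ast.literal_eval(joined.replace("][", ","))
```

From there, pd.DataFrame(records).to_csv("tweets.csv", index=False) would produce the CSV shown in the question.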
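The f.read()/str.replace() approach might look like the following sketch; again the inline string stands in for reading "sample.json", and dtype=False is passed so the zero-padded ids stay strings:

```python
from io import StringIO

import pandas as pd

# Stand-in for open("sample.json").read().
raw = """[
{"text": "Tweet 01",
"id": "001"
}
]
[
{"text": "Tweet 02",
"id": "002"
}
]
"""

# The lists are glued together as "]\n[", which is the "Trailing data"
# pandas complains about; turning each boundary into "," yields one list.
cleaned = raw.replace("]\n[", ",")

# dtype=False disables dtype inference so "001" is not coerced to 1.
df = pd.read_json(StringIO(cleaned), dtype=False)

df.to_csv("tweets.csv", index=False)
```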
As long as the original json file fits in memory, you can use json.JSONDecoder.raw_decode to pull the json lists out, one at a time. raw_decode returns the parsed object along with the length of data consumed by the parse, so you can trim the string and continue until no data remains.
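That loop can be sketched like so; the inline string stands in for the whole file read into memory:

```python
import json

# Stand-in for open("sample.json").read().
raw = """[
{"text": "Tweet 01",
"id": "001"
}
]
[
{"text": "Tweet 02",
"id": "002"
}
]
"""

decoder = json.JSONDecoder()
records = []
text = raw.strip()
while text:
    # raw_decode parses one JSON value from the front of the string and
    # returns (object, number_of_characters_consumed).
    obj, consumed = decoder.raw_decode(text)
    records.extend(obj)
    # Trim off what was parsed (plus surrounding whitespace) and repeat.
    text = text[consumed:].strip()
```

Once every list has been consumed, records holds all the tweets and can be handed to pd.DataFrame(records).to_csv("tweets.csv", index=False).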