I have a JSON file (tweetObject.json) including ~600 lines where each line is a response from the Twitter API, which contains 100 tweets or so along with their metadata.
My Questions:
- How to extract specific tweet attributes, e.g., username, from my JSON file? (I was thinking about loading the JSON into a pandas dataframe where each column stores only one attribute/field and then selecting the specific attribute I need. But I’m open to any other solution as well.)
- How to load the JSON file into a pandas dataframe? I used json.load, but I got the error JSONDecodeError: Extra data: line 2 column 1 (char 173419). After some research, I found that the reason for this error is probably that json.load does not decode multiple JSON objects.
- I also have a flattened version of my JSON file which keeps 1 tweet per line. I tried json.load with this file too, but I still get the same error.
Here is my code to load the json:
    import json

    with open('tweetObject_v2.json') as json_file:
        data_list = json.load(json_file)
Sorry, I didn’t include a sample of the tweet object JSON because even one line of this file is too long. You can find a sample Twitter API (v2) response here: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/example-payloads
2 Answers
OK, I finally figured it out. I hope it helps others with similar questions.
Answer to question 2: How to load a JSON file with multiple JSON objects into a pandas dataframe?
Since json.loads doesn't decode multiple JSON objects in one call, I looped through the file, parsed it line by line, and stored the results in a list. Next, I converted the list to a pandas dataframe (df1).
Checking the columns of df1, I found that some columns contain arrays or objects (i.e., there was no separate column for each attribute). For example, the column author keeps the author object, which includes id (the user id), username, etc.
Answer to question 1: How to extract/access specific tweet attributes in a tweet JSON file?
In order to access specific attributes (e.g., username), I used json_normalize, which expands nested objects into separate columns.
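A minimal sketch of the two steps above, not necessarily the original code. The in-memory sample stands in for the flattened file, and its field names (id, text, author.id, author.username) are assumptions based on the columns described above:

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for a flattened file with one tweet per line.
# With a real file, use open('tweetObject_v2.json') instead of StringIO.
sample_file = io.StringIO(
    '{"id": "1", "text": "hello", "author": {"id": "10", "username": "alice"}}\n'
    '{"id": "2", "text": "world", "author": {"id": "11", "username": "bob"}}\n'
)

# Parse one JSON object per line instead of calling json.load on the whole file.
tweets = [json.loads(line) for line in sample_file if line.strip()]

# Plain DataFrame: each nested object stays packed inside a single column.
df1 = pd.DataFrame(tweets)
print(df1.columns.tolist())  # ['id', 'text', 'author']

# json_normalize expands nested objects into dotted column names.
df2 = pd.json_normalize(tweets)
print(df2["author.username"].tolist())  # ['alice', 'bob']
```

After normalizing, a single attribute is just a column selection such as df2["author.username"].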
NOTES:
If you can read each line of the flattened file and parse it as JSON, you can loop over the lines and read the fields you need from each parsed tweet.
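Something along these lines should work for the line-by-line approach. It is only a sketch: the two-line sample and the user/screen_name layout are assumptions made for illustration, and iter_tweets is a hypothetical helper name:

```python
import io
import json

# Hypothetical two-line sample standing in for the flattened file;
# the "user" / "screen_name" layout is an assumption.
flattened = io.StringIO(
    '{"id": 1, "user": {"screen_name": "alice"}}\n'
    '{"id": 2, "user": {"screen_name": "bob"}}\n'
)


def iter_tweets(lines):
    """Yield one parsed tweet per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Pull a single attribute out of every tweet without building a dataframe.
screen_names = [tweet["user"]["screen_name"] for tweet in iter_tweets(flattened)]
print(screen_names)  # ['alice', 'bob']
```

For line-delimited files, pd.read_json(path, lines=True) may also be worth trying as a way to get a dataframe directly.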
If you are able to build a dataframe, and the dataframe has a user column (whose values should be dictionaries), you can pull the screen_name out of each dictionary.
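One way to do that lookup, sketched on a tiny made-up dataframe (the column names id and user and the sample values are assumptions):

```python
import pandas as pd

# Hypothetical dataframe whose "user" column holds one dict per tweet.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "user": [{"screen_name": "alice"}, {"screen_name": "bob"}],
    }
)

# Look up the key in each dict. Rows that are not dicts (for example NaN)
# or that lack the key end up as missing values instead of raising.
df["screen_name"] = df["user"].apply(
    lambda user: user.get("screen_name") if isinstance(user, dict) else None
)
print(df["screen_name"].tolist())  # ['alice', 'bob']
```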
A similar lookup gets the FIRST screen name among a tweet's user mentions. User mentions is a list, so this takes the first element of the list. It’s a bit more complicated to get all of them when there’s a list, but at least you can get a feel for how to navigate the dataframe.
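As a rough sketch of both cases, assuming the mentions live under an entities column as a user_mentions list of dictionaries (that layout, the sample data, and the helper names first_mention and all_mentions are assumptions, not taken from the file in the question):

```python
import pandas as pd

# Hypothetical dataframe: each "entities" dict carries a "user_mentions" list.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "entities": [
            {"user_mentions": [{"screen_name": "bob"}, {"screen_name": "carol"}]},
            {"user_mentions": []},
            {"user_mentions": [{"screen_name": "dave"}]},
        ],
    }
)


def get_mentions(entities):
    """Return the list of mention dicts, or an empty list if unavailable."""
    if isinstance(entities, dict):
        return entities.get("user_mentions") or []
    return []


def first_mention(entities):
    """Return the first mentioned screen name, or None when there is none."""
    mentions = get_mentions(entities)
    return mentions[0]["screen_name"] if mentions else None


def all_mentions(entities):
    """Return every mentioned screen name as a list (possibly empty)."""
    return [mention["screen_name"] for mention in get_mentions(entities)]


df["first_mention"] = df["entities"].apply(first_mention)
df["all_mentions"] = df["entities"].apply(all_mentions)
print(df[["id", "first_mention", "all_mentions"]])
```

Tweets without mentions get a missing value in first_mention and an empty list in all_mentions, so neither lookup raises on an empty list.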