
I have a JSON file (tweetObject.json) including ~600 lines where each line is a response from the Twitter API, which contains 100 tweets or so along with their metadata.

My Questions:

  1. How can I extract specific tweet attributes (e.g., username) from my JSON file? I was thinking about loading the JSON into a pandas dataframe where each column stores exactly one attribute/field and then selecting the attribute I need, but I’m open to any other solution as well.
  2. How can I load the JSON file into a pandas dataframe? I used json.load, but I got JSONDecodeError: Extra data: line 2 column 1 (char 173419). After some research, I found that the likely cause is that json.load expects a single JSON document and cannot decode several JSON objects in a row.
  3. I also have a flattened version of my JSON file which keeps one tweet per line. I tried json.load with this file too, but I still get the same error.

Here is my code to load the json:

import json

with open('tweetObject_v2.json') as json_file:
    data_list = json.load(json_file)

Sorry, I didn’t include a sample of the tweet object JSON because even one line of this file was too long. You can find a sample Twitter API (v2) response here: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/example-payloads

2 Answers


  1. Chosen as BEST ANSWER

    Ok, I finally figured it out. Hope it helps others with similar questions.

    Answer to question 2: How to load a JSON file with multiple JSON objects into a pandas dataframe?

    Since json.load doesn't decode multiple JSON objects, I looped through the file, parsed each line with json.loads, and stored the results in a list. Next, I converted the list to a pandas dataframe:

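    import json
    import pandas as pd

    tweets = []
    # Each line of the flattened file is one complete JSON object (one tweet).
    with open('tweetsFlatten.json', 'r') as f:
        for line in f:
            tweets.append(json.loads(line))

    df1 = pd.DataFrame(tweets)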
    

    Checking the columns of df1, I found that some columns contain arrays or objects (i.e., there was no separate column for each attribute). For example, the column author holds the whole author object, which includes id (user id), username, etc.
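The behaviour described above can be reproduced on a tiny in-memory sample. This is only a sketch: the two sample tweets and the name `SAMPLE` are made up for illustration and merely imitate the one-tweet-per-line layout of the flattened file.

```python
import io
import json

import pandas as pd

# Made-up two-line sample standing in for the flattened file (one tweet per line).
SAMPLE = (
    '{"id": "1", "text": "hello", "author": {"id": "10", "username": "alice"}}\n'
    '{"id": "2", "text": "world", "author": {"id": "11", "username": "bob"}}\n'
)

# Parsing the whole text as one JSON document fails, as in the question.
try:
    json.loads(SAMPLE)
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg  # 'Extra data'

# Parsing one line at a time works; blank lines are skipped.
tweets = [json.loads(line) for line in io.StringIO(SAMPLE) if line.strip()]
df1 = pd.DataFrame(tweets)

print(whole_file_error)
print(df1.columns.tolist())
# The nested author object stays a dict inside a single column.
print(type(df1.loc[0, "author"]))
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by the opened file object.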

    Answer to question 1: How to extract/access specific tweet attributes in a tweet JSON file?

    In order to be able to access specific attributes (e.g., username), I used json_normalize:

    df_new = pd.json_normalize(tweets)  # available as pd.json_normalize since pandas 1.0
    df_new.columns
    # df_new is a new dataframe where each attribute has a separate column.
    # For example, instead of a single `author` column holding various attributes (id, username, etc.), the new dataframe has a separate column for each of them (e.g., `author.id`, `author.username`).
    
    df_new['author.username'].head()
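
To see what json_normalize does to nested fields, here is a minimal sketch on made-up data. The two tweets are hypothetical and contain only the fields mentioned in this answer; real flattened tweets carry many more.

```python
import pandas as pd

# Hypothetical flattened tweets; only the fields needed for the illustration.
tweets = [
    {"id": "1", "text": "hello", "author": {"id": "10", "username": "alice"}},
    {"id": "2", "text": "world", "author": {"id": "11", "username": "bob"}},
]

# Nested dicts are expanded into dotted column names.
df_new = pd.json_normalize(tweets)
columns = df_new.columns.tolist()
usernames = df_new["author.username"].tolist()

# The separator can be changed, e.g. to get author_username instead.
df_underscore = pd.json_normalize(tweets, sep="_")

print(columns)
print(usernames)
print(df_underscore.columns.tolist())
```

Lists nested inside a tweet are not expanded by this call; they remain list-valued cells in a single column.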
    

    NOTES:

    • I used the Twitter API V2, so the response JSON file was in version 2 format.
    • I used the flattened version of the JSON file, because I found it easier to work with (e.g., to access specific attributes).
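For the unflattened file (one full API response per line), a similar approach might look like the sketch below. It assumes each response follows the v2 layout with a `data` list of tweets and, if the author_id expansion was requested, an `includes.users` list; the sample lines and the names `tweet_rows`, `user_rows`, and `merged` are made up for illustration.

```python
import io
import json

import pandas as pd

# Made-up sample: two responses, one per line, in the assumed v2 layout.
SAMPLE = (
    '{"data": [{"id": "1", "text": "hi", "author_id": "10"}, '
    '{"id": "2", "text": "yo", "author_id": "11"}], '
    '"includes": {"users": [{"id": "10", "username": "alice"}, '
    '{"id": "11", "username": "bob"}]}}\n'
    '{"data": [{"id": "3", "text": "hey", "author_id": "10"}], '
    '"includes": {"users": [{"id": "10", "username": "alice"}]}}\n'
)

tweet_rows, user_rows = [], []
for line in io.StringIO(SAMPLE):
    if not line.strip():
        continue
    response = json.loads(line)
    # Collect the tweets and the expanded user objects of every response.
    tweet_rows.extend(response.get("data", []))
    user_rows.extend(response.get("includes", {}).get("users", []))

tweets_df = pd.json_normalize(tweet_rows)
# The same user can appear in several responses, so keep one row per user id.
users_df = pd.json_normalize(user_rows).drop_duplicates("id").add_prefix("author.")

# Attach the author details to each tweet through its author_id.
merged = tweets_df.merge(users_df, how="left", left_on="author_id", right_on="author.id")
print(merged[["id", "author.username"]])
```

The merge yields the same `author.username` column that the flattened file provides directly, which is one reason the flattened file is easier to work with.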

  2. If you can read each line of the flattened file and parse it with json.loads, you can do this:

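    import json
    # `data` holds one line of the flattened file, i.e. the JSON text of a single tweet
    with open('tweetsFlatten.json') as f:
        data = f.readline()
    dataj = json.loads(data)
    dataj['author']['username']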
    

    Output

    'Megresistor'
    

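    If you are able to build a dataframe, and the dataframe has a user column (which should hold dictionaries), you can use this to pull out the screen_name. Note that user and screen_name are the v1.1-style field names; in the flattened v2 data used above, the equivalents are author and username.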

    df.user.str.get('screen_name')
    

    and this gets the FIRST screen name among a tweet's user mentions. user_mentions is a list, so this takes the first element of that list. It’s a bit more complicated to get all of them when there’s a list, but at least you can get a feel for how to navigate the dataframe.

    df.entities.str.get('user_mentions').str[0].str.get('screen_name')
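
One way to collect every mentioned screen name, rather than only the first, is sketched below. The sample rows and the helper `mentioned_names` are made up for illustration and assume the same `entities` / `user_mentions` / `screen_name` layout as the line above.

```python
import pandas as pd

# Made-up rows using the entities / user_mentions / screen_name layout.
df = pd.DataFrame({
    "entities": [
        {"user_mentions": [{"screen_name": "alice"}, {"screen_name": "bob"}]},
        {"user_mentions": []},
        {"user_mentions": [{"screen_name": "carol"}]},
    ]
})


def mentioned_names(entities):
    """Return every mentioned screen name for one tweet (empty list if none)."""
    if not isinstance(entities, dict):
        return []
    return [m["screen_name"] for m in entities.get("user_mentions", [])]


# One list of screen names per tweet.
df["mentions"] = df["entities"].apply(mentioned_names)

# One row per (tweet, mention); tweets without mentions become NaN.
exploded = df["mentions"].explode()

print(df["mentions"].tolist())
print(exploded.dropna().tolist())
```

Keeping the list column preserves one row per tweet, while the exploded form is handier for counting how often each account is mentioned.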
    