
This is a follow-up to my post "Using Python to Scrape Nested Divs and Spans in Twitter?".

I’m not using the Twitter API because its search does not return tweets by
hashtag this far back. The complete code and output follow the examples below.

I want to scrape specific data from each tweet. name and handle are retrieving exactly what I’m looking for, but I’m having trouble narrowing down the rest of the elements.

As an example:

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

Retrieves this:

 <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
 <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

For url, I only need the href value from the first line.

Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.

How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?

I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.

Complete Code:

 from bs4 import BeautifulSoup
 import requests
 import sys

 url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
 headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
 r = requests.get(url, headers=headers)
 data = r.text.encode('utf-8')
 soup = BeautifulSoup(data, "html.parser")

 name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
 username = name[0].contents[0]

 handle = soup('span', {'class': 'username js-action-profile-name'})
 userhandle = handle[0].contents[1].contents[0]

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

 messagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
 message = messagetext[0]

 retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
 retweetcount = retweets[0]

 favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
 favcount = favorites[0]

 print(username, "\n", "@", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount)  # extra line breaks for ease of reading

Complete Output:

Michael Peel

@Mikepeeljourno

<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>

<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>

<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>

It was suggested that "BeautifulSoup – extracting attribute values" may already answer this question. However, I think that question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup documentation is helpful, though: http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags

2 Answers


  1. Use dictionary-like access to the Tag’s attributes.

    For example, to get the href attribute value:

    links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
    url = links[0]["href"]
    

    Or, if you need to get the href values for every link found:

    links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
    urls = [link["href"] for link in links]
    

    As a side note, you don’t need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:

    soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
    

    You may use:

    soup('a', {'class': 'tweet-timestamp'})
    

    Or, a CSS selector:

    soup.select("a.tweet-timestamp")
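As a rough, self-contained sketch of the points above: the snippet below parses a shortened copy of the anchor markup from the question (the `html` string is sample input, not a live request) and reads the href using both a single-class search and a CSS selector. `Tag.get()` is shown as an alternative to `tag["attr"]` that returns None rather than raising when the attribute is absent; the `data-missing` attribute name is made up purely to demonstrate that.

```python
from bs4 import BeautifulSoup

# Sample input: a shortened copy of the anchor from the question's output.
html = (
    '<a class="tweet-timestamp js-permalink js-nav js-tooltip" '
    'href="/Mikepeeljourno/status/648787700980408320" '
    'title="2:13 AM - 29 Sep 2015">'
    '<span class="_timestamp js-short-timestamp ">29 Sep 2015</span></a>'
)
soup = BeautifulSoup(html, "html.parser")

# Matching on one class is enough; both calls find the same element.
links = soup('a', {'class': 'tweet-timestamp'})
selected = soup.select("a.tweet-timestamp")

# Dictionary-style access returns the attribute value as a string.
urls = [link["href"] for link in links]

# .get() returns None for a missing attribute instead of raising KeyError.
title = links[0].get("title")
missing = links[0].get("data-missing")  # hypothetical attribute, not in the markup

print(urls, title, missing)
```

Using `.get()` can be the safer choice when looping over many elements, since a single element without the attribute will not stop the loop.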
    
  2. alecxe has already explained how to use the 'href' key to get the value.

    So I’m going to answer the other part of your questions:

    Similarly, the retweets and favorites commands return large chunks of
    html, when all I really need is the numerical value that is displayed
    for each one.

    .contents returns a list of a tag's direct children, including the whitespace text nodes between tags. Each button you are finding has several nested children, so you can walk down to the count by index through the parsed contents lists:

    retweetcount = retweets[0].contents[3].contents[1].contents[1].string
    

    This will return the value 4.

    If you want a rather more readable approach, try this:

    retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
    
    favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
    

    These return 4 and 2 respectively (as strings).
    This works because indexing the ResultSet returned by soup() or find_all() with [0] gives a single Tag, and calling find_all() on that Tag searches recursively through all of its descendants.

    Now you can loop across each tweet and extract this information rather easily.
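A minimal sketch of such a loop, under stated assumptions: the per-tweet wrapper `<div class="tweet">` is an assumption (the question does not show which element wraps each tweet), the helper names `count_for` and `extract_tweet` are made up for illustration, and the second tweet in the sample `html` is placeholder data added only to show what happens when a count is missing.

```python
from bs4 import BeautifulSoup

# Sample input. The <div class="tweet"> wrapper is an ASSUMPTION, and the
# second tweet is placeholder data, not real output.
html = '''
<div class="tweet">
<a class="tweet-timestamp js-permalink" href="/Mikepeeljourno/status/648787700980408320">29 Sep 2015</a>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet">
<span class="ProfileTweet-actionCountForPresentation">4</span></button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite">
<span class="ProfileTweet-actionCountForPresentation">2</span></button>
</div>
<div class="tweet">
<a class="tweet-timestamp js-permalink" href="/example_user/status/1">placeholder</a>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet"></button>
</div>
'''

def count_for(tweet, action_class):
    """Return the displayed count text for one action button, or None."""
    button = tweet.find('button', class_=action_class)
    if button is None:
        return None
    span = button.find('span', class_='ProfileTweet-actionCountForPresentation')
    return span.get_text(strip=True) if span is not None else None

def extract_tweet(tweet):
    """Collect the permalink and counts, searching only inside this tweet."""
    link = tweet.find('a', class_='tweet-timestamp')
    return {
        "url": link.get("href") if link is not None else None,
        "retweets": count_for(tweet, 'js-actionRetweet'),
        "favorites": count_for(tweet, 'js-actionFavorite'),
    }

soup = BeautifulSoup(html, "html.parser")
rows = [extract_tweet(tweet) for tweet in soup.select("div.tweet")]
for row in rows:
    print(row)
```

Searching inside each tweet element, rather than indexing several page-wide result lists in parallel, keeps the values aligned even when one tweet lacks a button or a count.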
