This is a follow up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.
I’m not using the Twitter API because it doesn’t look at the tweets by
hashtag this far back. Complete code and output are below after examples.
I want to scrape specific data from each tweet. name
and handle
are retrieving exactly what I’m looking for, but I’m having trouble narrowing down the rest of the elements.
As an example:
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
For url, I only need the href
value from the first line.
Similarly, the retweets
and favorites
commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?
I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code:
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "n", "@", userhandle, "n", "n", url, "n", "n", message, "n", "n", retweetcount, "n", "n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
@Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that BeautifulSoup – extracting attribute values may have an answer to this question there. However, I think the question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup Documentation is helpful though, http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
2
Answers
Use the dictionary-like access to the
Tag
‘s attributes.For example, to get the
href
attribute value:Or, if you need to get the
href
values for every link found:As a side note, you don’t need to specify the complete
class
value to locate elements.class
is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:You may use:
Or, a CSS selector:
Alecxe already explained to use the ‘href’ key to get the value.
So I’m going to answer the other part of your questions:
.contents returns a list of all the children. Since you’re finding ‘buttons’ which has several children you’re interested in, you can just get them from the following parsed content list:
This will return the value
4
.If you want a rather more readable approach, try this:
This returns
4
and2
respectively.This works because we convert the ResultSet returned by soup/find_all and get the tag element (using [0]) and recursively find across all it’s descendants again using find_all().
Now you can loop across each tweet and extract this information rather easily.