I’m working on a Python web scraper to collect information for a project. I’m currently using it on Twitter, because I found that the Twitter API wouldn’t return anything older than a week. The code I’m using is:
import urllib
import urllib.request
from bs4 import BeautifulSoup as soup
# Search results page for "australian megafauna"
my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'
# Download the page and parse it with the built-in HTML parser
page_html = urllib.request.urlopen(my_url)
page_soup = soup(page_html, "html.parser")
print(page_soup.title.text)
# Print the text of every <p> whose class includes TweetTextSize
for tweet in page_soup.findAll('p', {'class': 'TweetTextSize'}, lang='en'):
    print(tweet.text)
From my understanding, the attribute part of findAll can use a colon to act like a LIKE function (a partial match), and that seems to work okay. The specific part of the HTML I’m targeting with ‘findAll’ is:
<p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0"></p>
I’ve looked through the other tweets and they all seem to use this class, but I cannot figure out why my code only returns 1 tweet. The strange thing is that it’s not even the first tweet (it’s the second).
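To check my understanding of the class matching, here is a minimal sketch run against a made-up HTML snippet (the sample markup and tweet text are my own assumptions, not the real Twitter page). As far as I can tell, BeautifulSoup treats class as a multi-valued attribute, so a single class name matches any element that carries it among several:

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the tweet paragraphs (an assumption,
# not the HTML Twitter actually returns).
sample_html = """
<div>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en">first tweet</p>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en">second tweet</p>
  <p class="other-text" lang="en">not a tweet</p>
</div>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# 'TweetTextSize' is compared against each individual class on the element,
# so both tweet paragraphs match even though they carry three classes.
matches = page_soup.find_all("p", {"class": "TweetTextSize"}, lang="en")
texts = [p.text for p in matches]
print(texts)
```

Since both paragraphs are found here, I suspect the selector itself is fine and the difference lies in the HTML that the request actually returns.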
If someone could point me in the right direction that’d be great. Thanks.
PS: I’d also like to ask whether there is a way to grab ALL the tweets. When browsing through the HTML, I found a class called “stream-container” with an attribute ‘data-min-position’ whose value changes whenever you scroll down and load new tweets. I suspect that even if my code did work, it would only grab tweets from the initial page and not ALL the results of the search. Thanks.
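For reference, this is roughly how that attribute could be read out of the page, as a sketch using only the standard library. The sample markup and the cursor value are made up for illustration, and I don't know how the value would need to be sent back to request the next batch of tweets:

```python
from html.parser import HTMLParser


class MinPositionParser(HTMLParser):
    """Records data-min-position from the first 'stream-container' element."""

    def __init__(self):
        super().__init__()
        self.min_position = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "stream-container" in classes and self.min_position is None:
            self.min_position = attr_map.get("data-min-position")


# Hypothetical markup; the real page and its cursor format may differ.
sample = (
    '<div class="stream-container" data-min-position="cursor-123">'
    "<p>tweet</p></div>"
)

parser = MinPositionParser()
parser.feed(sample)
print(parser.min_position)
```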
Edit: I noticed the URL already includes lang=en, so the lang=’en’ filter in my code is a little redundant, but it doesn’t seem to affect the result at all.
2 Answers
Thanks for all the help. I still haven't figured out why my urllib request was giving me an incomplete version of the page HTML. However, I've found a workaround using selenium, as @ksai suggested.
Here's what it looks like:
Web Scraper
I had absolutely no idea how selenium worked, so I just borrowed someone else's solution for scrolling: How to scroll to the end of the page using selenium in python
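The scrolling idea is roughly the following. This is a sketch of the general pattern rather than the exact code from that answer: keep scrolling to the bottom until the page height stops growing. The function name, the pause length, and the round limit are my own choices; driver is assumed to be a selenium WebDriver, and execute_script is the only method used on it.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops changing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Jump to the bottom so the page loads the next batch of tweets.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume we hit the end
        last_height = new_height
    return rounds
```

After the loop finishes, driver.page_source can be handed to BeautifulSoup in place of the urllib response. The max_rounds cap is there so that a very long search result cannot keep the loop running indefinitely.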
@ksai, would there have been an alternate way you would've done it?
I'm planning to just store the tweets in a CSV file as text. Is there a particular format you would use if you were planning to train a bot on them?
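For the storage part, this is a minimal sketch of what I have in mind, using the standard csv module. The single "text" column and the function names are my own assumptions; I don't know yet what layout a training pipeline would expect.

```python
import csv


def save_tweets(path, tweets):
    """Write one tweet per row under a single 'text' header."""
    # newline="" lets the csv module handle line endings itself, so tweets
    # containing commas, quotes, or line breaks are quoted correctly.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        for text in tweets:
            writer.writerow([text])


def load_tweets(path):
    """Read the tweets back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["text"] for row in csv.DictReader(f)]
```

Calling save_tweets("tweets.csv", texts) with the list of scraped strings would then produce a file that load_tweets can read back unchanged.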
Thanks
Try this:
It should work.
With python3 you can do this: