
I’m working on a Python web scraper to grab information for a project I’m doing. I’m using it on Twitter at the moment, as I found the Twitter API won’t return anything older than a week. The code I’m using is:

import urllib.request
from bs4 import BeautifulSoup as soup

my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

page_html = urllib.request.urlopen(my_url)
page_soup = soup(page_html, "html.parser")

print(page_soup.title.text)

for tweet in page_soup.findAll('p', {'class': 'TweetTextSize'}, lang='en'):
    print(tweet.text)

From my understanding, the attribute dictionary passed to findAll acts like a partial (SQL LIKE-style) match against the element’s class list, and that seems to work okay. The specific part of the HTML I’m targeting with findAll is:

<p class="TweetTextSize  js-tweet-text tweet-text" lang="en" data-aria-label-part="0"></p>
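(For reference, this partial match works because BeautifulSoup treats class as a multi-valued attribute: passing a single token such as 'TweetTextSize' matches any element whose class list contains that token. A minimal offline sketch, using an inline stand-in for the markup rather than the live Twitter page:)

```python
from bs4 import BeautifulSoup

# Inline stand-in for the markup in question (not the live page).
html = '''
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">first tweet</p>
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">second tweet</p>
<p class="somethingElse" lang="en">not a tweet</p>
'''

page_soup = BeautifulSoup(html, "html.parser")

# 'class' is multi-valued, so matching on the single token 'TweetTextSize'
# finds any element whose class list contains that token.
tweets = page_soup.find_all('p', {'class': 'TweetTextSize'}, lang='en')
for tweet in tweets:
    print(tweet.text)
```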

Now I’ve looked through the other tweets and they all seem to use this class, so I cannot figure out why it returns only one tweet. Stranger still, it’s not even the first tweet (it’s the second).

If someone could point me in the right direction that’d be great. Thanks.

PS: I’d also like to ask whether there’s a way to grab ALL the tweets. Browsing through the HTML, I found a class called “stream-container” with a ‘data-min-position’ attribute that changes whenever you scroll down and load new tweets. I suspect that even if my code did work, it would only see the results on the initial page rather than ALL of them. Thanks.

Edit: I noticed my URL already contains lang=en, so also passing lang=’en’ to findAll is a little redundant, but it doesn’t seem to affect anything.

2 Answers


  1. Chosen as BEST ANSWER

    Thanks for all the help. I still haven't figured out why urllib.request was giving me an incomplete version of the page HTML, but I've found a workaround using Selenium, as @ksai suggested.

    Here's what it looks like:

    Web Scraper

    import time
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver

    myurl = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

    driver = webdriver.Firefox()
    driver.get(myurl)

    # Scroll automation using selenium: keep scrolling to the bottom
    # until the page height stops growing, i.e. no more tweets load.
    scroll_script = ("window.scrollTo(0, document.body.scrollHeight);"
                     "var lenOfPage=document.body.scrollHeight;"
                     "return lenOfPage;")
    lenOfPage = driver.execute_script(scroll_script)
    match = False
    while not match:
        lastCount = lenOfPage
        time.sleep(3)  # give newly loaded tweets time to appear
        lenOfPage = driver.execute_script(scroll_script)
        if lastCount == lenOfPage:
            match = True

    page_html = driver.page_source
    page_soup = soup(page_html, "html.parser")

    print(page_soup.title.text)
    for tweet in page_soup.findAll('p', {'class': 'tweet-text'}, lang='en'):
        print(tweet.text)


    So I had absolutely no idea how selenium worked so I just appropriated someone else's solution for scrolling: How to scroll to the end of the page using selenium in python
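    The while loop above is really just "scroll, wait, re-measure, stop when the height stops changing." That pattern can be factored into a small framework-free helper; scroll_once and get_height below are hypothetical callables (with Selenium they would wrap driver.execute_script), not Selenium API:

```python
import time

def scroll_until_stable(scroll_once, get_height, pause=3.0, max_rounds=100):
    """Repeatedly scroll and wait `pause` seconds for new content,
    until the measured page height stops changing (or max_rounds is hit)."""
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)
        height = get_height()
        if height == last_height:
            return height  # nothing new loaded; we are at the bottom
        last_height = height
    return last_height
```

    With Selenium you could pass lambda wrappers, e.g. scroll_until_stable(lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), lambda: driver.execute_script("return document.body.scrollHeight;")).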

    @ksai, would there have been an alternate way you would've done it?

    I'm planning to just store the tweets in a csv file as text, would there be a format if you were planning to use it to train a bot?
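    On the CSV point: the standard-library csv module takes care of quoting commas, quotes and newlines inside tweet text for you. A minimal sketch, assuming tweets is the list of scraped strings (the sample data here is made up):

```python
import csv

# Hypothetical scraped data; in the scraper this would be
# [tweet.text for tweet in page_soup.findAll(...)]
tweets = ["First tweet, with a comma", "Second tweet"]

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])       # header row
    for text in tweets:
        writer.writerow([text])     # csv handles quoting/escaping
```

    A plain one-column file like this is easy to load back later with csv.reader (or pandas) when you get to the training step.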

    Thanks


  2. Try this (note that urllib.urlopen only exists in Python 2):

      my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

      page_html = urllib.urlopen(my_url).read()


    It should work.
    With Python 3 you can do this:

    import urllib.request
    with urllib.request.urlopen(my_url) as f:
        page_html = f.read()
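    If you want to see the Python 3 pattern run without hitting the network, urllib.request can also open data: URLs, which carry their own payload (demo_url below is just a stand-in for a real page URL):

```python
import urllib.request

# A data: URL serves its own payload, so this runs offline;
# in the real scraper the URL would be the Twitter search URL.
demo_url = "data:text/html,<title>demo page</title>"

with urllib.request.urlopen(demo_url) as f:
    page_html = f.read().decode("utf-8")

print(page_html)
```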
    