
I’m working on a Python web scraper to grab information for a project I’m doing. I’m using it on Twitter at the moment, as I found the Twitter API won’t return anything older than a week. The code I’m using is:

import urllib.request
from bs4 import BeautifulSoup as soup

my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

page_html = urllib.request.urlopen(my_url)
page_soup = soup(page_html, "html.parser")

print(page_soup.title.text)

for tweet in page_soup.findAll('p', {'class': 'TweetTextSize'}, lang='en'):
    print(tweet.text)

From my understanding, the attribute dictionary passed to findAll acts like a partial (SQL LIKE-style) match against the element’s class list, and that seems to work okay. The specific part of the HTML I’m targeting with findAll is:

<p class="TweetTextSize  js-tweet-text tweet-text" lang="en" data-aria-label-part="0"></p>
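(For reference, this partial match works because BeautifulSoup treats class as a multi-valued attribute: passing a single token such as 'TweetTextSize' matches any element whose class list contains that token. A minimal offline sketch, using an inline stand-in for the markup rather than the live Twitter page:)

```python
from bs4 import BeautifulSoup

# Inline stand-in for the markup in question (not the live page).
html = '''
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">first tweet</p>
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">second tweet</p>
<p class="somethingElse" lang="en">not a tweet</p>
'''

page_soup = BeautifulSoup(html, "html.parser")

# 'class' is multi-valued, so matching on the single token 'TweetTextSize'
# finds any element whose class list contains that token.
tweets = page_soup.find_all('p', {'class': 'TweetTextSize'}, lang='en')
for tweet in tweets:
    print(tweet.text)
```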

Now I’ve looked through the other tweets and they all seem to use this class, so I cannot figure out why it returns only one tweet. Stranger still, it’s not even the first tweet (it’s the second).

If someone could point me in the right direction that’d be great. Thanks.

PS: I’d also like to ask whether there’s a way to grab ALL the tweets. Browsing through the HTML, I found a class called “stream-container” with a ‘data-min-position’ attribute that changes whenever you scroll down and load new tweets. I suspect that even if my code did work, it would only see the results on the initial page rather than ALL of them. Thanks.

Edit: I noticed my URL already contains lang=en, so also passing lang=’en’ to findAll is a little redundant, but it doesn’t seem to affect anything.

2 Answers


  1. Chosen as BEST ANSWER

    Thanks for all the help. I still haven't figured out why urllib.request was giving me an incomplete version of the page HTML, but I've found a workaround using Selenium, as @ksai suggested.

    Here's what it looks like:

    Web Scraper

    import time
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver

    myurl = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

    driver = webdriver.Firefox()
    driver.get(myurl)

    # Scroll automation using selenium: keep scrolling to the bottom
    # until the page height stops growing, i.e. no more tweets load.
    scroll_script = ("window.scrollTo(0, document.body.scrollHeight);"
                     "var lenOfPage=document.body.scrollHeight;"
                     "return lenOfPage;")
    lenOfPage = driver.execute_script(scroll_script)
    match = False
    while not match:
        lastCount = lenOfPage
        time.sleep(3)  # give newly loaded tweets time to appear
        lenOfPage = driver.execute_script(scroll_script)
        if lastCount == lenOfPage:
            match = True

    page_html = driver.page_source
    page_soup = soup(page_html, "html.parser")

    print(page_soup.title.text)
    for tweet in page_soup.findAll('p', {'class': 'tweet-text'}, lang='en'):
        print(tweet.text)


    So I had absolutely no idea how selenium worked so I just appropriated someone else's solution for scrolling: How to scroll to the end of the page using selenium in python
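    The while loop above is really just "scroll, wait, re-measure, stop when the height stops changing." That pattern can be factored into a small framework-free helper; scroll_once and get_height below are hypothetical callables (with Selenium they would wrap driver.execute_script), not Selenium API:

```python
import time

def scroll_until_stable(scroll_once, get_height, pause=3.0, max_rounds=100):
    """Repeatedly scroll and wait `pause` seconds for new content,
    until the measured page height stops changing (or max_rounds is hit)."""
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)
        height = get_height()
        if height == last_height:
            return height  # nothing new loaded; we are at the bottom
        last_height = height
    return last_height
```

    With Selenium you could pass lambda wrappers, e.g. scroll_until_stable(lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), lambda: driver.execute_script("return document.body.scrollHeight;")).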

    @ksai, would there have been an alternate way you would've done it?

    I'm planning to just store the tweets in a csv file as text, would there be a format if you were planning to use it to train a bot?
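    On the CSV point: the standard-library csv module takes care of quoting commas, quotes and newlines inside tweet text for you. A minimal sketch, assuming tweets is the list of scraped strings (the sample data here is made up):

```python
import csv

# Hypothetical scraped data; in the scraper this would be
# [tweet.text for tweet in page_soup.findAll(...)]
tweets = ["First tweet, with a comma", "Second tweet"]

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])       # header row
    for text in tweets:
        writer.writerow([text])     # csv handles quoting/escaping
```

    A plain one-column file like this is easy to load back later with csv.reader (or pandas) when you get to the training step.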

    Thanks


  2. Try this (note that urllib.urlopen only exists in Python 2):

      my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

      page_html = urllib.urlopen(my_url).read()


    It should work.
    With Python 3 you can do this:

    import urllib.request
    with urllib.request.urlopen(my_url) as f:
        page_html = f.read()
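    If you want to see the Python 3 pattern run without hitting the network, urllib.request can also open data: URLs, which carry their own payload (demo_url below is just a stand-in for a real page URL):

```python
import urllib.request

# A data: URL serves its own payload, so this runs offline;
# in the real scraper the URL would be the Twitter search URL.
demo_url = "data:text/html,<title>demo page</title>"

with urllib.request.urlopen(demo_url) as f:
    page_html = f.read().decode("utf-8")

print(page_html)
```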
    