I’m trying to scrape new tweets from a Twitter account using Selenium (I’m not sure Selenium is the best way to do this). My script logs into Twitter, navigates to the user’s profile, and captures the latest tweets. While Selenium works to an extent, I’ve encountered issues with bot detection and page blocking after multiple refreshes.
Although I’ve written the code using Selenium, I’m happy to explore other methods (like BeautifulSoup, Scrapy, or any other Python library) if they can achieve my goal more effectively and minimize detection.
My Goal
The script should:
- Log in to Twitter automatically.
- Navigate to a specific user’s profile (in this case, Fabrizio Romano: https://twitter.com/FabrizioRomano).
- Capture and print the latest tweets from the page.
- Avoid printing the same tweet multiple times, even if it reappears after newer tweets are deleted.
- Exclude pinned tweets and reposts (retweets).
Here’s the script I wrote using Selenium:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import configparser
import time
import random


class TwitterScraper:
    def __init__(self, proxy=None, user_agent=None):
        # Configure undetected ChromeDriver with optional proxy and user-agent
        options = uc.ChromeOptions()
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")
        if user_agent:
            options.add_argument(f"user-agent={user_agent}")
        options.add_argument("--headless")
        options.add_argument("--disable-blink-features=AutomationControlled")
        options.add_argument("--disable-infobars")
        options.add_argument("--disable-extensions")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--lang=en-US")
        options.add_argument("--window-size=1920,1080")
        self.driver = uc.Chrome(options=options)
        self.printed_tweets = set()  # Track printed tweets

    def login(self):
        try:
            print("Navigating to Twitter login page...")
            # Load credentials from Account.ini
            config = configparser.ConfigParser()
            config.read(r"C:\Users\Gaming\Documents\Python Tweets\Account.ini")
            email = config.get("x", "email", fallback=None)
            username_value = config.get("x", "username", fallback=None)
            password_value = config.get("x", "password", fallback=None)
            if not email or not password_value or not username_value:
                raise ValueError("Email, username, or password missing in Account.ini")

            self.driver.get("https://twitter.com/i/flow/login")
            time.sleep(3)  # Wait for the page to load

            # Enter email
            email_field = WebDriverWait(self.driver, 15).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'input[autocomplete="username"]'))
            )
            email_field.send_keys(email)
            email_field.send_keys("\n")
            time.sleep(3)

            # Enter username if prompted
            try:
                username_field = WebDriverWait(self.driver, 5).until(
                    EC.visibility_of_element_located((By.CSS_SELECTOR, 'input[data-testid="ocfEnterTextTextInput"]'))
                )
                username_field.send_keys(username_value)
                username_field.send_keys("\n")
                time.sleep(3)
            except:
                print("No additional username prompt detected.")

            # Enter password
            password_field = WebDriverWait(self.driver, 15).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'input[name="password"]'))
            )
            password_field.send_keys(password_value)
            password_field.send_keys("\n")
            time.sleep(5)
            print("Login successful.")
        except Exception as e:
            print(f"Error during login: {e}")
            self.restart()

    def navigate_to_page(self, username):
        try:
            print(f"Navigating to @{username}'s Twitter page...")
            user_url = f"https://twitter.com/{username}"
            self.driver.get(user_url)
            time.sleep(random.uniform(3, 5))
        except Exception as e:
            print(f"Error navigating to @{username}'s page: {e}")

    def get_recent_tweets(self, num_tweets=3):
        try:
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'article[role="article"]'))
            )
            tweets = self.driver.find_elements(By.CSS_SELECTOR, 'article[role="article"]')
            recent_tweets = []
            for tweet in tweets:
                # Skip pinned tweets and retweets
                if (tweet.find_elements(By.CSS_SELECTOR, 'svg[aria-label="Pinned Tweet"]') or
                        tweet.find_elements(By.CSS_SELECTOR, 'svg[aria-label="Retweet"]')):
                    continue
                # Fetch tweet text
                try:
                    tweet_text = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="tweetText"]').text.strip()
                except:
                    continue
                # Fetch timestamp
                try:
                    time_element = tweet.find_element(By.XPATH, './/time')
                    timestamp = time_element.get_attribute("datetime")
                except:
                    continue
                recent_tweets.append((timestamp, tweet_text))
                if len(recent_tweets) >= num_tweets:
                    break
            return recent_tweets
        except Exception as e:
            print(f"Error fetching tweets: {e}")
            return []

    def start(self, username):
        self.login()
        self.navigate_to_page(username)
        while True:
            recent_tweets = self.get_recent_tweets(num_tweets=3)
            if recent_tweets:
                # Get the most recent tweet (first one in the list)
                newest_time, newest_tweet = recent_tweets[0]
                if newest_tweet not in self.printed_tweets:
                    print(f"[{newest_time}] {newest_tweet}")
                    self.printed_tweets.add(newest_tweet)
            time.sleep(5)  # Refresh every 5 seconds
            self.driver.refresh()

    def restart(self):
        print("Restarting browser...")
        self.driver.quit()
        self.__init__()  # Reinitialize the driver

    def quit(self):
        self.driver.quit()


# Usage
if __name__ == "__main__":
    scraper = TwitterScraper(proxy=None, user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    try:
        scraper.start(username="FabrizioRomano")
    except KeyboardInterrupt:
        print("Exiting...")
        scraper.quit()
Current Issues
My script scans the first three tweets after every refresh, but I couldn’t find a reliable way to exclude pinned tweets and retweets using CSS selectors alone. A separate problem is that previously printed tweets can reappear when newer tweets are deleted, and I want the script to guarantee that no tweet is printed more than once, even if it resurfaces later. For example, if the initial tweets are Tweet1, Tweet2, and Tweet3, the script should print Tweet1. After a refresh, if the tweets change to Tweet2, Tweet4, and Tweet5, the script should print Tweet2. If another refresh shows Tweet1, Tweet4, and Tweet5, the script should not reprint Tweet1. Additionally, after refreshing the page several times, Twitter blocks my view of the user’s posts and displays the message: "Something went wrong. Try reloading." (See attached screenshot.)
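For the duplicate problem, here is the kind of change I’m imagining — a rough, untested sketch that keys printed_tweets on the tweet’s status ID instead of its text, assuming each tweet article still wraps its timestamp in a permalink of the form /username/status/<id>:

# Rough sketch (untested): dedupe on the tweet's status ID rather than its text.
# Assumption: every tweet article contains a link like /<user>/status/<id>.
def get_tweet_id(tweet_element):
    try:
        link = tweet_element.find_element(By.XPATH, './/a[contains(@href, "/status/")]')
        return link.get_attribute("href").split("/status/")[-1].split("?")[0]
    except Exception:
        return None

# In start(), something like:
#     tweet_id = get_tweet_id(article)   # article = the tweet's WebElement
#     if tweet_id and tweet_id not in self.printed_tweets:
#         print(f"[{newest_time}] {newest_tweet}")
#         self.printed_tweets.add(tweet_id)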
What I’ve Tried
- Login Automation: The login script works fine, and I can navigate to the user’s page.
- Tweet Scanning: My script scans the first three tweets because I couldn’t find a reliable way to skip pinned tweets and retweets programmatically.
- Avoiding the Twitter API: I’ve avoided the API due to its free tier limitations and cost.
- Minimizing Bot Detection: I’ve added delays, randomized pauses, and tried various Selenium options to make the bot less detectable, but the issue persists after multiple page refreshes.
What I Need Help With
- Better Logic to Skip Pinned and Retweeted Tweets: Is there a reliable way to identify and exclude these tweets, ideally without relying solely on CSS selectors?
- Preventing Page Blocks by Twitter: How can I prevent Twitter from blocking my view of the user’s posts after several refreshes? Are there adjustments I can make to my Selenium script? Should I use proxies, rotate user agents, or implement other techniques to minimize detection?
- Exploring Alternatives to Selenium: Would tools like BeautifulSoup, Scrapy, or other Python libraries be better suited for scraping tweets without detection? I’m open to switching methods if there’s a more robust solution.
- General Advice on Avoiding Detection: What are the best practices for building a scraper that avoids detection when interacting with sites like Twitter?
Additional Context
- I’m using Selenium for this project but am open to exploring alternatives. My main objective is to scrape and print tweets reliably, ensuring no duplicates, while avoiding detection or blocking.
- I understand Twitter’s terms of service regarding scraping and aim to use this responsibly.
Here is the code I made for rotating
Any guidance on improving the current approach or switching to a better method would be greatly appreciated!
2 Answers
CSS selectors are still the way to check whether an element is pinned or is a retweet. Inspect the site for more reliable selectors: look at the cases where your current selector fails and work out new ones from the live DOM.
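For example, instead of matching the SVGs’ aria-label (which changes with the UI language, and with X renaming "Retweet" to "repost"), you can check the banner rendered above pinned tweets and reposts. A sketch, assuming X still uses a div[data-testid="socialContext"] element for that banner — verify the test id and wording in DevTools first:

def is_pinned_or_repost(tweet_element):
    # Assumption: pinned tweets and reposts carry a "social context" banner
    # ("Pinned" / "<Name> reposted") above the tweet body.
    banners = tweet_element.find_elements(By.CSS_SELECTOR, 'div[data-testid="socialContext"]')
    return any(
        keyword in banner.text.lower()
        for banner in banners
        for keyword in ("pinned", "reposted", "retweeted")
    )

In get_recent_tweets() you would then skip the article whenever this returns True.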
As for preventing the blocks entirely: generally speaking, you can’t. If your goal is to catch a new tweet within a few seconds of it being posted, that is mission impossible with scraping. Just use the API.
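If the API ever becomes an option for you despite the cost, the equivalent of your loop is only a few lines with tweepy. This is a sketch that assumes you have a bearer token with sufficient read access:

import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")   # placeholder token
user = client.get_user(username="FabrizioRomano")
resp = client.get_users_tweets(
    user.data.id,
    exclude=["retweets", "replies"],    # reposts are excluded server-side
    max_results=5,
    tweet_fields=["created_at"],
)
for tweet in resp.data or []:
    print(f"[{tweet.created_at}] {tweet.text}")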
Use proxies, rotate user agents (less important), and try undetected_chromedriver. Automate retries: catch the moment Twitter shows you "Something went wrong" and restart the browser, ideally with a new proxy and user agent. All of this makes it possible to catch a new tweet within a few minutes. But if you send requests to Twitter every few seconds, 24/7, they will block you at some point.
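A sketch of what that could look like on top of your TwitterScraper class — the proxy list, the user-agent list, and the exact "Something went wrong" text are assumptions you would need to fill in and verify:

import random

PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]   # placeholder pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def is_blocked(driver):
    # Assumption: the block page contains this literal phrase.
    return "Something went wrong" in driver.page_source

def restart_with_new_identity(scraper):
    # Tear down the blocked session and come back with a different proxy/UA.
    scraper.quit()
    new_scraper = TwitterScraper(
        proxy=random.choice(PROXIES),
        user_agent=random.choice(USER_AGENTS),
    )
    new_scraper.login()
    return new_scraper

Call is_blocked() after every refresh inside start() and swap in the new scraper whenever it returns True.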
BeautifulSoup doesn’t make requests; it only helps you parse the response. The requests themselves are usually made with the standard requests library. Scrapy is an alternative, but it is not much better at avoiding bot detection.
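To make the division of labour concrete: requests fetches the page, BeautifulSoup only parses whatever HTML you hand it. This generic sketch works for static pages, but not for twitter.com, whose timeline is rendered by JavaScript behind a login:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")        # plain HTTP fetch
soup = BeautifulSoup(response.text, "html.parser")    # parsing only, no requests
print(soup.title.get_text())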
Rotate proxies and, obviously, do not scrape from a cloud provider’s IP. Sites are much more likely to block requests that look like they come from a Linux OS, and do not send too many requests from the same IP.
P.S. Are you going to keep your computer online all the time to catch the new tweets, or do you plan to run the code from a server? The second option is problematic.
Have the program pause for a little while (say, a random interval between 5 s and 30 s) so that you look like a human.
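In your refresh loop that would mean replacing the fixed time.sleep(5) with something like:

# Sleep a random, human-looking interval before the next refresh.
time.sleep(random.uniform(5, 30))
self.driver.refresh()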