
I’m trying to scrape HTML data from Reddit while logged in, because the information I need appears only on the logged-in page, not on the logged-out one (see "find_elements_by_xpath does not work and returns an empty list").

I am using the following code to request login, assuming the login URL is https://www.reddit.com/login/.

import requests

username = "myuser"
password = "password"
payload = {
    "loginUsername": username,
    "loginPassword": password,
}
headers = {"User-Agent": "Mozilla/5.0"}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    s.headers.update(headers)
    # login_url = f"https://www.reddit.com/user/{username}"
    # print(login_url)
    p = s.post("https://www.reddit.com/login/", data=payload)
    # Print the HTML returned, or something more intelligent, to see if it's a successful login page.
    print(p.text)
    print(p.status_code)

However, the status code returned is 404 and I get the following for p.text:

<!DOCTYPE html>
<html lang="en-CA">
    <head>
        <title>
            
                reddit.com: Not found
            
        </title>

        <link rel="shortcut icon" type="image/png" sizes="512x512" href="https://www.redditstatic.com/accountmanager/favicon/favicon-512x512.png">
        <link rel="shortcut icon" type="image/png" sizes="192x192" href="https://www.redditstatic.com/accountmanager/favicon/favicon-192x192.png">
        <link rel="shortcut icon" type="image/png" sizes="32x32" href="https://www.redditstatic.com/accountmanager/favicon/favicon-32x32.png">
        <link rel="shortcut icon" type="image/png" sizes="16x16" href="https://www.redditstatic.com/accountmanager/favicon/favicon-16x16.png">
        <link rel="apple-touch-icon" sizes="180x180" href="https://www.redditstatic.com/accountmanager/favicon/apple-touch-icon-180x180.png">
        <link rel="mask-icon" href="https://www.redditstatic.com/accountmanager/favicon/safari-pinned-tab.svg" color="#5bbad5">
        
        <meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
        <meta name="msapplication-TileColor" content="#ffffff"/>
        <meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x310.png"/>
        <meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x150.png"/>
        <meta name="theme-color" content="#ffffff">
        
        

  <link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/vendor.4edfac426c2c4357e34e.css">

  <link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/theme.02a88d7effc337a0c765.css">


    </head>
    <body>
  <div class="Container m-desktop">
    <div class="PageColumns">
        
          
          <div class="PageColumn PageColumn__left">
          
            
<div class="Art"></div>

          </div>
        
        <div class="PageColumn PageColumn__right">
          
<div class="ColumnContainer">
  <div class="SnooIcon"></div>
  <h1 class="Title">404&mdash;Not found</h1>
  <p>
    The page you are looking for does not exist.
  </p>
</div>

        </div>
    </div>
</div>


        <script>
            //<![CDATA
                
                window.___r = {"config": {"tracker_endpoint": "https://events.reddit.com/v2", "tracker_key": "AccountManager3", "tracker_secret": "V2FpZ2FlMlZpZTJ3aWVyMWFpc2hhaGhvaHNoZWl3"}};
            //]]>
        </script>
        

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/vendor.33ac2d92b89a211b0483.js"></script>

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/theme.5333e8893b6d5b30d258.js"></script>

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/sentry.d25b8843def9b86b36ac.js"></script>


    </body>
</html>

I tried using login_url = f"https://www.reddit.com/user/{username}" as the login URL, but it still does not work.
I also tried https://www.reddit.com/login without the trailing slash; the status is 400 and p.text is empty.
I believe the username and password I entered are correct. Should the login URL be something different?

I noticed that at https://www.reddit.com/login, the form's action is as follows:

<form class="AnimatedForm" action="/login" method="post">

2 Answers


  1. The Python requests library is used for making HTTP requests to interact with web services, APIs (Application Programming Interfaces), and websites. It allows you to send and receive data over the internet using the HTTP protocol.

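    However, it does not execute JavaScript, which makes it a poor fit for scraping pages that depend on it. I advise using the Selenium WebDriver to log in to Reddit.

    Here is an example of Selenium WebDriver code that I used recently to log in to Instagram; the URL and element locators would need to be adapted for Reddit: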

    from time import sleep

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    # ---- Optional - added options to keep the webpage open ----
    options = webdriver.ChromeOptions()
    options.add_experimental_option("detach", True)
    driver = webdriver.Chrome(options=options)

    def login(driver, user_id, pwd):  # user_id = your username, pwd = your password
        driver.get("https://www.instagram.com/accounts/login/")
        sleep(2)
        try:
            user_name = driver.find_element(By.XPATH, '//*[@id="loginForm"]/div/div[1]/div/label/input')
        except NoSuchElementException:
            print("user_name element not found")
        else:
            user_name.send_keys(user_id, Keys.TAB)

        sleep(2)
        try:
            user_pwd = driver.find_element(By.XPATH, '//*[@id="loginForm"]/div/div[2]/div/label/input')
        except NoSuchElementException:
            print("password element not found")
        else:
            user_pwd.send_keys(pwd, Keys.ENTER)

        sleep(5)
        # find_elements returns an empty list instead of raising when nothing matches.
        buttons = driver.find_elements(By.CSS_SELECTOR, "button")
        not_now = [item for item in buttons if item.text == "Not Now"]
        if not_now:
            not_now[0].click()
        else:
            print("notification element not found")

    
  2. Gathering information

    If you inspect the network calls in your browser's developer tools, you’ll see that the login request sends the following form data:

    login_data = {
        "csrf_token": "<RANDOM_VALUE>",
        "otp": "",
        "password": "PASSWORD",  # your password
        "dest": "https://www.reddit.com",
        "username": "USERNAME",  # your username
    }
    

    The problem is that the csrf_token is dynamic and changes with every request, so it cannot be hard-coded. So, what do we do?

    Finding the csrf_token

    The csrf_token is available when sending a GET request to the page. So, you can use a library such as BeautifulSoup to extract the token.
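
To make the extraction step concrete, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages. The sample page, the token value "abc123", and the names CsrfTokenParser and extract_csrf_token are all hypothetical; the real login page markup may differ.

```python
from html.parser import HTMLParser
from typing import Optional


class CsrfTokenParser(HTMLParser):
    """Collects the value of the first <input name="csrf_token"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <input> element.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrf_token":
            self.token = attributes.get("value")


def extract_csrf_token(html: str) -> str:
    """Return the csrf_token value from the page, or raise if it is absent."""
    parser = CsrfTokenParser()
    parser.feed(html)
    if parser.token is None:
        raise ValueError("no csrf_token value found in the page")
    return parser.token


# Hypothetical stand-in for the login page; the real markup may differ.
SAMPLE_LOGIN_PAGE = """
<form class="AnimatedForm" action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

token = extract_csrf_token(SAMPLE_LOGIN_PAGE)
print(token)
```

Raising an explicit error when the token is missing makes it easier to notice when the page layout has changed, instead of failing later with a confusing lookup error.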

    Notes

    I found that you need to set the content-type header to application/x-www-form-urlencoded.
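
For reference, this is roughly what an application/x-www-form-urlencoded body looks like. The sketch uses the standard library's urllib.parse rather than requests, and every value in it is a placeholder ("abc123" is a made-up token). As far as I know, requests applies the same encoding itself when a dict is passed as data=.

```python
from urllib.parse import parse_qs, urlencode

# Placeholder values only; "abc123" is a made-up token.
login_data = {
    "csrf_token": "abc123",
    "otp": "",
    "password": "PASSWORD",
    "dest": "https://www.reddit.com",
    "username": "USERNAME",
}

# Form encoding joins key=value pairs with "&" and percent-encodes reserved characters.
body = urlencode(login_data)
print(body)

# Decoding the body recovers the original fields (blank values are kept on request).
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["dest"])
```

Note how the ":" and "/" characters in the dest URL are percent-encoded, while the empty otp field is still sent as "otp=".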

    Code example

    import requests
    from bs4 import BeautifulSoup
    
    LOGIN_URL = "https://www.reddit.com/login"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded",
    }
    
    login_data = {
        # "csrf_token" is added below, once it has been read from the login page
        "otp": "",
        "password": "PASSWORD",  # Replace with your Reddit password
        "dest": "https://www.reddit.com",
        "username": "USERNAME",  # Replace with your Reddit username
    }
    
    with requests.Session() as session:
        session.headers.update(headers)
    
        # Get the CSRF token from the hidden input on the login page
        response = session.get(LOGIN_URL)
        soup = BeautifulSoup(response.content, "html.parser")
        token_input = soup.find("input", {"name": "csrf_token"})
        if token_input is None:
            raise RuntimeError("csrf_token input not found on the login page")
        login_data["csrf_token"] = token_input["value"]

        # Perform login
        with session.post(LOGIN_URL, data=login_data) as response:
            print(response)
    
    