Html - using Python requests library to log into reddit

wkde
August 8, 2023

I’m trying to scrape HTML from Reddit while logged in, because the information I need appears only on the logged-in page, not on the logged-out one (where find_elements_by_xpath does not work and returns an empty list).

I am using the following code to request login, assuming the login URL is https://www.reddit.com/login/.

import requests

username = "myuser"
password = "password"
payload = {
    "loginUsername": username,
    "loginPassword": password,
}
headers = {"User-Agent": "Mozilla/5.0"}

# Use 'with' to ensure the session is closed after use.
with requests.Session() as s:
    s.headers.update(headers)
    # login_url = f"https://www.reddit.com/user/{username}"
    # print(login_url)
    p = s.post("https://www.reddit.com/login/", data=payload)
    # Print the HTML returned (or something more intelligent) to see whether the login succeeded.
    print(p.text)
    print(p.status_code)

However, the status code returned is 404 and I get the following for p.text:

<!DOCTYPE html>
<html lang="en-CA">
    <head>
        <title>
            
                reddit.com: Not found
            
        </title>

        <link rel="shortcut icon" type="image/png" sizes="512x512" href="https://www.redditstatic.com/accountmanager/favicon/favicon-512x512.png">
        <link rel="shortcut icon" type="image/png" sizes="192x192" href="https://www.redditstatic.com/accountmanager/favicon/favicon-192x192.png">
        <link rel="shortcut icon" type="image/png" sizes="32x32" href="https://www.redditstatic.com/accountmanager/favicon/favicon-32x32.png">
        <link rel="shortcut icon" type="image/png" sizes="16x16" href="https://www.redditstatic.com/accountmanager/favicon/favicon-16x16.png">
        <link rel="apple-touch-icon" sizes="180x180" href="https://www.redditstatic.com/accountmanager/favicon/apple-touch-icon-180x180.png">
        <link rel="mask-icon" href="https://www.redditstatic.com/accountmanager/favicon/safari-pinned-tab.svg" color="#5bbad5">
        
        <meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
        <meta name="msapplication-TileColor" content="#ffffff"/>
        <meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x310.png"/>
        <meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x150.png"/>
        <meta name="theme-color" content="#ffffff">
        
        

  <link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/vendor.4edfac426c2c4357e34e.css">

  <link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/theme.02a88d7effc337a0c765.css">


    </head>
    <body>

  <div class="Container m-desktop">
    <div class="PageColumns">
        
          
          <div class="PageColumn PageColumn__left">
          
            
<div class="Art"></div>

          </div>
        
        <div class="PageColumn PageColumn__right">
          
<div class="ColumnContainer">
  <div class="SnooIcon"></div>
  <h1 class="Title">404&mdash;Not found</h1>
  <p>
    The page you are looking for does not exist.
  </p>
</div>

        </div>
    </div>
</div>


        <script>
            //<![CDATA
                
                window.___r = {"config": {"tracker_endpoint": "https://events.reddit.com/v2", "tracker_key": "AccountManager3", "tracker_secret": "V2FpZ2FlMlZpZTJ3aWVyMWFpc2hhaGhvaHNoZWl3"}};
            //]]>
        </script>
        

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/vendor.33ac2d92b89a211b0483.js"></script>

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/theme.5333e8893b6d5b30d258.js"></script>

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/sentry.d25b8843def9b86b36ac.js"></script>


    </body>
</html>

I tried using login_url = f"https://www.reddit.com/user/{username}" as the login URL, but that does not work either.
I also tried https://www.reddit.com/login without the trailing slash; the status is 400 and p.text is empty.
I believe the username and password I entered are correct. Should the login URL be something different?

I noticed at https://www.reddit.com/login, the action is as follows:

<form class="AnimatedForm" action="/login" method="post">
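
A relative action like this is resolved against the URL of the page that contains the form. The following is a minimal sketch using only the standard library; the variable names are illustrative. It suggests that the form posts to the URL without the trailing slash, which may be why the trailing-slash URL returns 404.

```python
from urllib.parse import urljoin

# URL of the page that serves the form (the one used in the code above).
page_url = "https://www.reddit.com/login/"
# Value of the form's action attribute.
form_action = "/login"

# A leading "/" makes the action relative to the site root,
# so the path of page_url is replaced entirely.
post_url = urljoin(page_url, form_action)
print(post_url)  # https://www.reddit.com/login
```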

Answers

The Python requests library makes HTTP requests to web services, APIs (Application Programming Interfaces), and websites. It lets you send and receive data over HTTP.

However, it is not well suited to scraping pages behind a login like this one. I advise using the Selenium webdriver to log in to Reddit.

Here is example Selenium webdriver code that I used a few seconds ago to log in to Instagram:

from time import sleep

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# ---- Optional: keep the browser window open after the script ends ----
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", True)


class InstagramBot:  # illustrative wrapper class so that self.driver exists
    def __init__(self):
        self.driver = webdriver.Chrome(options=options)

    def login(self, user_id, pwd):  # user_id = your username, pwd = your password
        self.driver.get("https://www.instagram.com/accounts/login/")
        sleep(2)
        try:
            user_name = self.driver.find_element(By.XPATH, '//*[@id="loginForm"]/div/div[1]/div/label/input')
        except NoSuchElementException:
            print("user_name element not found")
        else:
            user_name.send_keys(user_id, Keys.TAB)

        sleep(2)
        try:
            user_pwd = self.driver.find_element(By.XPATH, '//*[@id="loginForm"]/div/div[2]/div/label/input')
        except NoSuchElementException:
            print("password element not found")
        else:
            user_pwd.send_keys(pwd, Keys.ENTER)

        sleep(5)
        # find_elements returns an empty list instead of raising, so check the result.
        buttons = self.driver.find_elements(By.CSS_SELECTOR, "button")
        not_now = [button for button in buttons if button.text == "Not Now"]
        if not_now:
            not_now[0].click()
        else:
            print("notification element not found")

Gathering information

If you inspect the network calls, you’ll see that the login request sends the following data:

login_data = {
    "csrf_token": "<RANDOM_VALUE>",
    "otp": "",
    "password": "PASSWORD",  # your password
    "dest": "https://www.reddit.com",
    "username": "USERNAME",  # your username
}

The problem is that the csrf_token is dynamic and changes on every request. So, what do we do?

Finding the `csrf_token`

The csrf_token is available when sending a GET request to the page. So, you can use a library such as BeautifulSoup to extract the token.
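
If you would rather not add a dependency, the same extraction can be sketched with the standard library's html.parser. This is only an illustration: the class name, the helper name, and the sample markup are hypothetical, and it assumes the token sits in an input element named csrf_token, as described above.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the value of the first <input name="csrf_token"> element."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "input" and self.token is None:
            attributes = dict(attrs)
            if attributes.get("name") == "csrf_token":
                self.token = attributes.get("value")


def extract_csrf_token(html):
    """Return the csrf_token value found in html, or None if it is absent."""
    parser = CsrfTokenParser()
    parser.feed(html)
    return parser.token


# Hypothetical, trimmed-down stand-in for the login page markup.
sample_html = """
<form class="AnimatedForm" action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
</form>
"""

print(extract_csrf_token(sample_html))  # abc123
```

Returning None when the input is missing lets the caller decide how to fail, instead of raising from inside the parser.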

Notes

I found that you need to set the content-type header to application/x-www-form-urlencoded.
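
For context, application/x-www-form-urlencoded means the request body is a single string of key=value pairs joined by &, with special characters percent-encoded. The sketch below uses the standard library to show what that body looks like for the login fields; the csrf_token value is a placeholder. As far as I know, requests encodes a dict passed as data= in the same way, so treat this as an illustration of the format, not of Reddit's exact request.

```python
from urllib.parse import parse_qs, urlencode

login_data = {
    "csrf_token": "abc123",  # placeholder; the real value comes from the login page
    "otp": "",
    "password": "PASSWORD",
    "dest": "https://www.reddit.com",
    "username": "USERNAME",
}

# Encode the fields the way an HTML form submission would.
body = urlencode(login_data)
print(body)
# csrf_token=abc123&otp=&password=PASSWORD&dest=https%3A%2F%2Fwww.reddit.com&username=USERNAME

# Decoding recovers the original values, including the empty otp field.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["dest"])  # ['https://www.reddit.com']
```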

Code example

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://www.reddit.com/login"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "content-type": "application/x-www-form-urlencoded",
}

login_data = {
    "otp": "",
    "password": "PASSWORD",  # Replace with your Reddit password
    "dest": "https://www.reddit.com",
    "username": "USERNAME",  # Replace with your Reddit username
}

with requests.Session() as session:
    session.headers.update(headers)

    # Get the CSRF token from the hidden input on the login page
    response = session.get(LOGIN_URL)
    soup = BeautifulSoup(response.content, "html.parser")
    token_input = soup.find("input", {"name": "csrf_token"})
    if token_input is None:
        raise RuntimeError("csrf_token input not found on the login page")
    login_data["csrf_token"] = token_input["value"]

    # Perform login
    response = session.post(LOGIN_URL, data=login_data)
    print(response)
