I'm trying to scrape HTML data from Reddit while logged in, because the information I need appears only on the logged-in page, not on the logged-out one (background: "find_elements_by_xpath does not work and returns an empty list").
I am using the following code to request login, assuming the login URL is https://www.reddit.com/login/.
import requests
username="myuser"
password="password"
payload = {
'loginUsername': username,
'loginPassword': password
}
# Use a session so that cookies persist across requests.
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}
# Update the session's default headers rather than replacing them.
s.headers.update(headers)
#login_url = f"https://www.reddit.com/user/{username}"
#print(login_url)
p = s.post("https://www.reddit.com/login/", data=payload)
# print the html returned or something more intelligent to see if it's a successful login page.
print(p.text)
print(p.status_code)
However, the status code returned is 404, and I get the following for p.text:
<!DOCTYPE html>
<html lang="en-CA">
<head>
<title>
reddit.com: Not found
</title>
<link rel="shortcut icon" type="image/png" sizes="512x512" href="https://www.redditstatic.com/accountmanager/favicon/favicon-512x512.png">
<link rel="shortcut icon" type="image/png" sizes="192x192" href="https://www.redditstatic.com/accountmanager/favicon/favicon-192x192.png">
<link rel="shortcut icon" type="image/png" sizes="32x32" href="https://www.redditstatic.com/accountmanager/favicon/favicon-32x32.png">
<link rel="shortcut icon" type="image/png" sizes="16x16" href="https://www.redditstatic.com/accountmanager/favicon/favicon-16x16.png">
<link rel="apple-touch-icon" sizes="180x180" href="https://www.redditstatic.com/accountmanager/favicon/apple-touch-icon-180x180.png">
<link rel="mask-icon" href="https://www.redditstatic.com/accountmanager/favicon/safari-pinned-tab.svg" color="#5bbad5">
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
<meta name="msapplication-TileColor" content="#ffffff"/>
<meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x310.png"/>
<meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x150.png"/>
<meta name="theme-color" content="#ffffff">
<link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/vendor.4edfac426c2c4357e34e.css">
<link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/theme.02a88d7effc337a0c765.css">
</head>
<body>
<div class="Container m-desktop">
<div class="PageColumns">
<div class="PageColumn PageColumn__left">
<div class="Art"></div>
</div>
<div class="PageColumn PageColumn__right">
<div class="ColumnContainer">
<div class="SnooIcon"></div>
<h1 class="Title">404—Not found</h1>
<p>
The page you are looking for does not exist.
</p>
</div>
</div>
</div>
</div>
<script>
//<![CDATA
window.___r = {"config": {"tracker_endpoint": "https://events.reddit.com/v2", "tracker_key": "AccountManager3", "tracker_secret": "V2FpZ2FlMlZpZTJ3aWVyMWFpc2hhaGhvaHNoZWl3"}};
//]]>
</script>
<script type="text/javascript" src="https://www.redditstatic.com/accountmanager/vendor.33ac2d92b89a211b0483.js"></script>
<script type="text/javascript" src="https://www.redditstatic.com/accountmanager/theme.5333e8893b6d5b30d258.js"></script>
<script type="text/javascript" src="https://www.redditstatic.com/accountmanager/sentry.d25b8843def9b86b36ac.js"></script>
</body>
</html>
I tried using login_url = f"https://www.reddit.com/user/{username}" as the login URL, but it still does not work.
I tried using https://www.reddit.com/login without the slash at the end; the status is 400 and there is no output for p.text.
I believe the username and password I put in are correct. Should the login URL be something different?
I noticed that at https://www.reddit.com/login, the form action is as follows:
<form class="AnimatedForm" action="/login" method="post">
2 Answers
The Python requests library is used for making HTTP requests to interact with web services, APIs (Application Programming Interfaces), and websites. It allows you to send and receive data over the internet using the HTTP protocol.
However, it's not well suited to this kind of web scraping. I advise using the Selenium WebDriver to log in to Reddit.
Here is some example Selenium WebDriver code that I used a few seconds ago to log in to Instagram:
Gathering information
If you inspect the network calls in your browser's developer tools, you'll see the form data that the login request sends, which includes a csrf_token.
The problem is that the csrf_token is dynamic and changes for every request. So, what do we do?
Finding the csrf_token
The csrf_token is available when sending a GET request to the page. So, you can use a library such as BeautifulSoup to extract the token.
Notes
I found that you need to set the content-type header to application/x-www-form-urlencoded.
Code example
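The steps described above (GET the login page, pull out the csrf_token, then POST it together with the credentials) could be sketched roughly as follows. This is only a minimal sketch under several assumptions: that the login page still embeds the token in a hidden input named csrf_token, that the form fields are called username and password (check the network tab for the real names), and that the POST goes to the form's action, /login. The standard-library html.parser is swapped in for BeautifulSoup here purely to keep the sketch dependency-free; BeautifulSoup would do the same job. The helper names (extract_csrf_token, build_login_payload, login) are made up for illustration.

```python
from html.parser import HTMLParser

# Taken from the form's action="/login" shown in the question.
LOGIN_URL = "https://www.reddit.com/login"


class _InputValueFinder(HTMLParser):
    """Records the value of the first <input> whose name matches."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.value = attributes.get("value")


def extract_csrf_token(html, field_name="csrf_token"):
    """Return the token from the page HTML, or raise ValueError if absent."""
    finder = _InputValueFinder(field_name)
    finder.feed(html)
    if finder.value is None:
        raise ValueError(f"no <input name={field_name!r}> found in page")
    return finder.value


def build_login_payload(html, username, password):
    """Build the form data for the login POST.

    ASSUMPTION: the field names "username" and "password" are guesses;
    only "csrf_token" is named in the answer above.
    """
    return {
        "csrf_token": extract_csrf_token(html),
        "username": username,
        "password": password,
    }


def login(session, username, password):
    """Fetch a fresh token, then POST the credentials.

    `session` is expected to be a requests.Session, so the cookies set by
    the GET are sent back with the POST.
    """
    page = session.get(LOGIN_URL)
    page.raise_for_status()
    payload = build_login_payload(page.text, username, password)
    # requests already sends this content type for dict data; it is set
    # explicitly here only to mirror the note above.
    return session.post(
        LOGIN_URL,
        data=payload,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

With requests installed, this would be called as login(requests.Session(), "myuser", "password"), after which the returned response's status_code and text can be inspected the same way as in the question. Passing the session in, rather than creating it inside the function, means the same logged-in session can be reused for the later scraping requests.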