I have a URL that works in Firefox with all cookies blocked and JavaScript turned off, and yet when I scrape it in Python with urllib, I get HTTP Error 403: Forbidden. I use the same user-agent as Firefox, and here is my code:
import ssl
import urllib.request

USER_AGENT_KEY = "User-Agent"
USER_AGENT_VALUE = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0'
TIMEOUT = 10  # seconds

def get_page(url):
    req = urllib.request.Request(url)
    req.add_header(USER_AGENT_KEY, USER_AGENT_VALUE)
    # Empty SSL context, only for public websites, don't use this for banks or anything with a sign-in!
    response = urllib.request.urlopen(req, context=ssl.SSLContext(), timeout=TIMEOUT)
    data = response.read()
    html = data.decode('utf-8')
    return html  # never reached: urlopen raises "HTTP Error 403: Forbidden"
I don’t know what mechanisms a site has to detect a user other than JavaScript, cookies, or the user-agent. If relevant, one URL is https://www.idealista.pt/comprar-casas/alcobaca/alcobaca-e-vestiaria/com-preco-max_260000,apenas-apartamentos,duplex/.
How can this site detect the scraper?
2 Answers
The provided URL is a dynamic website that appears to be built with React or a similar JS framework. This site won’t work without JavaScript. When you download the page with curl, you can see that you will have to enable JavaScript. This means you will not get any useful information by just downloading the HTML page.
The reason why you get 403 is that the page embeds a script from https://geo.captcha-delivery.com/ which returns 403. I cannot say exactly what this script does, but it looks like some kind of geoblocking / anti-bot API that blocks your request because it is missing some expected information.
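If you want to confirm this yourself, you can catch the HTTPError and inspect the body of the blocked response. The snippet below is only a minimal sketch (it reuses the URL and user-agent from the question) showing how to look at what the server actually sends back with the 403:

import urllib.error
import urllib.request

def fetch_raw(url, user_agent):
    # Same request as before, but keep the 403 body instead of losing it.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        # HTTPError is itself a file-like response: e.read() returns the error page,
        # which is where you can look for the captcha-delivery.com reference mentioned above.
        return e.code, e.read().decode("utf-8", errors="replace")

status, body = fetch_raw(
    "https://www.idealista.pt/comprar-casas/alcobaca/alcobaca-e-vestiaria/com-preco-max_260000,apenas-apartamentos,duplex/",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0",
)
print(status)
print("captcha-delivery" in body)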
Web scraping with urllib / requests is unreliable. Even if you are able to load the page without being detected, some websites use JavaScript to load the data afterwards. A good way to get around this is by using Selenium WebDriver or Playwright. Both of these tools allow you to simulate a web browser and interact with the page as if you were a real user.
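As a rough illustration, here is a minimal Playwright sketch, assuming you have installed it with pip install playwright and playwright install firefox. Note that sites behind aggressive bot protection may still challenge or block a headless browser, so treat this as a starting point rather than a guaranteed fix:

from playwright.sync_api import sync_playwright

URL = ("https://www.idealista.pt/comprar-casas/alcobaca/"
       "alcobaca-e-vestiaria/com-preco-max_260000,apenas-apartamentos,duplex/")

with sync_playwright() as p:
    # A real browser engine executes the site's JavaScript, unlike urllib.
    browser = p.firefox.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    html = page.content()  # fully rendered HTML after the JS has run
    browser.close()

print(len(html))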