I have a URL that works in Firefox with all cookies blocked and JavaScript turned off, and yet when I scrape it in Python with urllib, I get HTTP Error 403: Forbidden. I use the same user-agent as Firefox, and here is my code:
import ssl
import urllib.request

USER_AGENT_KEY = "User-Agent"
USER_AGENT_VALUE = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0'
TIMEOUT = 10  # seconds

def get_page(url):
    req = urllib.request.Request(url)
    req.add_header(USER_AGENT_KEY, USER_AGENT_VALUE)
    # Empty SSL context, only for public websites, don't use this for banks or anything with a sign-in!
    response = urllib.request.urlopen(req, context=ssl.SSLContext(), timeout=TIMEOUT)
    data = response.read()
    html = data.decode('utf-8')
    return html  # never reached: urlopen raises "HTTP Error 403: Forbidden"
I don’t know what mechanisms a site has to detect a user other than JavaScript, cookies, or the user-agent. If relevant, one URL is https://www.idealista.pt/comprar-casas/alcobaca/alcobaca-e-vestiaria/com-preco-max_260000,apenas-apartamentos,duplex/.
How can this site detect the scraper?
2 Answers
The provided URL is a dynamic website that appears to be built with React or a similar JS framework. This site won’t work without JavaScript. When you download the page with curl, you can see that you will have to enable JavaScript. This means you will not get any useful information by just downloading the HTML page.
The reason why you get 403 is that the page embeds a script from https://geo.captcha-delivery.com/ which returns 403. I cannot say exactly what this script does, but it looks like some kind of geoblocking / anti-bot API that blocks your request because it is missing some expected information.
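If you want to confirm this yourself, you can catch the HTTPError and inspect the body of the blocked response. The snippet below is only a minimal sketch (it reuses the URL and user-agent from the question) showing how to look at what the server actually sends back with the 403:

import urllib.error
import urllib.request

def fetch_raw(url, user_agent):
    # Same request as before, but keep the 403 body instead of losing it.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        # HTTPError is itself a file-like response: e.read() returns the error page,
        # which is where you can look for the captcha-delivery.com reference mentioned above.
        return e.code, e.read().decode("utf-8", errors="replace")

status, body = fetch_raw(
    "https://www.idealista.pt/comprar-casas/alcobaca/alcobaca-e-vestiaria/com-preco-max_260000,apenas-apartamentos,duplex/",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0",
)
print(status)
print("captcha-delivery" in body)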
Web scraping with urllib / requests is unreliable. Even if you are able to load the page without being detected, some websites use JavaScript to load the data afterwards. A good way to get around this is by using Selenium WebDriver or Playwright. Both of these tools allow you to simulate a web browser and interact with the page as if you were a real user.
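As a rough illustration, here is a minimal Playwright sketch, assuming you have installed it with pip install playwright and playwright install firefox. Note that sites behind aggressive bot protection may still challenge or block a headless browser, so treat this as a starting point rather than a guaranteed fix:

from playwright.sync_api import sync_playwright

URL = ("https://www.idealista.pt/comprar-casas/alcobaca/"
       "alcobaca-e-vestiaria/com-preco-max_260000,apenas-apartamentos,duplex/")

with sync_playwright() as p:
    # A real browser engine executes the site's JavaScript, unlike urllib.
    browser = p.firefox.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    html = page.content()  # fully rendered HTML after the JS has run
    browser.close()

print(len(html))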