I am having difficulty displaying the product list. I want to scrape the data from a webpage, but I am very new to Python and web scraping. The print(productlist) call is not working.
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl = "https://www.thewhiskyexchange.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
k = requests.get('https://www.thewhiskyexchange.com/c/35/japanese-whisky')
soup=BeautifulSoup(k.text,'html.parser')
productlist = soup.find_all("li",{"class":"product-grid__item"})
print(productlist)
3 Answers
There is nothing wrong with your usage of BeautifulSoup. The problem lies within the site: it is protected by CloudFlare, and attempts to scrape the site will be faced with a JavaScript challenge, a form of CAPTCHA.
In this case, there is not much you can do to bypass CloudFlare.
You can verify this with curl by running curl -L https://www.thewhiskyexchange.com. The response shows the challenge page rather than the site's content, which is a sign that your scraper is being blocked.
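The same check can be done from Python. The sketch below is only a heuristic: the function name, the status codes, and the marker strings are assumptions chosen for illustration, and CloudFlare's challenge markup changes over time, so treat a positive result as a hint rather than proof.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristic: return True if a response resembles a CloudFlare challenge page."""
    server = headers.get("Server", "").lower()
    # Challenge pages are commonly served with 403 or 503 (assumption).
    blocked_status = status_code in (403, 503)
    # Assumed marker strings; these are not guaranteed to be present.
    markers = ("just a moment", "cf-chl", "checking your browser")
    has_marker = any(m in body.lower() for m in markers)
    return "cloudflare" in server and (blocked_status or has_marker)


# Usage with requests (needs network access, so it is left commented out):
# import requests
# r = requests.get("https://www.thewhiskyexchange.com")
# print(looks_like_cloudflare_challenge(r.status_code, r.headers, r.text))
```

If this returns True for your request, the HTML in k.text is the challenge page, and no BeautifulSoup selector will find the products in it.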
And as @nejdetckenobi said, the website uses JavaScript to load the products, so the components would not load with requests. You can use selenium instead. Learn more about selenium in its documentation.

The site is protected with CloudFlare. But even if you were able to pass the challenge (which is not possible afaik), that site cannot be parsed like that. There are JavaScript parts which run after the page is loaded. Since the requests library cannot run JavaScript, you won't get the exact page that you see when you open the link in your browser. This code with the requests library would only work for static pages that do not rely on JavaScript.
You should use a "headless web browser" or "web browser driver" with Selenium to get the exact page you see in your browser window.
You can find the documentation in the link below:
https://selenium-python.readthedocs.io/index.html
The steps should be like:
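One possible sequence, as a minimal sketch rather than a definitive recipe: start a headless browser, load the page, wait until the product items exist, then hand the rendered HTML to BeautifulSoup. It assumes Selenium 4 with a local Chrome install; the helper names fetch_rendered_html and extract_product_items and the 15-second timeout are made up for illustration. CloudFlare may still block a headless browser, so this is not guaranteed to get through.

```python
from bs4 import BeautifulSoup

URL = "https://www.thewhiskyexchange.com/c/35/japanese-whisky"


def fetch_rendered_html(url, timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one product item is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "li.product-grid__item"))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_product_items(html):
    """Return the product <li> elements, using the selector from the question."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("li", {"class": "product-grid__item"})


# Usage (needs Chrome and network access, so it is left commented out):
# html = fetch_rendered_html(URL)
# productlist = extract_product_items(html)
# print(len(productlist))
```

Keeping the parsing step separate from the browser step means the BeautifulSoup code from the question can stay unchanged; only the way the HTML is obtained differs.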
You are using BeautifulSoup correctly 🙂
But you will need to access this webpage in another way than a simple requests.get(), because what you are looking for in productlist, aka {"class":"product-grid__item"}, is not part of the returned string in k.text.
You can check the contents of k.text using another print, like so: print(f"k.text contains: {k.text}")
For me this yields the following string.
Maybe you need to look at another link or use another tool for your product-grid__item, as it is not part of your current k.text.
For a clue about what is wrong, look at the returned k.text.
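Printing the whole of k.text can be hard to read, so a small summary may help. The sketch below reports the length of the HTML, whether the class name appears in it at all, and a short preview. The function name summarize_response and the 300-character preview length are hypothetical choices for illustration.

```python
def summarize_response(html, marker="product-grid__item", preview_chars=300):
    """Summarize fetched HTML: its size, whether marker occurs, and a short preview."""
    return {
        "length": len(html),
        "has_marker": marker in html,
        "preview": html[:preview_chars],
    }


# Usage with the question's variable k (left commented out, needs network access):
# info = summarize_response(k.text)
# print(info["has_marker"])
# print(info["preview"])
```

If has_marker is False, find_all has nothing to match and returns an empty list, however correct the selector is.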