Html - Scraping a webpage with all the variants for a particular product with BeautifulSoup in Python

Archana
September 20, 2023
202 views
3 votes
3 Answers

I am having difficulty displaying the product list. I want to scrape the data from a webpage, but I am very new to Python and web scraping. The print(productlist) call is not working.

import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = "https://www.thewhiskyexchange.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,  like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

k = requests.get('https://www.thewhiskyexchange.com/c/35/japanese-whisky')
soup=BeautifulSoup(k.text,'html.parser')
productlist = soup.find_all("li",{"class":"product-grid__item"})
print(productlist)

Answers

- BrandonLi
- September 20, 2023 at 11:09 am
- 0 votes
There is nothing wrong with your usage of BeautifulSoup. The problem lies with the site: it is protected by Cloudflare, and attempts to scrape it are met with a JavaScript challenge, a form of CAPTCHA.

In this case, there is not much you can do to bypass Cloudflare.

You can verify this by using curl: curl -L https://www.thewhiskyexchange.com. In the response, you can see this:
```
<span id="challenge-error-text">Enable JavaScript and cookies to continue</span>
```
which is a sign that your scraper is being blocked.
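The same check can be done from Python instead of curl. The following is only a minimal sketch: the two marker strings are taken from the challenge page quoted in this thread, and treating them as reliable indicators is an assumption. The names `CHALLENGE_MARKERS`, `looks_like_challenge` and `fetch_and_check` are made up for illustration.

```python
# Substrings that appear in the challenge page quoted in this thread.
# Treating them as reliable markers is an assumption, not a Cloudflare guarantee.
CHALLENGE_MARKERS = ("challenge-error-text", "_cf_chl_opt")


def looks_like_challenge(html):
    """Return True if the HTML contains any of the challenge markers."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


def fetch_and_check(url):
    """Fetch url and return (status_code, looks_like_challenge)."""
    # Third-party import kept local so the helper above has no dependencies
    import requests

    response = requests.get(url, timeout=30)
    return response.status_code, looks_like_challenge(response.text)
```

Calling `fetch_and_check` on the category URL from the question should give `True` as the second value for as long as the site answers plain HTTP clients with the challenge page.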

And as @nejdetckenobi said, the website uses JavaScript to load the products, so the components would not load with requests. The following is an example using selenium instead:
```
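from selenium import webdriver as wd
from selenium.webdriver.common.by import By
import time

URL = 'https://www.thewhiskyexchange.com/c/35/japanese-whisky'


def main():
    # Pass options by keyword so the call does not depend on positional argument order
    driver = wd.Chrome(options=wd.ChromeOptions())

    try:
        # Poll for up to 20 seconds when looking up elements that are not in the DOM yet
        driver.implicitly_wait(20)
        driver.get(URL)

        # Fixed pause to give the page's JavaScript time to render the products
        time.sleep(5)

        products = driver.find_elements(By.CLASS_NAME, 'product-card__name')

        print([p.text for p in products])
    finally:
        # Always close the browser, even if something above raises
        driver.quit()


if __name__ == '__main__':
    main()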
```
Learn more about Selenium in its documentation.

- nejdetckenobi
- September 20, 2023 at 11:09 am
- 0 votes
The site is protected by Cloudflare. But even if you were able to pass the challenge (which is not possible, afaik), that site cannot be parsed like that. Parts of the page are built by JavaScript that runs after the page is loaded. Since the requests library cannot run JavaScript, you won't get the same page that you see when you open the link in your browser. This code with the requests library would only work for static pages whose content does not depend on JavaScript.

You should be using a "headless web browser" or "web browser driver" with Selenium, to be able to get the exact page you see in your browser window.

You can find the documentation in the link below:
https://selenium-python.readthedocs.io/index.html

The steps should be like:
- Open the webpage with Selenium
- Wait until the page is loaded (JS stuff will be processed by the web driver)
- After waiting, get the HTML source from Selenium and pass it to BeautifulSoup
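
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for this site: the class name `product-card__name` comes from the other answer in this thread and may not match the live page, the helper names `extract_product_names` and `fetch_rendered_html` are made up, and a browser driven by Selenium may still be stopped by the Cloudflare challenge.

```python
from typing import List

from bs4 import BeautifulSoup


def extract_product_names(html: str) -> List[str]:
    """Parse rendered HTML and return the text of every product-name element."""
    soup = BeautifulSoup(html, "html.parser")
    # Class name taken from this thread; it is an assumption about the page's markup
    tags = soup.find_all(class_="product-card__name")
    return [tag.get_text(strip=True) for tag in tags]


def fetch_rendered_html(url: str, timeout: float = 20.0) -> str:
    """Open url in Chrome, wait for the products to appear, return the page source."""
    # Selenium is imported here so the parsing helper can be used without it
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # headless mode; drop this line to watch the browser
    driver = webdriver.Chrome(options=options)
    try:
        # Step 1: open the webpage
        driver.get(url)
        # Step 2: wait until at least one product name is present in the DOM
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "product-card__name"))
        )
        # Step 3: hand the rendered HTML back to the caller for BeautifulSoup
        return driver.page_source
    finally:
        driver.quit()
```

With both helpers in place, `extract_product_names(fetch_rendered_html(url))` ties the three steps together. Keeping the parsing separate from the browser code means the BeautifulSoup part can be tried on any saved HTML file first.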

- BoKristensen
- September 20, 2023 at 11:14 am
- 0 votes
You are using BeautifulSoup correctly 🙂
But you will need to access this webpage in some other way than a simple requests.get().

That is because what you are looking for in productlist, i.e. {"class":"product-grid__item"}, is not part of the string returned in k.text.

You can check the contents of k.text with another print, like this: print(f"k.text contains: {k.text}")

For me this yields the string shown at the end of this answer.
Maybe you need to look at another link, or use another tool to get at product-grid__item, as it is not part of your current k.text.

For a clue to what is wrong, look in the returned k.text:
```
<span id="challenge-error-text">Enable JavaScript and cookies to continue</span> 
```
k.text: Just a moment… Enable JavaScript and cookies to continue
(function(){window._cf_chl_opt={cvId:
‘2’,cZone: "www.thewhiskyexchange.com",cType: ‘interactive’,cNounce:
‘45162’,cRay: ‘8098df96ecbfbe4c’,cHash: ‘1cf80c6fbf1a491’,cUPMDTk:
"/c/35/japanese-whisky?__cf_chl_tk=VFxprfxk2K7uMv5y0LpXXlK_dih6FnMe3TwITIiFP6s-1695200377-0-gaNycGzNCfs",cFPWv:
‘b’,cTTimeMs: ‘1000’,cMTimeMs: ‘0’,cTplV: 5,cTplB: ‘cf’,cK:
"visitor-time",fa:
"/c/35/japanese-whisky?_cf_chl_f_tk=VFxprfxk2K7uMv5y0LpXXlK_dih6FnMe3TwITIiFP6s-1695200377-0-gaNycGzNCfs",md:
"5XihUl6T63j9F7hlv4wbmqggz9GgVMAO.1AgdNd9M60-1695200377-0-AT-SWJ4pLjfTHh1KwUSjMvpy6ktRTUDkrD4WDXOafPGNAsu8LEt3hYkTdc_tNxcaTST2AgYa4WU5vGgkAYB5yXRAEonsWe–er-AWe6ffER3PMXl692b2c6KA552e9ahh79FzxPvgDoioIXIK8EYafFjfq80nxBLiy55QfgvCxq425N4NSozOHA3nVkpHrOYpScH8FlkZAE4rEsSu_hMSGIw0Qd6pEV4FMwSlTAKF5AfVL_BgRAjcZE076Lb1Nxdfr89FI77XxvGnMjimmxXXGRSkHDGPfKGgF0RTAB90ETVVHOfl8W-Nh5pAjIZERSDGIu2v3Sf-GRCnfRtDj0IXIAMTvVS_aES8E0no9HYlhIPJ-XdAQDd5Dm6YNUN9SXOjoQYGX4G16wF0ka1HKbfW1GU64Q1F71uM-vnoFkrFOU2fC8hb1Y7T-xrDweMQCZg0b2vAy1Id2U3wZh7MO7bwgsJ35DmcuP60CjunWuXAP9bS1kgOXkhKks9a1RN5d4TJNSYIAK0S8WHPu3Y5rtz4jcoxSTAMPLNgkMA3lBxi2fl2Oa-fjxffzyXpQGBz3rnrH7YxDIx6w2PhMYSaPAMEV1VzmV61rd5uR3Fm_HKHGFV15V5JrEQ8m4_rtkW4IzBEh4wRg1KQZTFy8qHBiqooPxItOPcBaAGokDe7gYi2A5fxao7f8qilJ1MHghhW436rbg872ZL8OZDHzsI3z-y_QIjN5Ncpu2uUWjfHFGe-qStrMlusJp6-JidzsdfG9ERj0AWdfIwp2dLff6LbaoA4PwBKnJsvcSqt7E0KB3W0jyb8ulriSOb1mFzXl9STkn7u9arts_7qVVLJR82bN81dBwkyf0aoqc3nZK98i3dv8zDo_Q4OtWfayiUcj4uwZtMcupxXbxSim7T1llhDCbdsjh0GeiWNOKAL-iClod2ru-Iziq_IID8NzUM8RbGVl98Ooeiqa1AVHBmNBTg3Nno6vuRKJWGp5VlkaHfLX0rTUAr43kPiLxqkzJA88snXuaXc3hmFB_74z-ddRDYt1n6Jzm8i9NUtEk76hERfI_1TtoV9sR8MaNo0A8NXMulu_KdMpdZX2IyNs-lq_DonUgfsaZndO5zgSFChsPpqJOQClCIxnIze0pC5VTjqkeTQXQ__DkK4Mt1A8ubzLAkydDXzWyD_3ooToYaOZNeQuUlcrlwMzDi2fn1CB_tFRrAksVDNAhLP7a1WfDd768jLhCGrfG_SQxTfy7fhf-rGfH8U0flEQHda2SshfusSaurdOVCPI4cIgMznEUhHBdWBst79bIgXbpCa6A_53E8wNteOBYNhBsGGRrdh5Wh2tWe8cvwp7lP3f5tz9PG_555Rs6BZx-YKIh4t-IAtXDTS7E_BvwEayNhL8XWdWH-Bg074utgU_IdA7acHszhAMlXlm8GaWgTKw0WhDw6ipBJrtzcVY-pHALdJlBTXxfsPIsPE6oBNBhmVQTpeHpIFvq6V9ypwTxopgb8ySoYS17ViS0ZSSRbihLtGEoj8S-qBP8-Y_1Yhu_UhGLD9J5LKHza8R4Qar4-KUKaf7yfeKBm6MVTFEi_sqiMYGmjmfbv2QyXHHjAF3MYsRSMWQGQaP3XDJ_sl9j2Wd8_82cfUKwn9ut6EfmzRtiqTMTHA3jNIrt0qNZffIhBI8yjYpud5SGgK0AOrwGY_fCYXGnTek6Ez6Z7QdQf2N2sDJ3CMa8KaQvaG1KZbEwrcW3IBet32RA54pePEQHQOBgb1Fi4rUrp6TzwWhmOzd5TRmU52ep5hMqNZXrqkcAyCyTl0fJQFdDjlrI7zzjDZu_5BLVoQ1Va0PAZN-UxuR7aatPtO2HLAasKvDFkNabVwUpOw1ivOLvsRSUgKQIZYjTLdFyyfkVMut6fEMUH65lb89Y1MvEgn8aBriDLMuJ4zjbgU89khvxYikUGyza
6Lj6BP7huIsDpxI3JeZU12-tVzUHkCBZVEjW9Z7kvGFOI20VfGRX4ukzNZ0an_PtF3AP-exf0zZ3zM9s73lpmJv3rQ1c8JP6COSRAxumgntEqUe1NfUs3RNm056WNSvRvdqtUisLaqBMw592bEc5hTQz_sQrmFakRt3r_prmDWd4PVw4dLmxrDkBXp0cLVzFCyDNK2rXneD9eLnaNpdzZaGtZ3fPBf1D1wilDyavdXCnn-ibDA4lGymdeRqwPKLV-bZLv2m1dtnwpHb9K7KjfSUjFyyXfo4DbtYON5OOxSOHwsjMTFeEdhd31JT-_SamVuXkntC2mIutkJc20RvNkJ1Erf9crXYHRWy3muQdQQWZPartYMiLSFNn1jQs-5OA79zqQ0AcgDHE4jKNEEpP5QDqFZniOqrEhSl7pL_890eigayz1N_dmtSe62-2Py4c-J9ZB3zgsFjR7xk2z5B2xxgacVS7JDFss0XxahwdHyiGPhojh1ChlhF3H2qGy0yAMoqnm3eJRhLXRCBIWYKvwpXxG-fD4XZfSO2-OBRuDA",cRq:
{ru:
‘aHR0cHM6Ly93d3cudGhld2hpc2t5ZXhjaGFuZ2UuY29tL2MvMzUvamFwYW5lc2Utd2hpc2t5’,ra:
‘cHl0aG9uLXJlcXVlc3RzLzIuMzEuMA==’,rm: ‘R0VU’,d:
‘JLbxFktqD2tP7FpVa04CFkkVQn7UyCxHXJttuNhO6cWq6Tiq3v7R+45u9vexHyIIgic4OnJUivb+/wMvcvf+1v1dt5hpQYPW8jf5RTRpHptxLJgTwfzezI6h6A6xoFxLm9CamevA9PpsV7F6ZMeZXBfc2TqhOtVRQ6mUFa1XOwrp7I4WLFESubrd383dvoKOTVb3f6x07teL+LQRn15UbRDRShMkA2bYmoeEWVTGK2CEeXaqaV/3NcUcOjyPLgptFRtGsk/Xngmfjx0rAf+Dn2t/FmujRO5zVLGczuJpyYhVeNy9d1AwvAGbcBZoLfstPsiqY+L7pgwqDuP12vOnJTiUcHiBfLPTU1qVdJglwfNxN3CVUmRh2Vt8QZoVgUwvtekgk7vJEfEtM9SQt9Ec06vXh4M+fM0RVlQ9JqyC7YPt+haZL9RsimlRJVLvVDIpSwivepmxSm8nb9PaspXm3WHx6NZMAm38Uvdd0N0GHN4m9oSwixUYs1hiaYE5T9EXYInX8GPhEEWKUr0S+8+qO62fU71G5krgZ359DU4hN8w=’,t:
‘MTY5NTIwMDM3Ny40MzkwMDA=’,cT: Math.floor(Date.now() / 1000),m:
‘a2AsyNIN/NAij7HFCGrX3XvzT5X+w1ymY4tyAvbok+k=’,i1:
‘gUidhQr6vHAcLFLeJUIaGg==’,i2: ’78MQeBOnbKsIzIbpivDnGA==’,zh:
‘Lu11UPKpct80h4UpHjnFr7549vZ5EGi2V7KjmFdfcoc=’,uh:
‘YE9XOpG5TeHmhA1zfs5mxC8CrRZzq2a/+r+OU7dliYQ=’,hh:
‘1iL0YmRIkuteIeg9zu2NzR9zkRexldgCzJYoCkFcSM8=’,}};var cpo =
document.createElement(‘script’);cpo.src =
‘/cdn-cgi/challenge-platform/h/b/orchestrate/chl_page/v1?ray=8098df96ecbfbe4c’;window._cf_chl_opt.cOgUHash
= location.hash === ” && location.href.indexOf(‘#’) !== -1 ? ‘#’ : location.hash;window._cf_chl_opt.cOgUQuery = location.search === ” &&
location.href.slice(0, location.href.length –
window._cf_chl_opt.cOgUHash.length).indexOf(‘?’) !== -1 ? ‘?’ :
location.search;if (window.history && window.history.replaceState)
{var ogU = location.pathname + window._cf_chl_opt.cOgUQuery +
window._cf_chl_opt.cOgUHash;history.replaceState(null, null,
"/c/35/japanese-whisky?__cf_chl_rt_tk=VFxprfxk2K7uMv5y0LpXXlK_dih6FnMe3TwITIiFP6s-1695200377-0-gaNycGzNCfs" + window._cf_chl_opt.cOgUHash);cpo.onload = function() {history.replaceState(null, null,
ogU);}}document.getElementsByTagName(‘head’)[0].appendChild(cpo);}());
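
Some of the short values in the cRq block of that output are plain base64, and decoding them shows what the server saw. The sketch below copies four values from the output above. Reading `ru` as the requested URL, `ra` as the User-Agent, `rm` as the HTTP method and `t` as a timestamp is an inference from the decoded text, not documented Cloudflare behaviour, and the names `ENCODED_FIELDS` and `decode_fields` are made up for illustration.

```python
import base64

# Values copied from the cRq block of the k.text output in this answer
ENCODED_FIELDS = {
    "ru": "aHR0cHM6Ly93d3cudGhld2hpc2t5ZXhjaGFuZ2UuY29tL2MvMzUvamFwYW5lc2Utd2hpc2t5",
    "ra": "cHl0aG9uLXJlcXVlc3RzLzIuMzEuMA==",
    "rm": "R0VU",
    "t": "MTY5NTIwMDM3Ny40MzkwMDA=",
}


def decode_fields(fields):
    """Base64-decode every value and return a dict of readable strings."""
    return {
        name: base64.b64decode(value).decode("utf-8")
        for name, value in fields.items()
    }


DECODED = decode_fields(ENCODED_FIELDS)
for name, value in DECODED.items():
    print(f"{name}: {value}")
# ru: https://www.thewhiskyexchange.com/c/35/japanese-whisky
# ra: python-requests/2.31.0
# rm: GET
# t: 1695200377.439000
```

The `ra` value decodes to python-requests/2.31.0 rather than the Chrome string from the question, which fits the code as posted: the `headers` dictionary is defined but never passed to `requests.get`. Whether passing `headers=headers` would be enough to get past the challenge is not something this thread establishes.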