
I’m trying to web scrape several websites, and one of them is "Lime Crime", which is (AFAIK) built on Shopify. I’m using the lxml library, but when I tried to use XPath to reach an element I got an empty array, even though the element actually exists on the web page.

import requests
from lxml import html
url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)

I’ve tried source_code.cssselect("a.CF-Product__ImageWrapper") and source_code.cssselect("CF-Product__ImageWrapper"), but neither worked.
Could anyone help me get all of the product links?

3 Answers


  1. This might simply be because the content you are looking for is loaded at a second stage by some JavaScript and is not present in the HTML returned at your specified URL.

    There is no way to get it from that response: the data just is not there. As an alternative, you can look into headless Chrome automation. Libraries that come to mind are Puppeteer and its Python port, pyppeteer.

    A headless browser library essentially runs an instance of a full browser, which downloads and executes every resource just as you would see on screen, and in the end gives you the full DOM to parse.
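
    A minimal sketch with pyppeteer, assuming the "CF-Product__ImageWrapper" class from the question is what actually appears in the rendered DOM:

    import asyncio
    from lxml import html
    from pyppeteer import launch

    async def fetch_rendered(url):
        # Launch headless Chromium, let the page's JavaScript run,
        # and return the fully rendered DOM as an HTML string.
        browser = await launch()
        page = await browser.newPage()
        await page.goto(url, {'waitUntil': 'networkidle0'})
        content = await page.content()
        await browser.close()
        return content

    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    rendered = asyncio.run(fetch_rendered(url))
    source_code = html.fromstring(rendered)
    # The original selector should now match, assuming the class name is correct.
    links = [a.get("href") for a in source_code.cssselect("a.CF-Product__ImageWrapper")]
    print(links)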

  2. No, it doesn’t, at least not in the HTML you downloaded. You are probably trying to select an element that is either generated by JavaScript, or at least has its class assigned by JavaScript.

    lxml will not run JavaScript code; it just parses the raw HTML you downloaded from that URL. You can take a look at that HTML from the terminal:

    curl -s "https://limecrime.com/collections/unicorn-hair-full-coverage" | grep "CF-Product__ImageWrapper"
    

    You can see it returns zero lines.

    If you want to take a look at the full response instead, just run:

    curl -s "https://limecrime.com/collections/unicorn-hair-full-coverage"

    This will show you exactly what is being parsed by your code.
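
    The same check can be done from Python, if you prefer. A small sketch using the URL from the question:

    import requests

    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    raw = requests.get(url).text
    # Equivalent of the curl | grep check above: count how often the class
    # name appears in the HTML the server actually returns.
    print(raw.count("CF-Product__ImageWrapper"))  # 0 -> not in the static HTML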

  3. To get the source code of the page, you can use requests and BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup

    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    # Send a browser-like User-Agent so the request is less likely to be blocked.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'}
    s = requests.session()
    s.headers.update(headers)
    response = s.get(url)  # use the session so the custom headers are actually sent
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup)
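
    This only prints the whole document. To pull the product links out of the soup, a minimal follow-up could look like the snippet below, assuming the a.CF-Product__ImageWrapper anchors are present in the fetched HTML (as the other answers note, they may only exist after JavaScript runs):

    # Select the anchors by the class from the question and collect their hrefs.
    # This only works if the elements exist in the static HTML returned by the server.
    links = [a.get("href") for a in soup.select("a.CF-Product__ImageWrapper")]
    print(links)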
    