I’m trying to scrape several websites; one of them is "Lime Crime", which (AFAIK) is built on Shopify. I’m using the lxml library, but when I use an XPath expression to select an element, I get an empty list, even though the element exists on the web page.
import requests
from lxml import html
url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)
I’ve tried source_code.cssselect("a.CF-Product__ImageWrapper") and source_code.cssselect("CF-Product__ImageWrapper"), but neither worked.
Could anyone help me get all the links of the products?
3 Answers
This is most likely because the content you are looking for is loaded in a second stage by some JavaScript and is not present in the HTML page at your specified URL.
There is no way to get it from response: the data simply is not there. As an alternative, you can look into headless Chrome automation. Libraries that come to mind are puppeteer and its Python port, pyppeteer. A headless browser library lets you run an instance of a full browser, which parses and downloads every resource just as you would see on screen, and in the end gives you the full DOM to parse.
No, it doesn’t work, because you are probably trying to parse an element that is either generated by JavaScript, or at least has its class assigned by JavaScript. lxml will not run JavaScript code; it just parses the raw HTML you downloaded from that URL. You can check the HTML from the terminal: searching the downloaded page for that class name returns zero lines.
If you want to take a look at the actual response, you can use just:
curl -s "https://limecrime.com/collections/unicorn-hair-full-coverage"
This will show you exactly what is being parsed by your code.
To get the source code of pages, you can use requests and BeautifulSoup.
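A sketch of that approach; the URL and selector come from the question. Note that whether any links actually appear in the raw HTML depends on the JavaScript-rendering caveat from the other answers:

```python
import requests
from bs4 import BeautifulSoup


def product_links(html_text: str) -> list:
    """Return the href of every anchor carrying the CF-Product__ImageWrapper class."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [a.get("href") for a in soup.select("a.CF-Product__ImageWrapper")]


if __name__ == "__main__":
    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    response = requests.get(url)
    # May print [] if the links are injected client-side by JavaScript.
    print(product_links(response.text))
```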