
I’m trying to web scrape several websites, and one of them is "Lime Crime", which is (AFAIK) built on Shopify. I’m using the lxml library, but when I tried to use XPath to reach an element I got an empty array, even though the element actually exists on the web page.

import requests
from lxml import html
url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)

I’ve tried source_code.cssselect("a.CF-Product__ImageWrapper") and source_code.cssselect("CF-Product__ImageWrapper"), but neither worked.
Could anyone help me get all of the product links?

3 Answers


  1. This might simply be because the content you are looking for is loaded at a second stage by some JavaScript and is not present in the HTML returned at your specified URL.

    There is no way to get it from that response: the data just is not there. As an alternative, you can look into headless Chrome automation. Libraries that come to mind are Puppeteer and its Python port, pyppeteer.

    A headless browser library essentially runs an instance of a full browser, which downloads and executes every resource just as you would see on screen, and in the end gives you the full DOM to parse.
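
    A minimal sketch with pyppeteer, assuming the "CF-Product__ImageWrapper" class from the question is what actually appears in the rendered DOM:

    import asyncio
    from lxml import html
    from pyppeteer import launch

    async def fetch_rendered(url):
        # Launch headless Chromium, let the page's JavaScript run,
        # and return the fully rendered DOM as an HTML string.
        browser = await launch()
        page = await browser.newPage()
        await page.goto(url, {'waitUntil': 'networkidle0'})
        content = await page.content()
        await browser.close()
        return content

    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    rendered = asyncio.run(fetch_rendered(url))
    source_code = html.fromstring(rendered)
    # The original selector should now match, assuming the class name is correct.
    links = [a.get("href") for a in source_code.cssselect("a.CF-Product__ImageWrapper")]
    print(links)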

  2. No, it doesn’t, at least not in the HTML you downloaded. You are probably trying to select an element that is either generated by JavaScript, or at least has its class assigned by JavaScript.

    lxml will not run JavaScript code; it just parses the raw HTML you downloaded from that URL. You can take a look at that HTML from the terminal:

    curl -s "https://limecrime.com/collections/unicorn-hair-full-coverage" | grep "CF-Product__ImageWrapper"
    

    You can see it returns zero lines.

    If you want to take a look at the full response instead, just run:

    curl -s "https://limecrime.com/collections/unicorn-hair-full-coverage"

    This will show you exactly what is being parsed by your code.
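
    The same check can be done from Python, if you prefer. A small sketch using the URL from the question:

    import requests

    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    raw = requests.get(url).text
    # Equivalent of the curl | grep check above: count how often the class
    # name appears in the HTML the server actually returns.
    print(raw.count("CF-Product__ImageWrapper"))  # 0 -> not in the static HTML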

  3. To get the source code of the page, you can use requests and BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup

    url = "https://limecrime.com/collections/unicorn-hair-full-coverage"
    # Send a browser-like User-Agent so the request is less likely to be blocked.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'}
    s = requests.session()
    s.headers.update(headers)
    response = s.get(url)  # use the session so the custom headers are actually sent
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup)
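
    This only prints the whole document. To pull the product links out of the soup, a minimal follow-up could look like the snippet below, assuming the a.CF-Product__ImageWrapper anchors are present in the fetched HTML (as the other answers note, they may only exist after JavaScript runs):

    # Select the anchors by the class from the question and collect their hrefs.
    # This only works if the elements exist in the static HTML returned by the server.
    links = [a.get("href") for a in soup.select("a.CF-Product__ImageWrapper")]
    print(links)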
    