As a fun experiment, I decided to scrape data from Google Shopping. It works perfectly on my local machine, but on my server it doesn't. Here is the code:
# Web driver file
import re
import time
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = False
driver = webdriver.Chrome(options=options, executable_path="/Users/kevin/Documents/projects/deal_hunt/scraper_scripts/chromedriver")

def get_items(url, category):
    driver.get(url)
    results = []
    content = driver.page_source
    soup = BeautifulSoup(content, features="lxml")
    # First, click on the image of every product that is on sale; that's the only
    # way to generate the class that lets us fetch the data we need.
    for element in soup.find_all(attrs="i0X6df"):
        # Not all items are on sale; those that are have a label, so we only keep those.
        sale_label = element.find('span', {'class': 'Ib8pOd'})
        if sale_label is None:
            pass
        else:
            # Take the id of the image from the page and click it dynamically.
            # Without this, selenium keeps clicking on the first picture.
            parent_div = element.find('div', {'class': 'ArOc1c'})
            image_tag = parent_div.find('img')
            image_to_click = driver.find_element_by_id(image_tag['id'])
            driver.execute_script("arguments[0].click();", image_to_click)
            time.sleep(5)
    items = driver.find_elements_by_class_name('_-oQ')
    for item in items:
        image_tag = item.find_element_by_class_name('sh-div__current').get_attribute('src')
        description = item.find_element_by_class_name('sh-t__title').get_attribute('text')
        link = item.find_element_by_class_name('sh-t__title').get_attribute('href')
        store = item.find_element_by_css_selector('._-oA > span').get_attribute('textContent')
        price = item.find_elements_by_class_name('_-pX')[0].get_attribute('textContent')
        old_price = item.find_elements_by_class_name('_-pX')[1].get_attribute('textContent')
        # We only take numbers, because the page returns a series of weird
        # characters and the price is found at the end of the string.
        price_array = price.split(',')
        price = ''.join(re.findall(r'\d+', price_array[0])) + '.' + price_array[1]
        old_price_array = old_price.split(',')
        old_price = ''.join(re.findall(r'\d+', old_price_array[0])) + '.' + old_price_array[1]
        # Remove the rand sign.
        price = price.replace("R ", "")
        # Replace the comma with a dot.
        price = price.replace(",", ".")
        # Extract the product url that is embedded inside the google url.
        url_to_parse = link
        parsed_url = urlparse(url_to_parse)
        product_url = parse_qs(parsed_url.query)['q'][0]
        results.append({
            'image': image_tag,
            'description': description,
            'store': store,
            'link': product_url,
            'price': float(price),
            'old_price': float(old_price)
        })
    # If we successfully scraped data, we print it; otherwise we skip.
    if len(results) > 0:
        print(results)
        print("Command has been perfectly executed")
    else:
        print("There is nothing to add")
When I run python3 main.py locally, it prints that the command has been perfectly executed, but on my Ubuntu server the same command immediately prints "There is nothing to add".
Answers
First, verify that everything needed is installed on your server, including matching Selenium and Python versions, and check the driver path on the server, because chromedriver may not be running there at all.
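Note also that your script sets options.headless = False, and a typical Ubuntu server has no display to attach a visible browser to. A minimal sketch of a server-friendly setup (the chromedriver path and extra flags are assumptions about your environment):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # a server has no display, so the browser must run headless
options.add_argument("--no-sandbox")  # commonly needed when Chrome runs as root
options.add_argument("--disable-dev-shm-usage")  # avoids crashes when /dev/shm is small
# Assumed server-side path; point this at wherever chromedriver lives on your box.
driver = webdriver.Chrome(options=options, executable_path="/usr/local/bin/chromedriver")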
As an additional recommendation, add checkpoints to the code to determine whether it fetches nothing from the very beginning or loses the data somewhere later. On the surface, I don't see anything in the code that would cause the error.
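For example, a few hypothetical checkpoints dropped into get_items would show whether the server even receives the same page (the print statements are illustrative, not part of your original script):

driver.get(url)
print(f"page source length: {len(driver.page_source)}")  # a tiny length often means a blocked or empty page

soup = BeautifulSoup(driver.page_source, features="lxml")
products = soup.find_all(attrs="i0X6df")
print(f"product tiles found: {len(products)}")  # 0 here means the selector never matches on the server

sale_items = [e for e in products if e.find('span', {'class': 'Ib8pOd'}) is not None]
print(f"tiles with a sale label: {len(sale_items)}")

If the first count is already near zero on the server, the problem is the request or the browser setup, not your parsing.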
As suggested in the other answer, you should debug your code and ensure that the requests are identical.
Alternatively, you could run the scraper inside containers to avoid any OS particularities. A more scalable option would be a cloud-based scraping environment like estela; although I have not tested it, you could also try using Scrapy with Selenium.
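If you go the Scrapy route, the scrapy-selenium package wires a webdriver into Scrapy's download cycle. A minimal sketch, assuming that package's documented settings (the spider name, URL, and driver path are placeholders):

# settings.py
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/chromedriver'  # assumed path
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # headless again, since the server has no display
DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

# spider
import scrapy
from scrapy_selenium import SeleniumRequest

class DealsSpider(scrapy.Spider):
    name = 'deals'  # placeholder name

    def start_requests(self):
        # Placeholder URL; use the Google Shopping URL you pass to get_items.
        yield SeleniumRequest(url='https://www.google.com/shopping', callback=self.parse)

    def parse(self, response):
        # response.text is the selenium-rendered page; reuse your BeautifulSoup parsing here.
        self.logger.info("rendered page length: %d", len(response.text))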