
As a fun experiment, I decided to scrape data from Google Shopping. It works perfectly on my local machine, but on my server it doesn't. Here is the code:

#Web driver file

import re
import time
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = False
driver = webdriver.Chrome(options=options, executable_path="/Users/kevin/Documents/projects/deal_hunt/scraper_scripts/chromedriver")

def get_items(url, category):
    driver.get(url)
    results = []
    content = driver.page_source
    soup = BeautifulSoup(content, features="lxml")

    # First, click on every product image that is on sale; that is the only way
    # to generate the class that will let us fetch the data we need.
    for element in soup.find_all(attrs="i0X6df"):
        # Not all items are on sale; those that are carry a label, and we only
        # keep those.
        sale_label = element.find('span', {'class': 'Ib8pOd'})
        if sale_label is None:
            continue

        # Take the id of the image from the page and click on it dynamically.
        # Without this, Selenium would keep clicking on the first picture.
        parent_div = element.find('div', {'class': 'ArOc1c'})
        image_tag = parent_div.find('img')

        image_to_click = driver.find_element_by_id(image_tag['id'])
        driver.execute_script("arguments[0].click();", image_to_click)

        time.sleep(5)

    items = driver.find_elements_by_class_name('_-oQ')

    for item in items:

        image_tag = item.find_element_by_class_name('sh-div__current').get_attribute('src')
        description = item.find_element_by_class_name('sh-t__title').get_attribute('text')
        link = item.find_element_by_class_name('sh-t__title').get_attribute('href')

        store =  item.find_element_by_css_selector('._-oA > span').get_attribute('textContent')
        
        price = item.find_elements_by_class_name('_-pX')[0].get_attribute('textContent') 
        old_price = item.find_elements_by_class_name('_-pX')[1].get_attribute('textContent')

        # Keep only the digits: the page returns a series of odd characters and
        # the price sits at the end of the string.
        price_array = price.split(',')
        price = ''.join(re.findall(r'\d+', price_array[0])) + '.' + price_array[1]

        old_price_array = old_price.split(',')
        old_price = ''.join(re.findall(r'\d+', old_price_array[0])) + '.' + old_price_array[1]

        # Remove the rand sign
        price = price.replace("R ", "")

        # Replace the comma with the dot
        price = price.replace(",", ".")

        #we're trying to get the url of the product inside the google url
        url_to_parse = link
        parsed_url = urlparse(url_to_parse)
        product_url = parse_qs(parsed_url.query)['q'][0]
        results.append({
            'image': image_tag,
            'description': description,
            'store': store,
            'link': product_url,
            'price': float(price),
            'old_price': float(old_price)
        })
    # If we successfully scraped data we print it; otherwise we skip.
    if len(results) > 0:
        print(results)
        print("Command has been perfectly executed")
    else:
        print("There is nothing to add") 

When I run python3 main.py locally, it reports that the command has been perfectly executed, but on my Ubuntu server the same command immediately prints "There is nothing to add".


Answers


  1. Verify that the necessary dependencies are installed on your server, including matching Selenium and Python versions, and check the chromedriver path on the server, because the driver may not be running there.

    As an additional recommendation, add checkpoints to the code to establish whether it fetches nothing from the start or loses the data somewhere later. Superficially, I don't see anything in the code that would generate the error.
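
    A minimal sketch of such checkpoints, reusing the selectors and names from the question's code (the log lines are only illustrative):

    from bs4 import BeautifulSoup

    # Log how much of the page each stage actually sees, so you can tell
    # where the local and server runs diverge.
    driver.get(url)
    print("page_source length:", len(driver.page_source))       # ~0 => page never loaded

    soup = BeautifulSoup(driver.page_source, features="lxml")
    print("product tiles:", len(soup.find_all(attrs="i0X6df")))  # 0 => markup differs on the server

    items = driver.find_elements_by_class_name('_-oQ')
    print("expanded items:", len(items))                         # 0 => the image clicks never happened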

  2. As suggested in the other answer, you should debug your code and ensure that the requests sent from both machines are identical.

    Alternatively, you could try running the spider inside a container to avoid any OS particularities. A more scalable option would be a cloud-based scraping environment like estela; although I have not tested it, you could try using Scrapy with Selenium.
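
    One OS particularity worth ruling out first: the question's code sets options.headless = False, and a typical Ubuntu server has no display, so Chrome may never render the page. A sketch of a server-friendly setup follows; the flags are standard Chrome switches, but whether they fix this exact case is an assumption, and the driver path is a placeholder:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")               # no display on a typical server
    options.add_argument("--no-sandbox")             # commonly required inside containers
    options.add_argument("--disable-dev-shm-usage")  # /dev/shm is small in many containers
    options.add_argument("--window-size=1920,1080")  # give the page a real viewport
    driver = webdriver.Chrome(options=options, executable_path="/path/to/chromedriver")  # hypothetical path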
