
I am trying to crawl Product Hunt using Selenium.

More specifically, I am trying to get the source link for every product's icon.

HTML:

here is the html code

My code for crawling is the following:

import time

from selenium import webdriver

driver = webdriver.Chrome("<Your driver's path>")
driver.get("https://www.producthunt.com/topics/seo-tools?order=most-upvoted")
time.sleep(4)
icons = driver.find_elements_by_css_selector("div.styles_thumbnail__d2DAK.styles_thumbnail__XBHZ_ img")
print(len(icons))
print(icons)
driver.close()

The problem is that Selenium only gets the first 3 pictures, not all of the products available.

I have tried increasing the sleep time, and I have also implemented WebDriverWait with EC.presence_of_all_elements_located to make sure that all icons are loaded properly.

2 Answers


  1. Since the other icons only show when you scroll to the bottom of the page, you can do something like this:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    driver.get("https://www.producthunt.com/topics/seo-tools?order=most-upvoted")
    
    expected_number_of_icons = 20
    
    icons = []
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        icons = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@data-test, 'post-item')]//div[@class='styles_thumbnail__d2DAK styles_thumbnail__XBHZ_']//img | //div[contains(@class, 'styles_link')]//span[@class='lazyload-wrapper']/img")))
        icons = list(set(icons))
        if len(icons) > expected_number_of_icons:
            break
    
    icons = icons[:expected_number_of_icons]
    driver.close()
    

    where you choose to stop once you reach the number of icons that you want. Obviously, if you reach 210 icons but only want 200, you can discard the last 10 elements of the list.
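    One caveat not covered by the answer above: `list(set(icons))` discards the page order of the elements, so the final slice is not guaranteed to be the *first* N products. If order matters, an order-preserving dedup helper can be used instead — a minimal sketch in plain Python (shown here with plain values standing in for WebElements):

    ```python
    def dedup_keep_order(items):
        """Remove duplicates while preserving first-seen order.

        Unlike list(set(items)), this keeps elements in the order they
        were first encountered, so slicing afterwards really does give
        the first N unique items.
        """
        seen = set()
        result = []
        for item in items:
            if item not in seen:   # keep only the first occurrence
                seen.add(item)
                result.append(item)
        return result

    # Example with plain values standing in for WebElements:
    print(dedup_keep_order([3, 1, 3, 2, 1]))  # -> [3, 1, 2]
    ```

    In the loop above you would write `icons = dedup_keep_order(icons)` in place of `icons = list(set(icons))`.
    
    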

  2. To print the value of the src attribute you can use either of the following Locator Strategies:

    • Using css_selector:

      print([my_elem.get_attribute("src") for my_elem in driver.find_elements_by_css_selector("span.lazyload-wrapper > img")])
      
    • Using xpath:

      print([my_elem.get_attribute("src") for my_elem in driver.find_elements_by_xpath("//span[@class='lazyload-wrapper']/img")])
      

    Ideally, you should induce WebDriverWait with visibility_of_all_elements_located(), using either of the following Locator Strategies:

    • Using CSS_SELECTOR:

      driver.get('https://www.producthunt.com/topics/seo-tools?order=most-upvoted')
      print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "span.lazyload-wrapper > img")))])
      
    • Using XPATH in a single line:

      driver.get('https://www.producthunt.com/topics/seo-tools?order=most-upvoted')
      print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='lazyload-wrapper']/img")))])
      
    • Note: You have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      