
I have almost no web scraping experience and wasn’t able to solve this using BeautifulSoup, so I’m trying Selenium (installed it today). I’m trying to scrape sold Oakley sunglasses listings on eBay from this search URL:

https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720

Here is my code, which fetches the page HTML with requests and then opens the same URL in a Selenium session:

    ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'

    html = requests.get(ebay_url)
    #print(html.text)

    driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
    driver.get(ebay_url)

Which correctly opens a new chrome session at the correct url. I’m working on getting the titles, prices, and date sold and then loading it into a csv file. Here is the code I have for those:

    # Find all listing containers and set equal to all_items
    all_items = driver.find_elements_by_class_name("s-item__info clearfix")[1:]
    #print(all_items)

    # Loop over all_items to extract the title, price, and date
    for item in all_items:
        date = item.find_element_by_xpath("//span[contains(@class, 'POSITIVE']").text.strip()
        title = item.find_element_by_xpath("//h3[contains(@class, 's-item__title s-item__title--has-tags']").text.strip()
        price = item.find_element_by_xpath("//span[contains(@class, 's-item__price']").text.strip()

        print('title:', title)
        print('price:', price)
        print('date:', date)
        print('---')
        data.append([title, price, date])

Which just returns []. I think ebay may be blocking my IP, but the html code loads in and looks correct. Hopefully someone can help! Thanks!

2 Answers


  1. You can use the code below to scrape the details. You can also use pandas to store the data in a CSV file.

    Code :

    ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'

    # the requests.get() call from the question is not needed when using Selenium
    driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
    driver.maximize_window()
    driver.implicitly_wait(30)
    driver.get(ebay_url)
    
    
    wait = WebDriverWait(driver, 20)
    # wait until at least one sold-date tag is present before scraping
    wait.until(lambda d: d.find_elements(By.XPATH, "//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']"))
    sold_date = []
    title = []
    price = []
    i = 1
    for item in driver.find_elements(By.XPATH, "//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']"):
        sold_date.append(item.text)
        title.append(driver.find_element(By.XPATH, f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::a/h3)[{i}]").text)
        price.append(driver.find_element(By.XPATH, f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::div[contains(@class,'details')]/descendant::span[@class='POSITIVE'])[{i}]").text)
        i += 1
    
    print(sold_date)
    print(title)
    print(price)
    
    data = {
             'Sold_date': sold_date,
             'title': title,
             'price': price
            }
    df = pd.DataFrame.from_dict(data)
    df.to_csv('out.csv', index=False)
    

    Imports :

    import pandas as pd
    from selenium import webdriver as wd
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    
  2. It is not necessary to use Selenium for eBay scraping: the data is not rendered by JavaScript, so it can be extracted from the plain HTML. The BeautifulSoup web scraping library is enough.
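    As an illustration of the CSS-selector extraction used in the code further down, here is a self-contained example run on a static HTML fragment (the markup is a simplified, hypothetical stand-in for eBay’s real listing markup):

```python
from bs4 import BeautifulSoup

# simplified, hypothetical stand-in for eBay's real listing markup
html = """
<div class="s-item__info">
  <div class="s-item__title"><span>Oakley Holbrook</span></div>
  <span class="s-item__price">$95.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".s-item__info")          # one listing container
title = item.select_one(".s-item__title span").text   # "Oakley Holbrook"
price = item.select_one(".s-item__price").text         # "$95.00"
```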

    Keep in mind that parsing problems may arise when you request a site many times: eBay may decide that the requests come from a bot rather than a real user.

    To avoid this, one approach is to send a user-agent header with the request, so the site treats you as a regular browser and returns the information.

    As an additional step, rotate those user-agents. The ideal setup is proxies combined with rotated user-agents (plus a CAPTCHA solver).
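    A minimal sketch of that rotation, assuming a hand-picked pool of user-agent strings (in practice you would use a much larger pool, or a library that maintains one):

```python
import random

# small illustrative pool; a real scraper would use many more strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
]

def random_headers():
    """Return headers with a user-agent picked at random for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
```

Pass the result as `headers=random_headers()` on each `requests.get` call so consecutive requests don’t all share one fingerprint.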

    from bs4 import BeautifulSoup
    import requests, json, lxml
    
    # https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
        }
        
    params = {
        '_nkw': 'oakley sunglasses',      # search query (requests URL-encodes the space itself)
        'LH_Sold': '1',                   # shows sold items
        '_pgn': 1                         # page number
    }
    
    data = []
    
    while True:
        page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(page.text, 'lxml')
        
        print(f"Extracting page: {params['_pgn']}")
    
        print("-" * 10)
        
        for products in soup.select(".s-item__info"):
            title = products.select_one(".s-item__title span").text
            price = products.select_one(".s-item__price").text
            link = products.select_one(".s-item__link")["href"]
            
            data.append({
              "title" : title,
              "price" : price,
              "link" : link
            })
    
        if soup.select_one(".pagination__next"):
            params['_pgn'] += 1
        else:
            break
    
    print(json.dumps(data, indent=2, ensure_ascii=False))
    

    Example output

    Extracting page: 1
    ----------
    [
      {
        "title": "Shop on eBay",
        "price": "$20.00",
        "link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"
      },
      {
        "title": "Oakley X-metal Juliet  Men's Sunglasses",
        "price": "$280.00",
        "link": "https://www.ebay.com/itm/265930582326?hash=item3deab2a936:g:t8gAAOSwMNhjRUuB&amdata=enc%3AAQAHAAAAoH76tlPncyxembf4SBvTKma1pJ4vg6QbKr21OxkL7NXZ5kAr7UvYLl2VoCPRA8KTqOumC%2Bl5RsaIpJgN2o2OlI7vfEclGr5Jc2zyO0JkAZ2Gftd7a4s11rVSnktOieITkfiM3JLXJM6QNTvokLclO6jnS%2FectMhVc91CSgZQ7rc%2BFGDjXhGyqq8A%2FoEyw4x1Bwl2sP0viGyBAL81D2LfE8E%3D%7Ctkp%3ABk9SR8yw1LH9YA"
      },
      {
        "title": " Used Oakley PROBATION Sunglasses Polished Gold/Dark Grey  (OO4041-03)",
        "price": "$120.00",
        "link": "https://www.ebay.com/itm/334596701765?hash=item4de7847e45:g:d5UAAOSw4YtjTfEE&amdata=enc%3AAQAHAAAAoItMbbzfQ74gNUiinmOVnzKlPWE%2Fc54B%2BS1%2BrZpy6vm5lB%2Bhvm5H43UFR0zeCU0Up6sPU2Wl6O6WR0x9FPv5Y1wYKTeUbpct5vFKu8OKFBLRT7Umt0yxmtLLMWaVlgKf7StwtK6lQ961Y33rf3YuQyp7MG7H%2Fa9fwSflpbJnE4A9rLqvf3hccR9tlWzKLMj9ZKbGxWT17%2BjyUp19XIvX2ZI%3D%7Ctkp%3ABk9SR8yw1LH9YA"
      },
    

    As an alternative, you can use the Ebay Organic Results API from SerpApi. It’s a paid API with a free plan that handles blocks and parsing on their backend.

    Example code that paginates through all pages:

    from serpapi import EbaySearch
    import os, json
    
    params = {
        "api_key": os.getenv("API_KEY"),      # serpapi api key    
        "engine": "ebay",                     # search engine
        "ebay_domain": "ebay.com",            # ebay domain
        "_nkw": "oakley+sunglasses",          # search query
        "_pgn": 1,                             # page number           
        "LH_Sold": "1"                        # shows sold items
    }
    
    search = EbaySearch(params)        # where data extraction happens
    
    page_num = 0
    
    data = []
    
    while True:
        results = search.get_dict()     # JSON -> Python dict
    
        if "error" in results:
            print(results["error"])
            break
        
        for organic_result in results.get("organic_results", []):
            link = organic_result.get("link")
            price = organic_result.get("price")
    
            data.append({
              "price" : price,
              "link" : link
            })
                        
        page_num += 1
        print(page_num)
        
        if "next" in results.get("pagination", {}):
            params['_pgn'] += 1
    
        else:
            break
    
        print(json.dumps(data, indent=2))
    

    Output:

    [
       {
        "price": {
          "raw": "$68.96",
          "extracted": 68.96
        },
        "link": "https://www.ebay.com/itm/125360598217?epid=20030526224&hash=item1d3012ecc9:g:478AAOSwCt5iqgG5&amdata=enc%3AAQAHAAAA4Ls3N%2FEH5OR6w3uoTlsxUlEsl0J%2B1aYmOoV6qsUxRO1d1w3twg6LrBbUl%2FCrSTxNOjnDgIh8DSI67n%2BJe%2F8c3GMUrIFpJ5lofIRdEmchFDmsd2I3tnbJEqZjIkWX6wXMnNbPiBEM8%2FML4ljppkSl4yfUZSV%2BYXTffSlCItT%2B7ZhM1fDttRxq5MffSRBAhuaG0tA7Dh69ZPxV8%2Bu1HuM0jDQjjC4g17I3Bjg6J3daC4ZuK%2FNNFlCLHv97w2fW8tMaPl8vANMw8OUJa5z2Eclh99WUBvAyAuy10uEtB3NDwiMV%7Ctkp%3ABk9SR5DKgLD9YA"
      },
      {
        "price": {
          "raw": "$62.95",
          "extracted": 62.95
        },
        "link": "https://www.ebay.com/itm/125368283608?epid=1567457519&hash=item1d308831d8:g:rnsAAOSw7PJiqMQz&amdata=enc%3AAQAHAAAA4AwZhKJZfTqrG8VskZL8rtfsuNtZrMdWYpndpFs%2FhfrIOV%2FAjLuzNzaMNIvTa%2B6QUTdkOwTLRun8n43cZizqtOulsoBLQIwy3wf19N0sHxGF5HaIDOBeW%2B2sobRnzGdX%2Fsmgz1PRiKFZi%2BUxaLQpWCoGBf9n8mjcsFXi3esxbmAZ8kenO%2BARbRBzA2Honzaleb2tyH5Tf8%2Bs%2Fm5goqbon%2FcEsR0URO7BROkBUUjDCdDH6fFi99m6anNMMC3yTBpzypaFWio0u2qu5TgjABUfO1wzxb4ofA56BNKjoxttb7E%2F%7Ctkp%3ABk9SR5DKgLD9YA"
      },
      # ...
    ]
    

    Disclaimer: I work for SerpApi.
