
I built a bot in Python that scrapes eBay product listing links from HTML.

Every link points to a product page, except the first one, which points to an error page like:

https://www.ebay.com/itm/01920391?epid=26039819083&_trkparms=ispr%3D1&hash=item3b542eae7a:g:FQkAAOSwK21gKvEZ&amdata=enc%3AAQAFAAACcBaobrjLl8XobRIiIML1V4Imu%252Fn%252BzU5L90Z278x5ickkrDx%252B2NLp21dg6hHbHAkGMYdiW1E6zjXxnQ0bf7c%252Fx%252Fvs5PW%252FYFw1ZdbGMi8wsGV6qXw8OFLl4Os1ACX3bnQxFkVpRib9hMb5gVyLha4q9L0xiporu5InbX0LrSgg7nCCCwtC7y3vOE3hc8PszsrXWLb5KFdj7%252BD98et12MdkEfMPFhJZuS%252BkFsp2esVTRCYctOhcwzPSdfzCOYprlr2miQc4czCv1Tcfs3LKUPJn8uQyRc%252BAnKY1oyTeYnJ7wYuGkBU%252FSVYjziLBaPhT%252FlVu0hR9ZX6OnAeRaJ1g0iCaDjrRXEXRwUO87riWeI8kExm1zzY7QicPeMnfWZdBvVhg05GOScPOlLTVPHakqGLX0y2GUXV6fkTLua3nSF5YBmLX%252FqdCxT6yS0dutVs5MPWvQYlN474hUzbubkZVAs7Y%252BBBEsHrGjVzCj0szZ6w1%252BHgkV5O9jrXGnyew5%252Bnxy7VCq5xEkUDIt1nSg996AeDksNmSNumhfsIOGltIXbqAbjqEUpPcVO%252BDPymxlh0iMxCZQalYnmljBRzKILYWkES0vfA14Gh5E7KWrztdC6WzEEFtgVuABakQ1eAOZnuEueqK6IakC%252BIfRbXv96Tv01IPDvwPeM8wMo6j8bMjY3D5KHS5EXPVdHKUnjCJiYCcVUqcKwhL6eN2MZ%252Bn9yxmWESUPN394NPrX%252FI2z7t0Bbo7iqmsWNQcyi0EHzDwJPMK%252FNSif8%252F2adRF7dT1JrbL9sryKSN2kv9OsdGQ0fMMC1LV3Ph43HivUJdqkgjGxqEqX5v1xQ%253D%253D%7Ccksum%3A25481541593068896952f4834d93a0bb998f5b5ba5fe%7Campid%3APL_CLK%7Cclp%3A2334524

Code

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


browser = webdriver.Chrome('/Users/admin/eBay/chromedriver')

#error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")


url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        page = browser.get(link)

How can I remove or ignore the error-page link?


Answers


  1. If you want to skip the first link, you can use list slicing with [1:]:

    ...
    
    for a in listings[1:]:  # <--- ignore first link
        link = a["href"]
        if link.startswith("https://www.ebay.com/itm/"):
            page = browser.get(link)
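
    The slice works because [1:] yields a new list starting at index 1, so the first anchor never enters the loop. A minimal standalone sketch (hypothetical link strings, no scraping):

```python
# Hypothetical hrefs standing in for the scraped anchors.
links = [
    "https://www.ebay.com/itm/error-first",  # the bad first link
    "https://www.ebay.com/itm/111",
    "https://www.ebay.com/itm/222",
]

# links[1:] skips the first element entirely.
kept = links[1:]
print(kept)  # -> ['https://www.ebay.com/itm/111', 'https://www.ebay.com/itm/222']
```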
    
  2. Cut out that link using an if statement:

    import time
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    
    browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')
    
    #error = browser.find_element_by_xpath("//*[@id='wrapper']/div[1]/div/div/p")
    
    
    url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=New+Big+Frame+Square+Sunglasses+Fashion+Trend+All-match+Women%27s+Sunglasses+Cross-border+Hot+Sale+Sunglasses&_sacat=0"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, features="lxml")
    
    listings = soup.select("li a")
    
    error_page ='https://www.ebay.com/itm/01920391?epid=26039819083&_trkparms=ispr%3D1&hash=item3b542eae7a:g:FQkAAOSwK21gKvEZ&amdata=enc%3AAQAFAAACcBaobrjLl8XobRIiIML1V4Imu%252Fn%252BzU5L90Z278x5ickkrDx%252B2NLp21dg6hHbHAkGMYdiW1E6zjXxnQ0bf7c%252Fx%252Fvs5PW%252FYFw1ZdbGMi8wsGV6qXw8OFLl4Os1ACX3bnQxFkVpRib9hMb5gVyLha4q9L0xiporu5InbX0LrSgg7nCCCwtC7y3vOE3hc8PszsrXWLb5KFdj7%252BD98et12MdkEfMPFhJZuS%252BkFsp2esVTRCYctOhcwzPSdfzCOYprlr2miQc4czCv1Tcfs3LKUPJn8uQyRc%252BAnKY1oyTeYnJ7wYuGkBU%252FSVYjziLBaPhT%252FlVu0hR9ZX6OnAeRaJ1g0iCaDjrRXEXRwUO87riWeI8kExm1zzY7QicPeMnfWZdBvVhg05GOScPOlLTVPHakqGLX0y2GUXV6fkTLua3nSF5YBmLX%252FqdCxT6yS0dutVs5MPWvQYlN474hUzbubkZVAs7Y%252BBBEsHrGjVzCj0szZ6w1%252BHgkV5O9jrXGnyew5%252Bnxy7VCq5xEkUDIt1nSg996AeDksNmSNumhfsIOGltIXbqAbjqEUpPcVO%252BDPymxlh0iMxCZQalYnmljBRzKILYWkES0vfA14Gh5E7KWrztdC6WzEEFtgVuABakQ1eAOZnuEueqK6IakC%252BIfRbXv96Tv01IPDvwPeM8wMo6j8bMjY3D5KHS5EXPVdHKUnjCJiYCcVUqcKwhL6eN2MZ%252Bn9yxmWESUPN394NPrX%252FI2z7t0Bbo7iqmsWNQcyi0EHzDwJPMK%252FNSif8%252F2adRF7dT1JrbL9sryKSN2kv9OsdGQ0fMMC1LV3Ph43HivUJdqkgjGxqEqX5v1xQ%253D%253D%7Ccksum%3A25481541593068896952f4834d93a0bb998f5b5ba5fe%7Campid%3APL_CLK%7Cclp%3A2334524'
    for a in listings:
        
        link = a["href"]
        if link.startswith("https://www.ebay.com/itm/") and link != error_page:
    
            page = browser.get(link)
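
    The exclusion logic can be checked without a browser. A small sketch (hypothetical link list; error_page here is a short stand-in, not the real URL):

```python
# Hypothetical hrefs; "error_page" is a stand-in for the full error URL.
error_page = "https://www.ebay.com/itm/error-page"
links = [
    error_page,
    "https://www.ebay.com/itm/123456789",
    "https://www.ebay.com/sch/i.html",  # not a product link
]

# Keep only product links that are not the known error page.
good = [l for l in links
        if l.startswith("https://www.ebay.com/itm/") and l != error_page]
print(good)  # -> ['https://www.ebay.com/itm/123456789']
```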
    
    
  3. I would go a similar way to @SIM and rely on faster CSS filtering, using CSS classes (generally the second-fastest way of matching nodes in CSS, after ids).

    links = [i['href'] for i in soup.select('#srp-river-results .s-item__link')]
    

    The introduction of the leading id limits results to the actual listings block.

    If you are worried that URLs with other start strings might occur (unlikely, given the consistent design of these pages), you can add a CSS attribute = value selector with the ^ starts-with operator:

    links = [i['href'] for i in soup.select('#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]
    

    If you want more info per listing, set listings as

    listings = soup.select('#srp-river-results .s-item')
    

    Then access links with:

    links = [listing.select_one('.s-item__link[href^="https://www.ebay.com/itm/"]')['href'] for listing in listings]
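
    Assuming BeautifulSoup (with its soupsieve CSS support) is installed, the starts-with selector can be verified on a tiny hypothetical HTML snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the search-results structure.
html = """
<div id="srp-river-results">
  <a class="s-item__link" href="https://www.ebay.com/itm/123">product</a>
  <a class="s-item__link" href="https://www.ebay.com/error">error page</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select(
    '#srp-river-results .s-item__link[href^="https://www.ebay.com/itm/"]')]
print(links)  # -> ['https://www.ebay.com/itm/123']
```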
    