
The script below is meant to look through eBay listings on the eBay search results page. The search page is just a list, so I am trying to loop through each li tag and append its content to a variable. For some reason this script doesn't seem to work and I'm not sure why.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the url
url = "https://www.ebay.co.uk/sch/i.html?_from=R40&_nkw=funko+gamora+199&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_PrefLoc=1&_ipg=200"

# Connect to the website and return the HTML to the variable 'page'
try:
    page = urlopen(url)
except Exception as e:
    # Exit here - otherwise 'page' is undefined and the next line fails anyway
    raise SystemExit(f"Error opening the URL: {e}")

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')

# Find the <ul> that contains the search results
content = soup.find('ul', {"class": "srp-results srp-list clearfix"})

#print(content)

article = ''
for i in content.find_all('li'):
    article = article + ' ' + i.text
print(article)

# Saving the scraped text
with open('scraped_text.txt', 'w') as file:
    file.write(article)

Can anyone see where I’m going wrong?

2 Answers


  1. This is what the response looks like:

    print(soup.text)
    

    Security measureSkip to main content Please verify yourself to continueerror To keep eBay a safe place to buy and sell, we will occasionally ask you to verify yourself. This helps us to block unauthorised users from entering our site.Please verify yourselfIf you’re having difficulties with the rendering of images on the above verification page, eBay suggests using the latest version of your browser or an alternate browser listed in here Additional site navigationAbout eBayAnnouncementsCommunitySafety CentreResolution CentreSeller CentreVeRO: Protecting Intellectual PropertyPoliciesHelp & ContactSite MapCopyright © 1995-2021 eBay Inc. All Rights Reserved. User Agreement, Privacy, Cookies and AdChoiceNorton Secured – powered by Verisign

    It's an error on eBay's end: instead of the search results, eBay returned a verification page, so your code has nothing to parse. The code itself looks fine at first glance. Also, note that web scraping is a grey area and some companies do not allow it; you might need to bypass security measures.
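    As a rough sketch, one way to spot this situation before parsing is to look for marker strings from the verification page. The marker strings below are assumptions taken from the response pasted above, not an official API:

```python
def looks_blocked(html_text: str) -> bool:
    """Heuristic: True if the HTML looks like eBay's verification page.
    The marker strings are guesses taken from the response pasted above."""
    markers = ("Please verify yourself", "Security measure")
    return any(marker in html_text for marker in markers)

# A snippet of the blocked response shown above
blocked_html = "Security measureSkip to main content Please verify yourself to continue"
print(looks_blocked(blocked_html))  # True: this is the verification page, not results
```

    Checking this up front gives a clear error message instead of a confusing AttributeError later in the script.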

    Also, you should comment your code in a way that tells the reader WHY it does what it does, not what it does. You don't need to comment lines like soup = BeautifulSoup(page, 'html.parser').

    Edit: I forgot to mention that the error appears because

    content = soup.find('ul', {"class": "srp-results srp-list clearfix"})
    

    found no results: find() returned None, so content.findAll('li') raises an AttributeError.
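    A minimal guard avoids that AttributeError when find() returns None. The HTML snippet here is a hypothetical stand-in for the page eBay actually returned:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page eBay actually returned (no results list in it)
html = "<html><body><p>Please verify yourself</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

content = soup.find('ul', {"class": "srp-results srp-list clearfix"})
if content is None:
    # find() returned None, so calling content.find_all('li') would raise AttributeError
    print("Results list not found - probably a verification page")
else:
    article = ' '.join(li.text for li in content.find_all('li'))
    print(article)
```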

  2. Most likely you are getting a CAPTCHA or hitting an IP rate limit. See ways to avoid being blocked.

    If you need to extract all results from all pages using pagination, the solution is to use non-token pagination and test for an element (such as the 'next page' button) whose absence ends the loop:

    if soup.select_one(".pagination__next"):   # checking for 'next page' button
        params['_pgn'] += 1                    # if there is a button, it will go to the next page
    else:                                      # otherwise, the loop exits
        break
    

    You can also exit the loop after a set number of retrieved pages by adding a limit:

    limit = 5                     # page limit
    
    # other code
    
    if params['_pgn'] == limit:   # if the page number is equal to the specified limit, the loop is terminated
        break
    

    Code example with pagination in the online IDE.

    from bs4 import BeautifulSoup
    import requests, json, lxml
    
    # https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    }
       
    params = {
        "_nkw": "iphone_14",    # search query example
        "LH_Sold": "1",         # shows sold items
        "_pgn": 1               # page number
    }
    
    data = []
    limit = 5                 # page limit (if needed)
    while True:
        page = requests.get("https://www.ebay.co.uk/sch/i.html", params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(page.text, "lxml")

        print(f"Extracting page: {params['_pgn']}")
        print("-" * 10)

        for products in soup.select(".s-item__info"):
            title = products.select_one(".s-item__title span").text
            price = products.select_one(".s-item__price").text

            data.append({
                "title": title,
                "price": price
            })

        if params['_pgn'] == limit:
            break
        if soup.select_one(".pagination__next"):
            params['_pgn'] += 1
        else:
            break
    
    print(json.dumps(data, indent=2, ensure_ascii=False))
    

    Example output:

    [
      {
        "title": "Case For iPhone 11 Pro Max 14Pro 8 7  SE 2022  Shockproof Silicone Cover colours",
        "price": "£3.99"
      },
      {
        "title": "Ring Holder Magnetic Shockproof Case Cover For iPhone  14Pro Max 11 XR  XS 12 13",
        "price": "£5.99 to £6.99"
      },
      {
        "title": "Apple iPhone 14 - 128GB - Space Black (Unlocked) A2890 (GSM)",
        "price": "£641.95"
      },
      other results ...
    ]
    

    As an alternative, you can use the Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocking and parsing on their backend.

    Example code with pagination:

    from serpapi import EbaySearch
    import json
    
    params = {
        "api_key": "...",                 # serpapi key, https://serpapi.com/manage-api-key   
        "engine": "ebay",                 # search engine
        "ebay_domain": "ebay.co.uk",      # ebay domain
        "_nkw": "iphone_14",              # search query
        "LH_Sold": "1",                   # shows sold items
        "_pgn": 1                         # page number
    }
    
    search = EbaySearch(params)           # where data extraction happens
    
    limit = 5
    page_num = 0
    data = []
    
    while True:
        results = search.get_dict()     # JSON -> Python dict
    
        if "error" in results:
            print(results["error"])
            break
        
        for organic_result in results.get("organic_results", []):
            title = organic_result.get("title")
            price = organic_result.get("price")
    
            data.append({
                "title": title,
                "price": price
            })

        page_num += 1
        print(page_num)

        if params['_pgn'] == limit:
            break
        if "next" in results.get("pagination", {}):
            params['_pgn'] += 1
        else:
            break
    
    print(json.dumps(data, indent=2, ensure_ascii=False))
    

    Output:

    [
      {
        "title": "Apple iPhone 14 Plus Midnight - 512GB - Unlocked - MINT CONDITION",
        "price": {
          "raw": "£749.99",
          "extracted": 749.99
        }
      },
      {
        "title": "New listingApple iPhone 14 Plus (PRODUCT)RED - 128GB (Unlocked)",
        "price": {
          "raw": "£750.00",
          "extracted": 750.0
        }
      },
      other results ...
    ]
    

    There's a "13 ways to scrape any public data from any website" blog post if you want to know more about web scraping.
