I’m a brand new coder who was tasked (by my company) with making a web scraper for eBay, to assist the CFO in finding inventory items when we need them. I’ve got it scraping multiple pages, but when the Pandas DataFrame loads, the number of results does not match the number of pages it’s supposed to be scraping. Here is the code (I’m using “ipads” as the search term just for the sheer volume and variance in the results):
import time
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []

# searchkey = input()
# base_url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    time.sleep(10)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    # walk every listing on the results page, then open each item page
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        soup2 = BeautifulSoup(requests.get(item_url).text, 'html.parser')
        for content in soup2.select('.lsp-c'):
            data.append({
                'item_name': content.select_one('h1.x-item-title__mainTitle > span').text,
                'name': 'Click Here to see Webpage',
                'url': str(item_url),
                'hot': 'Hot!' if content.select_one('div.d-urgency') else '',
                'condition': content.select_one('span.clipped').text,
                'price': content.select_one('div.x-price-primary > span').text,
                'make offer': 'Make Offer' if content.select_one('div.x-offer-action') else 'Contact Seller'
            })

df = pd.DataFrame(data)
df['link'] = df['name'] + '#' + df['url']

def make_clickable_both(val):
    name, url = val.split('#')
    return f'<a href="{url}">{name}</a>'

df2 = df.drop(columns=['name', 'url'])
df2.style.format({'link': make_clickable_both})
The results appear like so:
| | item_name | hot | condition | price | make offer | link |
|---|---|---|---|---|---|---|
| 0 | Apple iPad Air 2 2nd WiFi + Ce… | Hot! | Good – Refurbished | US $169.99 | Contact Seller | Click Here to see Webpage |
| 1 | Apple iPad 2nd 3rd 4th Generat… | Hot! | Used | US $64.99 | Contact Seller | Click Here to see Webpage |
| 2 | Apple iPad 6th 9.7" 2018 Wifi … | | Very Good – Refurbished | US $189.85 | Contact Seller | Click Here to see Webpage |
| 3 | Apple iPad Air 1st 2nd Generat… | Hot! | Used | US $54.89/ea | Contact Seller | Click Here to see Webpage |
| 4 | Apple 10.2" iPad 9th Generatio… | Hot! | Open box | US $269.00 | Contact Seller | Click Here to see Webpage |
| … | … | … | … | … | … | … |
| 300 | Apple iPad 8th 10.2" Wifi or… | | Good – Refurbished | US $229.85 | Contact Seller | Click Here to see Webpage |
Which is great! That last column is even a clickable link, just as the function defines, and it works properly. However, based on my URL it’s only about half the data I should have received.

In the URL, the two key pieces are `page_url = base_url + '&_pgn=' + str(page)`, which sets the page number for each URL I pull the list of links from, and `&_ipg=60`, which determines how many items are loaded on each page (eBay has 3 options for this: 60, 120, 240). So with my current settings (pagination giving me 10 pages and the item count set to 60), I should be seeing roughly 600 results, but instead I got 300. I added the timer to see if waiting a little between pages would help me get all the results, but I’ve had no such luck. Anyone got ideas about what I did wrong, or what I can do to improve? Any bit of info is appreciated!
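For reference, here is a quick sketch (using the same base_url as above) of the page URLs the loop requests, which is why I expected roughly 600 listings:

    for page in range(1, 11):
        print(base_url + '&_pgn=' + str(page))
    # https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60&_pgn=1
    # ...
    # https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60&_pgn=10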
2 Answers
I actually dug more into what popped up when parsing the HTML, and discovered it was because eBay denies bots access past 5 pages of results! So, changing my code to add:
it actually fixes the issue! Should have known.
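The exact snippet didn’t survive in this post. Purely as an illustration, and not necessarily what was changed above, a common workaround for this kind of bot blocking is to send a browser-like User-Agent header, since requests identifies itself as python-requests by default:

    # Hypothetical example only -- not necessarily the fix referenced above.
    # Some sites serve different (or empty) pages to the default python-requests User-Agent.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    soup = BeautifulSoup(requests.get(page_url, headers=headers).text, 'html.parser')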
Starting at page 5, the pages seem to be rendered differently and `soup.select('.srp-results li.s-item')` always returns an empty list (of URLs). That is why the length of `data` remains stuck at 300, even though there are more results. So there is nothing wrong with your code, and there is no need to pause for 10 seconds.
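A minimal way to see this for yourself (a sketch, assuming the same base_url as in the question) is to print how many listings each results page actually yields:

    for page in range(1, 11):
        page_soup = BeautifulSoup(requests.get(base_url + '&_pgn=' + str(page)).text, 'html.parser')
        print(page, len(page_soup.select('.srp-results li.s-item')))
    # The early pages each report roughly 60 listings; for the later pages the count drops to 0.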
Leaving the code otherwise unchanged, your best option is to set `&_ipg` to 240; you then get more, if not all, of the results (after a certain time).
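A minimal sketch of that change (only the query string differs from the original code):

    base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'
    # With 240 listings per results page, the few pages eBay will actually serve
    # to the script cover far more items than 60 per page did.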