So I’m building an eBay web scraper for work (I should note that I am incredibly new to programming in general, and entirely self-taught using the internet), and I have it functioning. I am building this with Python 3.11, in a Jupyter Notebook within Azure Data Studio. However, the CSV it returns (and consequently the Excel sheet) contains multiple empty rows:
name,condition,price,options,shipping
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
['Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good'],['Good - Refurbished'],$149.00 to $199.00,['Buy It Now'],
,,,,
,,,,
,,,,
['Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good'],['Good - Refurbished'],$139.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
['Apple iPad 2nd 3rd 4th Generation 16GB 32GB 64GB 128GB PICK:GB - Color *Grade B*'],['Pre-Owned'],$64.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
etc. . .
Here is my code:
import time
import requests
import lxml
import selenium
import html5lib
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.headless = True
options.page_load_strategy = 'none'
chrome_path = ChromeDriverManager().install()
s = Service(chrome_path)
driver = Chrome(options=options, service=s)  # headers=headers once I can get it working again
driver.implicitly_wait(5)
browser = webdriver.Chrome(service=s)

# searchkey = input()  <-- this commented-out portion is for when I have it more functional, so that I can build a more dynamic URL
# url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'

data = []
browser.get(url)
time.sleep(10)
content = browser.find_element(By.CSS_SELECTOR, "div[class*='srp-river-results']")
item_contents = content.find_elements(By.TAG_NAME, "li")

def extract_data(content):
    name = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__title']>span")
    if name:
        name = [attr.text for attr in name]
    else:
        name = None
    condition = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__subtitle']>span")
    if condition:
        condition = [attr.text for attr in condition]
    else:
        condition = None
    price = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__price']")
    if price:
        price = price[0].text
    else:
        price = None
    purchase_options = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__purchaseOptionsWithIcon']")
    if purchase_options:
        purchase_options = [attr.text for attr in purchase_options]
    else:
        purchase_options = None
    shipping = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__logisticsCost']")
    if shipping:
        shipping = [attr.text for attr in shipping]
    else:
        shipping = None
    return {
        "name": name,
        "condition": condition,
        "price": price,
        "options": purchase_options,
        "shipping": shipping
    }

for content in item_contents:
    extracted_data = extract_data(content)
    data.append(extracted_data)

df = pd.DataFrame(data)
df.to_csv("frame.csv", index=False)
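(As a self-contained side note, not part of my scraper: the empty rows are exactly what pandas writes when extract_data returns a dict whose values are all None, and the bracketed cells come from list values being written via their repr. A minimal offline illustration:)

```python
import io

import pandas as pd

# One all-None row (what extract_data returns for a non-item <li>)
# and one row shaped like a real scraped item.
rows = [
    {"name": None, "condition": None, "price": None,
     "options": None, "shipping": None},
    {"name": ["iPad 5"], "condition": ["Good"], "price": "$149.00",
     "options": ["Buy It Now"], "shipping": ["Free shipping"]},
]
df = pd.DataFrame(rows)

buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
# The first data line is just bare commas; the second carries
# the ['...'] brackets because the cells hold Python lists.
```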
Now, looking into the HTML with the Inspect tool, I discovered what I think the problem is. Because I am using just the "li" tag for the "item_contents" variable, it seems to be pulling in the river/carousel at the top of the results (which lives in the same div class and is also stored in "li" elements), and on top of that, each item card can carry a "Top Rated" status whose element includes 3 additional "li" elements.
The problem is, I don’t actually know how to fix this. I attempted to adjust the selector to include the "data-viewport" attribute, but that didn’t seem to work with either By.CSS_SELECTOR or By.TAG_NAME, like so:
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport]")
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport*='trackableId']")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport]")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport*='trackableId']")
giving me entirely blank dataframes instead. I’ve tried searching for how to better select CSS elements, but I am struggling to find what I want; the answers I’ve found seem geared toward different problems than mine. Using dropna works to just clear out those empty rows, but I feel like there must be a better way to select my tags so that I don’t end up with data like this. If there isn’t, though, I can just continue like that. I’m just wanting to learn how to program better, I suppose. Any assistance would be great! Thanks in advance!
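(For reference, since dropna came up: the pandas call that removes only the fully empty rows, while keeping partially filled ones, is dropna(how='all'). A tiny self-contained example:)

```python
import pandas as pd

df = pd.DataFrame([
    {"name": None, "price": None},           # empty filler row
    {"name": "iPad 5", "price": "$149.00"},  # real item row
])

# how='all' drops a row only when *every* column is missing,
# so rows with any real data survive.
cleaned = df.dropna(how="all")
print(cleaned)  # only the iPad 5 row remains
```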
2 Answers
Change your selection strategy and use a dict instead of several lists. But you do not need the selenium overhead; simply use requests.
Based on HedgeHog’s answer. What I can highly recommend is using XPath and the lxml library to parse the HTML instead of BeautifulSoup, as it is much faster. Comparison between
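A minimal sketch of the lxml/XPath variant this answer recommends, using the same assumed class names and an inline snippet so it runs offline:

```python
from lxml import html

# Hypothetical snippet standing in for eBay's result list.
SNIPPET = """
<ul class="srp-results">
  <li class="s-item">
    <div class="s-item__title"><span>Apple iPad 5 32GB</span></div>
    <span class="s-item__price">$149.00</span>
  </li>
  <li class="srp-river-answer">skipped</li>
</ul>
"""

tree = html.fromstring(SNIPPET)
items = []
# The XPath predicate keeps the item-card filtering in one expression;
# lxml evaluates it in C, which is where the speed win comes from.
for li in tree.xpath("//li[contains(@class, 's-item')]"):
    name = li.xpath(".//div[contains(@class, 's-item__title')]/span/text()")
    price = li.xpath(".//span[contains(@class, 's-item__price')]/text()")
    items.append({
        "name": name[0] if name else None,
        "price": price[0] if price else None,
    })
print(items)
```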