I am trying to use Python's re module to extract links from HTML that I scraped from the eBay website.
My question is: how can I extract the links that follow this pattern: https://www.ebay.com/itm/ + all other characters?
I can successfully match the https://www.ebay.com/itm/ part, but I am not sure how to match the rest.
The Python version I am using is 3.8.8.
Here is the code:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Universal+Adjustable+Hand+Shower+Holder+Suction+Cup+Holder+Full+Plating+Shower+Rail+Head+Holder+Bathroom+Bracket+Stable+rotation&_sacat=0'
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.find_all('li')
pattern = 'https://www.ebay.com/itm/'
results = re.findall('https://www.ebay.com/itm/', str(listings))
print(results)
2 Answers
To get links that start with https://www.ebay.com/itm/, you can do:
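One way to match "the prefix plus everything after it" with a regular expression, as a minimal sketch: escape the literal prefix, then consume characters until something that cannot be part of the URL inside an attribute (a quote, whitespace, or an angle bracket). The HTML fragment below is made up for illustration and stands in for `str(listings)` from the question; the item numbers are hypothetical.

```python
import re

# Hypothetical HTML fragment standing in for the scraped eBay page.
html = '''
<li><a href="https://www.ebay.com/itm/123456789012?hash=item1">Holder A</a></li>
<li><a href="https://www.ebay.com/itm/987654321098">Holder B</a></li>
<li><a href="https://www.ebay.com/sch/i.html?_nkw=holder">Search</a></li>
'''

# re.escape makes the dots in the prefix match literal dots only.
# The character class then takes everything up to the first quote,
# whitespace character, or angle bracket, i.e. the rest of the URL.
pattern = re.compile(re.escape('https://www.ebay.com/itm/') + r'[^"\'\s<>]+')

results = pattern.findall(html)
print(results)
```

The search-results link is skipped because it does not start with the /itm/ prefix. Wrapping `results` in `set(...)` would drop duplicates if the same item appears more than once.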
You could filter more efficiently in the CSS selector itself, using the ^= (starts with) attribute operator to match only links whose href begins with that string. Use a set comprehension to return only unique items.
from bs4 import BeautifulSoup
import requests
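A minimal sketch of the selector approach described above, run against a small made-up HTML fragment rather than the live eBay page. The fragment, the item numbers, and the use of the built-in html.parser (instead of lxml) are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; on the real page this would be r.content.
html = '''
<ul>
  <li><a href="https://www.ebay.com/itm/111">A</a></li>
  <li><a href="https://www.ebay.com/itm/111">A again</a></li>
  <li><a href="https://www.ebay.com/itm/222">B</a></li>
  <li><a href="https://www.ebay.com/sch/i.html">Search</a></li>
</ul>
<div><a href="https://www.ebay.com/itm/333">Outside any li</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# [href^="..."] keeps only <a> tags whose href starts with the prefix;
# the set comprehension collapses the duplicated /itm/111 link.
links = {a['href'] for a in soup.select('a[href^="https://www.ebay.com/itm/"]')}
print(sorted(links))
```

Because the selector does the filtering, no regular expression is needed, and the hrefs come back as clean attribute values rather than substrings of the serialized HTML.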
If you want to require that the link is a descendant of an li element, add that to the selector with a type selector and a descendant combinator:
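As a sketch of that variant, again on a made-up fragment (the HTML, item numbers, and html.parser choice are assumptions): the type selector `li` followed by a space, which is the descendant combinator, restricts matches to links nested somewhere inside a list item.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; one matching link sits outside any <li>.
html = '''
<ul>
  <li><a href="https://www.ebay.com/itm/111">A</a></li>
  <li><div><a href="https://www.ebay.com/itm/222">B, nested deeper</a></div></li>
</ul>
<div><a href="https://www.ebay.com/itm/333">Outside any li</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# "li a[...]" matches <a> tags at any depth inside an <li>,
# so the /itm/333 link in the bare <div> is excluded.
li_links = {a['href'] for a in soup.select('li a[href^="https://www.ebay.com/itm/"]')}
print(sorted(li_links))
```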