I’ve created a script in Python to parse different links from a webpage. There are two sections on the landing page: Top Experiences and More Experiences. My current attempt can fetch the links from both categories.

Only a few of the links I want to collect are under the Top Experiences section at this moment. However, when I traverse the links under the More Experiences section, I can see that they all lead to a page containing a section named Experiences, under which there are links similar to those under Top Experiences on the landing page. I want to grab them all.

One such desirable link looks like: https://www.airbnb.com/experiences/20712?source=seo
My current attempt fetches the links from both the categories:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    items = [urljoin(link,item.get("href")) for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq")]
    return items

if __name__ == '__main__':
    for item in get_links(URL):
        print(item)
How can I parse all the links under the Top Experiences section, along with the links under the Experiences section that can be found by traversing the links under More Experiences?
Please check out the image if anything is unclear. I used the pen tool in Paint, so the writing may be a little hard to read.
4 Answers
It seems that both the “Top Experiences” and “More Experiences” links share the same class, so you can just use .find_all to obtain the links. Refactor the code to fit your coding paradigm.
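A minimal sketch of the .find_all approach, assuming the shared class is _1f0v6pq (the class used in the question's selector). The helper name links_by_class and the sample HTML, including the experiences_pdp-L1-0 path, are made up for illustration; on the real page you would pass in the text of the response instead.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.airbnb.com"


def links_by_class(html, base=BASE, css_class="_1f0v6pq"):
    """Return absolute URLs of every <a> tag that carries css_class."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base, a["href"])
        for a in soup.find_all("a", class_=css_class, href=True)
    ]


# Hand-written sample standing in for the real page (hypothetical markup).
sample = """
<a class="_1f0v6pq" href="/experiences/20712?source=seo">Top</a>
<a class="_1f0v6pq" href="/sitemaps/v2/experiences_pdp-L1-0">More</a>
<a class="other" href="/help">Help</a>
"""

for url in links_by_class(sample):
    print(url)
```

Because both sections share the class, this returns the two kinds of link mixed together; telling them apart still needs a second criterion.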
You can scrape the links from the divs with class "_12kw8n71". The output posted with this answer covered only the top experiences and part of the links from More Experiences, as the full output exceeds Stack Overflow’s character limit.
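One way this could look, as a rough sketch: collect the links separately for each div with class _12kw8n71. The function name links_per_section and the sample markup are hypothetical, and which div corresponds to which section is an assumption that should be checked against the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def links_per_section(html, base="https://www.airbnb.com"):
    """Return one list of absolute links per <div class="_12kw8n71">."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for div in soup.find_all("div", class_="_12kw8n71"):
        links = [urljoin(base, a["href"]) for a in div.find_all("a", href=True)]
        sections.append(links)
    return sections


# Hypothetical markup: two sibling divs, one per section.
sample = """
<div class="_12kw8n71">
  <a href="/experiences/20712?source=seo">A top experience</a>
</div>
<div class="_12kw8n71">
  <a href="/sitemaps/v2/experiences_pdp-L1-0">More experiences</a>
</div>
"""

for number, links in enumerate(links_per_section(sample), start=1):
    print(number, links)
```

Keeping one list per div preserves the split between the sections, which a flat search by link class would lose.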
Process:
1. Get all Top Experiences links.
2. Get all More Experiences links.
3. Send a request to each More Experiences link, one by one, and get the links under Experiences on each page.

The div under which the links are present has the same class, _12kw8n71, on all the pages.
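The process above can be sketched roughly as follows. This is an illustration rather than the answer's original code: the helpers download, section_links and crawl are hypothetical names, the order of the divs (Top Experiences first, More Experiences second, Experiences first on each sub-page) is an assumption, and the delay range is arbitrary.

```python
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"


def download(url):
    """Fetch a page and return its HTML text."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()
    return res.text


def section_links(html, page_url):
    """Return one list of absolute links per div with class _12kw8n71."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        [urljoin(page_url, a["href"]) for a in div.find_all("a", href=True)]
        for div in soup.find_all("div", class_="_12kw8n71")
    ]


def crawl(start_url=URL, fetch=download, max_delay=2.0):
    """Collect the three lists described in the process."""
    sections = section_links(fetch(start_url), start_url)
    # Assumption: the first div is Top Experiences, the second More Experiences.
    top_experiences = sections[0] if sections else []
    more_experiences = sections[1] if len(sections) > 1 else []

    generated_experiences = []
    for link in more_experiences:
        # Random pause between requests to reduce the chance of being blocked.
        time.sleep(random.uniform(0, max_delay))
        page_sections = section_links(fetch(link), link)
        # Assumption: the first matching div on each page is Experiences.
        if page_sections:
            generated_experiences.extend(page_sections[0])
    return top_experiences, more_experiences, generated_experiences
```

Calling crawl() with no arguments would hit the live site once per More Experiences link, so it is worth trying on a handful of links first. The fetch parameter exists only so the page download can be swapped out, for example with a cache.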
Notes:
- Your required links will be present in three lists: top_experiences, more_experiences and generated_experiences.
- I have added a random delay to avoid getting blocked.
- The lists are not printed, as they would be too long. Their sizes:
  top_experiences: 50 links
  more_experiences: 299 links
  generated_experiences: 14950 links

The solution is slightly tricky. It can be achieved in several ways. The one I find most useful is to use the links under More Experiences within the get_links() function recursively. All the links under More Experiences have a common keyword, _pdp-. So, when you define a conditional statement within the function that sends those links back through get_links() recursively, the else block will produce the desired links. The most important thing to notice is that all the desired links are within the class _1f0v6pq, so the logic of getting the links is fairly easy.
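A minimal sketch of that recursive idea, assuming every link of interest is an a tag with class _1f0v6pq and that only More Experiences links contain _pdp-. The download helper, the fetch parameter and the seen set are additions for illustration; the set guards against pages that link back to one already visited.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"


def download(url):
    """Fetch a page and return its HTML text."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()
    return res.text


def get_links(link, fetch=download, seen=None):
    """Yield experience links, recursing into every link containing _pdp-."""
    seen = set() if seen is None else seen
    seen.add(link)
    soup = BeautifulSoup(fetch(link), "html.parser")
    for a in soup.select("a._1f0v6pq[href]"):
        url = urljoin(link, a["href"])
        if "_pdp-" in url:
            # A More Experiences link: follow it unless already visited.
            if url not in seen:
                yield from get_links(url, fetch, seen)
        else:
            # Anything else with this class is treated as a desired link.
            yield url
```

Iterating over get_links(URL) would then print the desired links one by one. The same link can appear on several pages, so passing the results through a set may be needed to remove duplicates.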