I want to find all hyperlinks in a Spanish Wikipedia page and save them in a list, recursively: take all the links on a first Spanish Wikipedia page, follow each of them to another page, and so on, saving every link along the way.
The idea is to have an endless hyperlink-gathering tool that I can stop whenever I decide I have enough links.
So far I have the first steps: the trigger Spanish Wikipedia page and its URL, which I have crawled in search of its hyperlinks, but I do not know how to make the process recursive so that it visits each hyperlink and repeats the same steps again and again.
Here is my code:
url = "https://es.wikipedia.org/wiki/Olula_del_Río" # URL of the trigger article
webpage = requests.get(url)
html_content = webpage.content
# Parse the webpage content
soup = BeautifulSoup(html_content, "lxml") # Another parser is 'html.parser'
#print(soup.prettify())
# Extract only the tags containing the hyperlinks
urls_list = []
for url in soup.find_all('a', href=True):
url = url.get('href')
url = unquote(url) # URL encoding
urls_list.append(url)
#print(url)
Now I would like to visit each hyperlink in urls_list, repeat the same process with the hyperlinks on the corresponding page, and append them to the list.
Is there a manageable way to do this?
2 Answers
This should work: you were pretty much there, you just have to call the function recursively and check each link before the function is called on it (see the sketch below).
With a few small changes this should work on other websites too, and it should work as-is on any language edition of Wikipedia.
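A minimal sketch of that recursive idea, reusing requests and BeautifulSoup from the question (the crawl function name, the max_links cap, and the es.wikipedia.org filter are my own assumptions, not the answerer's original code):

# Sketch only: function name, stop condition and link filter are assumptions.
import sys
from urllib.parse import unquote, urljoin

import requests
from bs4 import BeautifulSoup

def crawl(url, urls_list, max_links=1000):
    """Collect article links from `url`, then recurse into each new link."""
    if len(urls_list) >= max_links:   # stop condition so the recursion can end
        return
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for a in soup.find_all("a", href=True):
        href = unquote(urljoin(url, a["href"]))
        # check the link before recursing: same wiki, not collected yet
        if "es.wikipedia.org/wiki/" in href and href not in urls_list:
            urls_list.append(href)
            crawl(href, urls_list, max_links)   # recursive call

urls_list = []
sys.setrecursionlimit(10000)   # the default limit is reached quickly
crawl("https://es.wikipedia.org/wiki/Olula_del_Río", urls_list)
print(len(urls_list), "links collected")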
Edit: be aware that my computer hit the recursion limit using this approach, so past a certain number of links it may simply not be possible to gather every single one recursively like this.
A recursive parse might work (but you will most probably hit the recursion limit). Here is an example of a parser using asyncio that runs 16 workers (configurable) to scrape Wikipedia in parallel; I was able to get ~200k links in just a few minutes. Note: you can cancel the program with Ctrl+C whenever you think you have enough links. The collected links are written to out.txt.
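The answer's original code is not shown above, so here is a rough sketch of the kind of asyncio worker-pool crawler it describes; using aiohttp for the HTTP requests, the queue/set bookkeeping, and the article-link filter are assumptions on my part:

# Sketch only: aiohttp, the link filter and the file handling are assumptions.
import asyncio
from urllib.parse import unquote, urljoin

import aiohttp
from bs4 import BeautifulSoup

START_URL = "https://es.wikipedia.org/wiki/Olula_del_Río"
N_WORKERS = 16          # number of concurrent workers (configurable)
OUT_FILE = "out.txt"

async def worker(session, queue, seen, out):
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                html = await resp.text()
        except Exception:
            queue.task_done()
            continue
        soup = BeautifulSoup(html, "lxml")
        for a in soup.find_all("a", href=True):
            href = urljoin(url, a["href"])
            # keep only Spanish Wikipedia article links, skip special pages
            if "es.wikipedia.org/wiki/" not in href or ":" in href.split("/wiki/")[-1]:
                continue
            href = unquote(href.split("#")[0])
            if href not in seen:
                seen.add(href)
                out.write(href + "\n")
                await queue.put(href)   # schedule the new page for crawling
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    seen = {START_URL}
    await queue.put(START_URL)
    with open(OUT_FILE, "w", encoding="utf-8") as out:
        async with aiohttp.ClientSession() as session:
            workers = [asyncio.create_task(worker(session, queue, seen, out))
                       for _ in range(N_WORKERS)]
            try:
                await queue.join()        # runs until the queue drains...
            finally:                      # ...or until you press Ctrl+C
                for w in workers:
                    w.cancel()

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print(f"Stopped; links collected so far are in {OUT_FILE}")

The design avoids recursion entirely: pages to visit go into an asyncio.Queue, a fixed pool of workers pulls from it, and a shared set prevents the same link from being queued twice, so there is no recursion limit to hit.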