I want to find all hyperlinks in a Spanish Wikipedia page and save them in a list, recursively: take all the links on a first Spanish Wikipedia page, follow each of them to another page, and so on, saving every link along the way.
The idea is to have an endless hyperlink-gathering tool that I can stop whenever I decide I have enough links.
So far I have the first steps: the trigger Spanish Wikipedia page and its URL, which I have crawled in search of its hyperlinks, but I do not know how to make the process recursive so that it visits each hyperlink and repeats the same steps again and again.
Here is my code:
url = "https://es.wikipedia.org/wiki/Olula_del_Río" # URL of the trigger article
webpage = requests.get(url)
html_content = webpage.content
# Parse the webpage content
soup = BeautifulSoup(html_content, "lxml") # Another parser is 'html.parser'
#print(soup.prettify())
# Extract only the tags containing the hyperlinks
urls_list = []
for url in soup.find_all('a', href=True):
url = url.get('href')
url = unquote(url) # URL encoding
urls_list.append(url)
#print(url)
Now I would like to visit each hyperlink in urls_list, repeat the same process with the hyperlinks on the corresponding page, and append them to the list.
Is there a manageable way to do this?
2 Answers
This should work: you were pretty much there, you just have to call the function recursively and check each link before the function is called on it (see the sketch below).
With a few small changes this should work on other websites too, and it should work as-is on any language edition of Wikipedia.
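A minimal sketch of that recursive idea, reusing requests and BeautifulSoup from the question (the crawl function name, the max_links cap, and the es.wikipedia.org filter are my own assumptions, not the answerer's original code):

# Sketch only: function name, stop condition and link filter are assumptions.
import sys
from urllib.parse import unquote, urljoin

import requests
from bs4 import BeautifulSoup

def crawl(url, urls_list, max_links=1000):
    """Collect article links from `url`, then recurse into each new link."""
    if len(urls_list) >= max_links:   # stop condition so the recursion can end
        return
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for a in soup.find_all("a", href=True):
        href = unquote(urljoin(url, a["href"]))
        # check the link before recursing: same wiki, not collected yet
        if "es.wikipedia.org/wiki/" in href and href not in urls_list:
            urls_list.append(href)
            crawl(href, urls_list, max_links)   # recursive call

urls_list = []
sys.setrecursionlimit(10000)   # the default limit is reached quickly
crawl("https://es.wikipedia.org/wiki/Olula_del_Río", urls_list)
print(len(urls_list), "links collected")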
Edit: be aware that my computer hit the recursion limit using this approach, so past a certain number of links it may simply not be possible to gather every single one recursively like this.
A recursive parse might work (but you will most probably hit the recursion limit). Here is an example of a parser using asyncio that runs 16 workers (configurable) to scrape Wikipedia in parallel; I was able to get ~200k links in just a few minutes. Note: you can cancel the program with Ctrl+C whenever you think you have enough links. The collected links are written to out.txt.
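The answer's original code is not shown above, so here is a rough sketch of the kind of asyncio worker-pool crawler it describes; using aiohttp for the HTTP requests, the queue/set bookkeeping, and the article-link filter are assumptions on my part:

# Sketch only: aiohttp, the link filter and the file handling are assumptions.
import asyncio
from urllib.parse import unquote, urljoin

import aiohttp
from bs4 import BeautifulSoup

START_URL = "https://es.wikipedia.org/wiki/Olula_del_Río"
N_WORKERS = 16          # number of concurrent workers (configurable)
OUT_FILE = "out.txt"

async def worker(session, queue, seen, out):
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                html = await resp.text()
        except Exception:
            queue.task_done()
            continue
        soup = BeautifulSoup(html, "lxml")
        for a in soup.find_all("a", href=True):
            href = urljoin(url, a["href"])
            # keep only Spanish Wikipedia article links, skip special pages
            if "es.wikipedia.org/wiki/" not in href or ":" in href.split("/wiki/")[-1]:
                continue
            href = unquote(href.split("#")[0])
            if href not in seen:
                seen.add(href)
                out.write(href + "\n")
                await queue.put(href)   # schedule the new page for crawling
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    seen = {START_URL}
    await queue.put(START_URL)
    with open(OUT_FILE, "w", encoding="utf-8") as out:
        async with aiohttp.ClientSession() as session:
            workers = [asyncio.create_task(worker(session, queue, seen, out))
                       for _ in range(N_WORKERS)]
            try:
                await queue.join()        # runs until the queue drains...
            finally:                      # ...or until you press Ctrl+C
                for w in workers:
                    w.cancel()

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print(f"Stopped; links collected so far are in {OUT_FILE}")

The design avoids recursion entirely: pages to visit go into an asyncio.Queue, a fixed pool of workers pulls from it, and a shared set prevents the same link from being queued twice, so there is no recursion limit to hit.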