
I want to find all hyperlinks in a Wikipedia page and save them in a list, recursively, staying within the Spanish-language Wikipedia. That is, take all the links on a first Spanish Wikipedia page, then follow each of them to another page and so on, recursively, saving every link in a list.

The idea is to have an endless hyperlink-gathering tool that I can stop whenever I decide I have enough links.

So far I have the first steps: a starting Spanish Wikipedia page whose URL I crawl for hyperlinks. What I do not know is how to make this recursive, so that it visits each hyperlink and repeats the process again and again.

Here is my code:

url = "https://es.wikipedia.org/wiki/Olula_del_Río" # URL of the trigger article 
webpage = requests.get(url)
html_content = webpage.content

# Parse the webpage content
soup = BeautifulSoup(html_content, "lxml") # Another parser is 'html.parser'
#print(soup.prettify())

# Extract only the tags containing the hyperlinks
urls_list = []
for url in soup.find_all('a', href=True):
    url = url.get('href')
    url = unquote(url) # URL encoding
    urls_list.append(url)
    #print(url)

Now I would like to visit each hyperlink in urls_list, repeat the same process on the corresponding page, and append its hyperlinks to the list.

Is there a manageable way to do this?

2 Answers


  1. This should work:

    from bs4 import BeautifulSoup
    import requests
    
    urls = []
    def scrape_links(website):
        #Get Site data and make it into a BS4 object
        webpage = requests.get(website)
        html_content = webpage.content
        soup = BeautifulSoup(html_content, "lxml")
    
        for tag in soup.find_all('a', href=True): #For all <a> tags with href attributes on the current page
            print(tag) #print tag to check tag is correct
            href = tag.get("href") #Get the href of the a tag
            if href.startswith("/"): #If it starts with / then it must be a link to wikipedia
                href = "https://es.wikipedia.org" + href
            if href.startswith("#"): #If it starts with # (HTML class link) then skip
                continue #Skip this link
            print(href) #Print the link to check
            if href not in urls: #if link has not already been seen
                urls.append(href) #add url to the list
                scrape_links(href) #recursive function call
    
    scrape_links("https://es.wikipedia.org/wiki/Olula_del_Río")
    

    You were pretty much there; you just had to call the function recursively and check each link before the function is called on it.

    This should work on other websites with a few small changes, and it should work as-is on any language edition of Wikipedia.

    Edit: be aware that my computer hit Python's recursion limit with this function, so past a certain number of links it may simply not be possible to gather them all recursively like this. An iterative sketch that avoids the limit follows below.

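    One way around the recursion limit is to keep the same logic but drive it with an explicit queue instead of recursive calls. Below is a minimal iterative sketch of that idea; the function name scrape_links_iterative and the max_links cutoff are my own additions, and urljoin is used so relative /wiki/... links resolve against whichever language edition you start from.

    from collections import deque
    from urllib.parse import urljoin, unquote
    
    import requests
    from bs4 import BeautifulSoup
    
    def scrape_links_iterative(seed, max_links=1000):
        """Breadth-first crawl: the same idea as the recursive version,
        but an explicit queue means Python's recursion limit never applies."""
        found = []             # links in the order they were discovered
        seen = {seed}          # everything already queued or visited
        queue = deque([seed])
    
        while queue and len(found) < max_links:
            page = queue.popleft()
            soup = BeautifulSoup(requests.get(page).content, "lxml")
    
            for tag in soup.find_all("a", href=True):
                href = tag["href"]
                if href.startswith("#"):    # same-page anchor, skip
                    continue
                href = urljoin(page, href)  # resolve /wiki/... against the current page
                if "wikipedia.org" not in href or "/wiki/" not in href:
                    continue                # keep only Wikipedia article links
                if href not in seen:
                    seen.add(href)
                    found.append(unquote(href))
                    queue.append(href)
    
        return found
    
    links = scrape_links_iterative("https://es.wikipedia.org/wiki/Olula_del_Río", max_links=200)
    print(f"Collected {len(links)} links")

    It can overshoot max_links slightly because it always finishes the page it is on, but for a rough "stop when I have enough" cutoff that is usually fine.
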
  2. A recursive parse might work (though you will most probably hit the recursion limit). Here is an example of a parser using asyncio with 16 workers (configurable) to scrape Wikipedia in parallel.

    I was able to get ~200k links in just a few minutes (stored in out.txt).

    Note: you can cancel the program using Ctrl+C when you think you have enough links.

    import asyncio
    import aiohttp
    import urllib.parse
    from bs4 import BeautifulSoup
    
    main_queue = asyncio.Queue()
    parsed_links_queue = asyncio.Queue()
    parsed_links = set()
    
    session = None
    f_out = None
    visited_urls = 0
    
    async def get_url(url):
        global visited_urls
        try:
            async with session.get(url) as resp:
                visited_urls += 1
                return await resp.text()
        except:
            print(f'Bad URL: {url}')
    
    async def worker():
        while True:
            url = await main_queue.get()
            html = await get_url(url)

            if html:  # get_url returns None if the request failed
                soup = BeautifulSoup(html, 'html.parser')

                for a in soup.select('a[href]'):
                    href = a['href']
                    if href.startswith('/wiki/') and ':' not in href:  # keep article links, skip File:, Categoría:, etc.
                        parsed_links_queue.put_nowait('https://es.wikipedia.org' + href)

            main_queue.task_done()
    
    async def consumer():
        while True:
            url = await parsed_links_queue.get()
    
            if url not in parsed_links:
                print(urllib.parse.unquote(url), file=f_out, flush=True)  # <-- print the url to file
                parsed_links.add(url)
                main_queue.put_nowait(url)
    
            parsed_links_queue.task_done()
    
    
    async def main():
        global session, f_out
    
        seed_url = 'https://es.wikipedia.org/wiki/Olula_del_R%C3%ADo'
        parsed_links.add(seed_url)
    
        with open('out.txt', 'w') as f_out:
            async with aiohttp.ClientSession() as session:
                workers = {asyncio.create_task(worker()) for _ in range(16)}
                c = asyncio.create_task(consumer())
    
                main_queue.put_nowait(seed_url)
                print('Initializing...')
                await asyncio.sleep(5)
    
                while main_queue.qsize():
                    print(f'Visited URLs: {visited_urls:>7}  Known URLs (saved in out.txt): {len(parsed_links):>7}', flush=True)
                    await asyncio.sleep(0.1)
    
        await main_queue.join()
        await parsed_links_queue.join()
    
    asyncio.run(main())
    

    This creates out.txt with the following content:

    
    ...
    
    https://es.wikipedia.org/wiki/Eduardo_Asquerino
    https://es.wikipedia.org/wiki/Francisco_Luis_de_Retes
    https://es.wikipedia.org/wiki/Francisco_Pérez_Echevarría
    https://es.wikipedia.org/wiki/Pedro_Marquina
    https://es.wikipedia.org/wiki/Pedro_I_el_Cruel_(serie_de_televisión)
    https://es.wikipedia.org/wiki/Ramón_Madaula
    https://es.wikipedia.org/wiki/Pedro_Fernández_de_Castro
    https://es.wikipedia.org/wiki/Cañete
    https://es.wikipedia.org/wiki/Luis_Vicente_Díaz_Martín
    https://es.wikipedia.org/wiki/José_María_Montoto_López_Vigil
    https://es.wikipedia.org/wiki/José_Velázquez_y_Sánchez
    https://es.wikipedia.org/wiki/Revista_de_Ciencias,_Literatura_y_Artes_(1855-1860)
    
    ...
    
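    Since the original question asked for the links in a Python list, here is a quick way to load them back from the out.txt produced above (just an add-on for convenience, not part of the crawler itself):

    with open('out.txt') as f:
        urls_list = [line.strip() for line in f if line.strip()]  # one URL per non-empty line
    
    print(len(urls_list), urls_list[:3])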