Suppose I want to scrape the data (marathon running times) from this website: https://www.valenciaciudaddelrunning.com/en/marathon/2021-marathon-ranking/

In Google Chrome, when I right-click and select ‘Inspect’ or ‘View page source’, I don’t see the actual data embedded in the page source (e.g. I can see the athletes’ names, the split times, etc. in the browser, but the source code doesn’t contain any of those). I have scraped other websites where the data I need is embedded inside the HTML tags, and using the requests and bs4 packages in Python I managed to extract the data I wanted. For the Valencia marathon URL posted above, is it possible to do web scraping, and if so, how?

From a quick Google search, it looks like some webpages are loaded dynamically with JavaScript (correct me if I’m wrong). Is that the case when a website appears to be interactive, or when I don’t see the browser output in the source code? Is a package like selenium useful for the above Valencia marathon URL? I know basically nothing about how websites are rendered, so if someone can direct me to some useful resources, that would be great.

2 Answers


  1. There’s an <iframe> in the page, so the data you see in the browser is loaded from a different URL:

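    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # URL of the <iframe> that actually serves the results
    url = "https://resultados.valenciaciudaddelrunning.com/en/2021/maraton.php?y=2021"

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on an HTTP error
    soup = BeautifulSoup(response.content, "html.parser")

    # Select the podium table by its id
    table = soup.select_one("#tabPodium")

    # Wrap the HTML in StringIO: newer pandas versions deprecate literal HTML strings
    df = pd.read_html(StringIO(str(table)))[0]
    print(df)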
    

    Prints:

       Unnamed: 0                Name Official time Country
    0           1   CHERONO, LAWRENCE       2:05:12     KEN
    1           2         DESO, CHALU       2:05:16     ETH
    2           3  KACHERAN, PHILEMON       2:05:19     KEN
    
  2. Generally:

    • Check whether the data you want to extract is inside an iframe. If it is, crawl the URL of the iframe rather than the URL of the page that embeds it.

    • You can disable JavaScript in your browser settings to find out quickly whether the data reaches you in the traditional server-side way or is downloaded dynamically by JavaScript. In the latter case, the easy way is to open the Network tab of the browser’s developer tools and look for the Fetch/XHR request that returns the data you are interested in.

    In your question the information you need is in an iframe so you need to crawl that instead.
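
If you would rather find the iframe URL from code than by reading the page source, something along these lines might work. It is only a minimal sketch: it uses the standard-library html.parser so it runs without extra installs, and the sample markup, the example.com URLs, and the names IframeFinder and find_iframe_urls are all made up for illustration. For a real page you would pass in requests.get(page_url).text instead of the sample string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the URL of the parent page.
                self.urls.append(urljoin(self.base_url, src))


def find_iframe_urls(html, base_url):
    """Return the absolute URLs of all iframes found in the given HTML."""
    finder = IframeFinder(base_url)
    finder.feed(html)
    return finder.urls


# Made-up parent-page markup, for illustration only.
sample_html = """
<html><body>
  <h1>Ranking</h1>
  <iframe src="https://results.example.com/2021/table.php?y=2021"></iframe>
  <iframe src="/widgets/map.html"></iframe>
</body></html>
"""

iframe_urls = find_iframe_urls(sample_html, "https://www.example.com/en/ranking/")
print(iframe_urls)
```

Each URL it returns is a candidate to request directly and parse, the same way as the page that includes it. With bs4 the lookup itself is shorter: soup.find_all("iframe") and then read each tag's src.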
