Suppose I want to scrape the data (marathon running times) from this website: https://www.valenciaciudaddelrunning.com/en/marathon/2021-marathon-ranking/
In Google Chrome, when I right-click and select ‘Inspect’ or ‘View page source’, I don’t see the actual data embedded in the page source (e.g. I can see the athletes’ names, the split times, etc. in the browser, but the source code doesn’t contain any of them). I have scraped other websites where the data I need is embedded in the page source, and using the requests and bs4 packages in Python I managed to extract the data I wanted from them. For the Valencia marathon URL posted above, is it possible to do web scraping, and if so, how?
From a quick Google search, it looks like some webpages are loaded dynamically with JavaScript (correct me if I’m wrong). Is that the case when a website appears to be interactive, or when I don’t see the browser output in the source code? Is a package like selenium useful for the Valencia marathon URL above? I know basically nothing about how websites are rendered, so if someone can direct me to some useful resources, that would be great.
2 Answers
There’s an <iframe> in the page, so the data you see in the browser is loaded from a different URL.
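A minimal sketch of how the iframe URL could be located, assuming the iframe is present in the HTML the server returns and has a src attribute. It uses only the Python standard library so it runs without extra packages; the same idea works with requests and bs4. The names IframeCollector, find_iframe_urls and fetch_html are made up for this illustration, and the User-Agent header is an assumption in case the server rejects the default one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_urls(html, base_url):
    """Return absolute URLs for all iframes found in the given HTML."""
    collector = IframeCollector()
    collector.feed(html)
    # Relative src values are resolved against the URL of the embedding page.
    return [urljoin(base_url, src) for src in collector.sources]


def fetch_html(url):
    """Download a page as text, without executing any JavaScript."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (needs network access):
# page_url = "https://www.valenciaciudaddelrunning.com/en/marathon/2021-marathon-ranking/"
# print(find_iframe_urls(fetch_html(page_url), page_url))
```

Each URL this returns can then be fetched and parsed like any other page. If the list comes back empty, the iframe is probably added by JavaScript and would have to be found through the browser's developer tools instead.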
Generally:
Make sure the data you want to extract is not in an iframe. If it is, crawl the URL of the iframe rather than the URL of the page that includes it.
In your browser settings you can disable JavaScript to find out easily whether the data reaches you in the traditional server-side way or is downloaded dynamically by JavaScript. In the latter case, the easy way is to open the Network tab of the browser’s developer tools and look for the Fetch/XHR request that carries the data you are interested in.
In your question, the information you need is in an iframe, so you need to crawl that instead.
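The checks above can also be roughly approximated in code: download the HTML without running any JavaScript and see whether a value you can see in the browser is already in it. This is only a heuristic sketch. The function names (fetch_raw_html, normalize, where_to_look) and the three return labels are invented for illustration, and the User-Agent header is an assumption.

```python
import re
from urllib.request import Request, urlopen


def fetch_raw_html(url):
    """Download the HTML exactly as the server sends it (no JavaScript is run)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def normalize(text):
    """Collapse whitespace and ignore case so layout differences don't matter."""
    return " ".join(text.split()).casefold()


def where_to_look(html, visible_text):
    """Rough heuristic: suggest where data seen in the browser comes from."""
    if normalize(visible_text) in normalize(html):
        return "page source"  # parsing the raw HTML should be enough
    if re.search(r"<iframe\b", html, flags=re.IGNORECASE):
        return "iframe"       # try crawling the iframe's own URL
    return "fetch/xhr"        # look for the request in the Network tab


# Example usage (needs network access; replace the text with a value you see on the page):
# html = fetch_raw_html("https://www.valenciaciudaddelrunning.com/en/marathon/2021-marathon-ranking/")
# print(where_to_look(html, "some athlete name"))
```

A result of "iframe" only means the page contains an iframe, not that the data is definitely inside it, so it is still worth confirming in the browser.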