From the following site: https://www.ecb.europa.eu/press/pr/date/html/index.en.html
I am trying to write a program that, given a time period, returns the URLs of the corresponding pages. The problem is that I cannot access the dt and dd elements, because BeautifulSoup does not seem to descend into the dl element.
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/pr/date/html/index.en.html"
response = requests.get(url)
# Create a BeautifulSoup object from the response content
soup = BeautifulSoup(response.content, "html.parser")
# get access to only the main part of the site
main_wrapper_div = soup.find('div', {'id': 'main-wrapper'})
section_div = main_wrapper_div.find('dl', {'id': 'lazyload-container'})
I then want to search within section_div to get the dates, but section_div only contains the dl tag itself and none of its children (see picture). I can only reach the element marked by the first arrow, not the one marked by the second.
2 Answers
From what I can see, the site uses JavaScript to load these elements. That is why a plain request made with requests only returns the empty dl element: the dt and dd entries are added afterwards by scripts, which requests does not execute, so you would need a browser to render the page.
You can use a tool such as Selenium to let the browser do the rendering for you. Here's a brief introduction, and if you want to look further into it, here's a guide.
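As a rough sketch of the Selenium route, the function below opens the page in headless Chrome, waits until at least one dt shows up inside the dl, and returns the rendered HTML. The name fetch_rendered_html, the WAIT_SELECTOR constant, and the timeout value are made up for illustration, and it assumes a recent Chrome and Selenium 4 are installed. The page may also load older entries only on scroll, so this may not return everything.

```python
# Assumed selector: rendered entries appear as dt children of the dl from the question.
WAIT_SELECTOR = "dl#lazyload-container dt"


def fetch_rendered_html(url, timeout=20):
    """Return the page HTML after the browser has run the page's JavaScript."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one dt exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, WAIT_SELECTOR))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can be passed to BeautifulSoup(html, "html.parser") in place of response.content from the question.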
Some of the website's content is loaded via separate calls, so you have to request those directly to get the information when using requests. Take a closer look at your browser's dev tools, in the Network section under the XHR tab. The URLs look like: https://www.ecb.europa.eu/press/pr/date/2023/html/index_include.en.html
So you could replace the year in the URL while iterating over all years or only specific ones. Select the <dd> elements and, for each one, find the previous <dt> to get the matching date.
Example
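A minimal sketch of this approach, under two assumptions about the markup that are not confirmed here: each <dd> contains a link to the press release, and the preceding <dt> holds a date written like 14 December 2023. The helper names parse_releases and releases_between are hypothetical.

```python
from datetime import date, datetime
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.ecb.europa.eu"
# URL pattern taken from the XHR call described above; only the year changes.
INCLUDE_URL = BASE_URL + "/press/pr/date/{year}/html/index_include.en.html"


def parse_releases(html):
    """Return (date_text, absolute_url) pairs from one include fragment."""
    soup = BeautifulSoup(html, "html.parser")
    releases = []
    for dd in soup.find_all("dd"):
        dt = dd.find_previous("dt")      # the date heading this entry belongs to
        link = dd.find("a", href=True)   # assumed: first link is the release itself
        if dt is None or link is None:
            continue
        releases.append((dt.get_text(strip=True), urljoin(BASE_URL, link["href"])))
    return releases


def releases_between(start, end):
    """Fetch every year in the range and keep releases dated start..end."""
    found = []
    for year in range(start.year, end.year + 1):
        response = requests.get(INCLUDE_URL.format(year=year), timeout=30)
        response.raise_for_status()
        for date_text, url in parse_releases(response.text):
            try:
                # Assumed date format, e.g. "14 December 2023".
                day = datetime.strptime(date_text, "%d %B %Y").date()
            except ValueError:
                continue  # skip entries that do not match the assumed format
            if start <= day <= end:
                found.append((day, url))
    return found
```

For example, releases_between(date(2023, 1, 1), date(2023, 3, 31)) would request the 2023 fragment and return the (date, url) pairs that fall in the first quarter, provided the assumptions above hold.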
Output