
From the following site: https://www.ecb.europa.eu/press/pr/date/html/index.en.html
I am trying to create a program that, given a certain time period, returns the URLs of the corresponding pages. The issue is that I cannot access the dt and dd elements: BeautifulSoup seems unable to get inside the dl element.

from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/pr/date/html/index.en.html"
response = requests.get(url)

# Create a BeautifulSoup object from the response content

soup = BeautifulSoup(response.content, "html.parser")

# get access to only the main part of the site

main_wrapper_div = soup.find('div', {'id': 'main-wrapper'})

section_div = main_wrapper_div.find('dl', {'id': 'lazyload-container'})

I then want to search within section_div to get at the dates, but section_div only contains the bare dl element itself, not its children, so I cannot reach the dt and dd entries nested inside it.

2 Answers


  1. From what I can see, the site uses JavaScript to load those elements, which is why a plain requests call only returns the empty dl element; you need a browser to render its contents for you.

    You can use a tool such as Selenium to let a browser do the rendering. Here is a brief example to get you started; the Selenium documentation has a fuller guide.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from webdriver_manager.chrome import ChromeDriverManager
    
    url = "https://www.ecb.europa.eu/press/pr/date/html/index.en.html"
    
    # webdriver_manager downloads a matching chromedriver automatically
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    wait = WebDriverWait(driver, 10)
    
    driver.get(url)
    wait.until(EC.url_to_be(url))  # wait until navigation has completed
    
    # page_source now contains the JavaScript-rendered HTML
    page_source = driver.page_source
    print(page_source)
    
    driver.quit()
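    Once the rendered HTML is in hand, the original BeautifulSoup approach works again. A minimal sketch of walking the dt/dd pairs, using a hard-coded snippet in place of page_source (the snippet's structure is an assumption based on the question; the real rendered markup may differ):

    ```python
    from bs4 import BeautifulSoup

    # stand-in for driver.page_source: roughly what the rendered dl looks like
    # (structure assumed from the question; real attributes may differ)
    rendered = """
    <dl id="lazyload-container">
      <dt>7 March 2023</dt>
      <dd><a href="/press/pr/date/2023/html/ecb.pr230307~938c254bd8.en.html">ECB Consumer Expectations Survey results</a></dd>
    </dl>
    """

    soup = BeautifulSoup(rendered, "html.parser")
    dl = soup.find("dl", {"id": "lazyload-container"})

    # pair each date (<dt>) with the link in the following <dd>
    entries = []
    for dt, dd in zip(dl.find_all("dt"), dl.find_all("dd")):
        entries.append((dt.text, dd.a["href"]))
    print(entries)
    ```

    In the real script you would pass page_source instead of the hard-coded string.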
    
  2. Some of the site's content is loaded via separate XHR calls, so when using requests you have to hit those endpoints directly. Take a closer look in your browser's dev tools, in the Network section under the XHR tab.

    The URLs look like: https://www.ecb.europa.eu/press/pr/date/2023/html/index_include.en.html

    So you can substitute the year while iterating over all years, or specific ones. Select the <dd> elements and find each one's preceding <dt> to get the date.

    Example
    from bs4 import BeautifulSoup
    import requests
    
    url = 'https://www.ecb.europa.eu/press/pr/date/2023/html/index_include.en.html'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    
    data = []
    
    # each <dd> holds the link; the preceding <dt> holds the date
    for e in soup.select('dd'):
        data.append({
            'date': e.find_previous('dt').text,
            'title': e.a.text,
            'url': 'https://www.ecb.europa.eu' + e.a.get('href')
        })
    data
    
    Output
    [{'date': '7 March 2023',
      'title': 'ECB Consumer Expectations Survey results – January 2023',
      'url': 'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230307~938c254bd8.en.html'},
     {'date': '23 February 2023',
      'title': 'Financial statements of the ECB for 2022',
      'url': 'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230223~398b74f1dc.en.html'},
     {'date': '23 February 2023',
      'title': 'Annual Accounts 2022',
      'url': 'https://www.ecb.europa.eu/pub/annual/annual-accounts/html/ecb.annualaccounts2022~ee9329bf6f.en.html'},
     {'date': '23 February 2023',
      'title': 'Consolidated balance sheet of the Eurosystem as at 31 December 2022',
      'url': 'https://www.ecb.europa.eu/pub/annual/balance/html/ecb.eurosystembalancesheet2022~4a2e481250.en.html'},...]
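    To get back to the question's original goal of filtering by a time period, the per-year records can then be filtered by parsing the date strings. A sketch, assuming the dates always follow the "7 March 2023" format shown in the output above (two records from that output serve as sample data):

    ```python
    from datetime import date, datetime

    # two records taken from the output above
    data = [
        {'date': '7 March 2023',
         'url': 'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230307~938c254bd8.en.html'},
        {'date': '23 February 2023',
         'url': 'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230223~398b74f1dc.en.html'},
    ]

    def in_period(record, start, end):
        # the page prints dates like "7 March 2023"
        d = datetime.strptime(record['date'], '%d %B %Y').date()
        return start <= d <= end

    start, end = date(2023, 2, 1), date(2023, 2, 28)

    # the yearly index pages to fetch for this period would be:
    # [f"https://www.ecb.europa.eu/press/pr/date/{y}/html/index_include.en.html"
    #  for y in range(start.year, end.year + 1)]

    selected = [r['url'] for r in data if in_period(r, start, end)]
    print(selected)
    ```

    Combined with the year loop sketched in the comment, this covers the "certain time period" requirement from the question.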
    