How to access elements in a dl with beautiful soup? - Html

NerloFota
March 10, 2023
221 views
0 votes
2 Answers

From the following site: https://www.ecb.europa.eu/press/pr/date/html/index.en.html
I am trying to write a program that, for a given time period, returns the URLs of the corresponding pages. The issue is that I cannot access the dt and dd elements, because BeautifulSoup does not seem to see anything inside the dl element.

from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/pr/date/html/index.en.html"
response = requests.get(url)

# Create a BeautifulSoup object from the response content

soup = BeautifulSoup(response.content, "html.parser")

# get access to only the main part of the site

main_wrapper_div = soup.find('div', {'id': 'main-wrapper'})

section_div = main_wrapper_div.find('dl', {'id': 'lazyload-container'})

I then want to search within section_div to get the dates, but section_div only contains the dl tag itself and none of the elements inside it. See picture: I can only access the element at the first arrow, not the one at the second.

Answers

- NLion74
- March 10, 2023 at 4:56 pm
- 0 votes
From what I can see, the site uses JavaScript to load the individual elements. That is why a request made with requests only returns the empty dl element: you would need a browser to run the scripts and render the content.
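One quick way to confirm this is to check whether the dl in the downloaded HTML has any dt or dd descendants at all. The sketch below uses a hand-written stand-in for the static response (the sample markup is an assumption, not the site's real HTML), so it runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what requests receives: the dl is present but
# empty, because the entries are added later by JavaScript.
static_html = """
<div id="main-wrapper">
  <dl id="lazyload-container"></dl>
</div>
"""

soup = BeautifulSoup(static_html, "html.parser")
container = soup.find("dl", {"id": "lazyload-container"})

# find_all accepts a list of tag names and returns every matching descendant.
children = container.find_all(["dt", "dd"])
print(len(children))
```

If the same check against `response.content` also finds no dt or dd elements, the entries are simply not in the HTML that requests receives.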

But you can use tools such as Selenium to let a browser do the rendering for you. Here is a brief introduction; if you want to look further into it, a fuller guide is worth reading.
```
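from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.ecb.europa.eu/press/pr/date/html/index.en.html"

# Start Chrome with a driver binary managed by webdriver_manager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

try:
    driver.get(url)

    # Wait up to 10 seconds until JavaScript has added at least one dt to the dl
    # (assumes the entries appear inside the dl with id "lazyload-container")
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#lazyload-container dt")))

    # The page source now includes the rendered dt and dd elements
    page_source = driver.page_source
    print(page_source)
finally:
    # Always close the browser, even if the wait times out
    driver.quit()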
```
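Once the browser has rendered the page, the page source can be handed to BeautifulSoup just like the response from requests. A minimal sketch, assuming the rendered dl contains dt elements for the dates and dd elements with a link for each release; the `extract_links` helper and the `rendered_html` sample are made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_links(page_source):
    """Return the href of every link found in a dd inside the lazyload dl."""
    soup = BeautifulSoup(page_source, "html.parser")
    container = soup.find("dl", {"id": "lazyload-container"})
    if container is None:
        return []
    return [a["href"] for a in container.select("dd a[href]")]


# Hypothetical sample standing in for driver.page_source
rendered_html = """
<dl id="lazyload-container">
  <dt>7 March 2023</dt>
  <dd><a href="/press/pr/date/2023/html/example1.en.html">First release</a></dd>
  <dt>23 February 2023</dt>
  <dd><a href="/press/pr/date/2023/html/example2.en.html">Second release</a></dd>
</dl>
"""

print(extract_links(rendered_html))
```

With Selenium, the call would be `extract_links(driver.page_source)`.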

Some of the website's content is loaded via separate calls, so you have to request those directly to get the information when using requests. Take a closer look at your browser's dev tools, in the Network section under the XHR tab.

The URLs look like: https://www.ecb.europa.eu/press/pr/date/2023/html/index_include.en.html

So you could replace the year while iterating over all years or only specific ones. Select the <dd> elements and, for each one, find the previous <dt> to get its date.

Example

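from bs4 import BeautifulSoup
import requests

url = 'https://www.ecb.europa.eu/press/pr/date/2023/html/index_include.en.html'
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []

# Each <dd> holds one release; its date is in the closest preceding <dt>
for e in soup.select('dd'):
    data.append({
        'date': e.find_previous('dt').text,
        'title': e.a.text,
        'url': 'https://www.ecb.europa.eu' + e.a.get('href')
    })
data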

Output

[{'date': '7 March 2023',
  'title': 'ECB Consumer Expectations Survey results – January 2023',
  'url': 'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230307~938c254bd8.en.html'},
 {'date': '23 February 2023',
  'title': 'Financial statements of the ECB for 2022',
  'url': 'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230223~398b74f1dc.en.html'},
 {'date': '23 February 2023',
  'title': 'Annual Accounts 2022',
  'url': 'https://www.ecb.europa.eu/pub/annual/annual-accounts/html/ecb.annualaccounts2022~ee9329bf6f.en.html'},
 {'date': '23 February 2023',
  'title': 'Consolidated balance sheet of the Eurosystem as at 31 December 2022',
  'url': 'https://www.ecb.europa.eu/pub/annual/balance/html/ecb.eurosystembalancesheet2022~4a2e481250.en.html'},...]
