How to find and extract a timestamp from an HTML website with Python/BeautifulSoup?

Craven
September 18, 2023
365 views
3 votes
3 Answers

I’m trying to write a bot that sends an email when there is a warning from the German Weather Service (Deutscher Wetterdienst, DWD). The bot will run in Python on my Raspberry Pi.

I want to extract some information from the DWD Website for, let’s say, Berlin. The URL could be
https://www.dwd.de/DE/wetter/warnungen_gemeinden/warnWetter_node.html?ort=Berlin-Mitte

First I want to extract the latest timestamp (screenshot: https://phpout.com/wp-content/uploads/2023/09/Ge224-jpg.webp). When I inspect the HTML of this page, I find the element with id="headerBox" that holds the required timestamp (screenshot: https://phpout.com/wp-content/uploads/2023/09/spMQm-jpg.webp).

Unfortunately, the date and time are missing when I fetch the HTML with Python. Here is my code:

import requests
from bs4 import BeautifulSoup
url = "https://www.dwd.de/DE/wetter/warnungen_gemeinden/warnWetter_node.html?ort=Berlin-Mitte"
r = requests.get(url)
doc = BeautifulSoup(r.text, "html.parser")
doctext = doc.get_text()
print(doctext)

The result is always just "Letzte Aktualisierung: " and an "empty" line, even when I try last_date = doc.find(id="headerBox").

I am using the PyCharm IDE (community edition) and Python 3.11.

Any hints or ideas where to look are appreciated.

Best regards,
Christian

Answers

Chosen as BEST ANSWER
- Craven
- September 6, 2023 at 10:27 pm
- 0 votes
Thank you very much for your reply, it helped a lot. My solution is close to what you suggested. I installed the Chromium driver from the command line with sudo apt-get install chromium-chromedriver and added the executable to the PATH environment variable, as shown in 1.

Since Selenium has been updated, the arguments have changed a bit, as explained in 2.

The full code is now
```
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Point Selenium at the chromedriver installed above
service = Service(executable_path='/usr/lib/chromium-browser/chromedriver')

# Run Chromium without opening a window
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(service=service, options=options)

url = "https://www.dwd.de/DE/wetter/warnungen_gemeinden/warnWetter_node.html?ort=Berlin-Mitte"
driver.get(url)

# Let element lookups retry for up to 10 seconds
driver.implicitly_wait(10)
timestamp_element = driver.find_element(By.ID, "headerBox")
timestamp_text = timestamp_element.text
print(timestamp_text)
driver.quit()
```


- Corboss
- September 6, 2023 at 5:03 pm
- 0 votes
The issue you’re encountering might be related to the way the page is loaded or structured. Some websites use JavaScript to load dynamic content, and when you use requests to fetch the HTML content, you might not get the dynamically generated content.

To extract information from websites with dynamic content, you can use a headless browser automation tool like Selenium, which can interact with the webpage and retrieve the content after it’s fully loaded. Here’s how you can modify your script to use Selenium to extract the timestamp:

First, you need to install Selenium. You can do this using pip:
```
pip install selenium
```
Now, you can modify your script:
```
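from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Set up the Selenium web driver (you will need a chromedriver that matches your browser)
# For example, for Chrome, you can download the chromedriver: https://sites.google.com/chromium.org/driver/
# Recent Selenium 4 releases no longer accept executable_path on webdriver.Chrome; pass a Service instead
service = Service(executable_path='/path/to/chromedriver')
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # run without opening a browser window
driver = webdriver.Chrome(service=service, options=options)

url = "https://www.dwd.de/DE/wetter/warnungen_gemeinden/warnWetter_node.html?ort=Berlin-Mitte"
driver.get(url)

# Let element lookups retry for up to 10 seconds (you may need to adjust the timeout)
driver.implicitly_wait(10)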

# Find the element with the timestamp by its ID
timestamp_element = driver.find_element(By.ID, "headerBox")

# Extract the timestamp text
timestamp_text = timestamp_element.text

# Close the web driver
driver.quit()

# Print the timestamp
print("Timestamp:", timestamp_text)
```
Make sure to replace /path/to/chromedriver with the actual path to the Chrome WebDriver executable on your Raspberry Pi.

This script will open the webpage in a headless browser, wait for it to load, find the element with the timestamp by its ID, and then extract and print the timestamp.

- Produdez
- September 6, 2023 at 5:26 pm
- 0 votes
I think you could look at the requests the page sends to the server to fetch the warnings.
When inspecting the network traffic in the browser's developer tools, I saw a GET request to this URL:
```
https://www.dwd.de/DWD/warnungen/warnapp_gemeinden/json/warnings_gemeinde.json?jsonp=loadWarnings......
```
The response was
```
warnWetter.loadWarnings({
    "time": 1694012878000,
    "warnings": [],
    "copyright": "..."
});
```
The "time" value is a Unix timestamp in milliseconds. It corresponds to the exact time of the last update (as I checked) and does not change when the website is reloaded (unless the last update itself changes).

You can format it in JavaScript, for example in the browser console, and check for yourself. Hope this helps. Cheers.
```
new Date(1694012878000).toString()
```
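Since the bot in the question is written in Python, the same check can be sketched there. The snippet below is a minimal sketch that assumes the response has the JSONP shape shown above: it strips the callback wrapper, decodes the JSON, and converts the millisecond timestamp to a timezone-aware datetime. The function names parse_jsonp and last_update are made up for illustration, and fetching is left out because the request URL above is truncated.

```python
import json
import re
from datetime import datetime, timezone

# Matches a JSONP wrapper such as `warnWetter.loadWarnings({...});`
# and captures everything between the outermost parentheses.
_JSONP_RE = re.compile(r"^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text: str) -> dict:
    """Strip the JSONP callback wrapper and decode the JSON payload."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


def last_update(payload: dict) -> datetime:
    """Convert the millisecond epoch in payload["time"] to a UTC datetime."""
    return datetime.fromtimestamp(payload["time"] / 1000, tz=timezone.utc)


# Sample response copied from the answer above (not fetched live).
sample = '''warnWetter.loadWarnings({
    "time": 1694012878000,
    "warnings": [],
    "copyright": "..."
});'''

payload = parse_jsonp(sample)
print(last_update(payload).isoformat())  # prints 2023-09-06T15:07:58+00:00
```

The result is in UTC; calling astimezone() on it without arguments converts it to the local time zone of the machine, which on a Raspberry Pi set to German time would match what the website displays. In a real bot, the text passed to parse_jsonp would come from an HTTP request to the full URL seen in the network tab.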