I am trying to extract the date from an HTML node using R. This script used to work fine, but I think the webpage has changed and now it returns N/A.
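library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
# Download the static HTML of the page
webpage = read_html('https://www.longpaddock.qld.gov.au/aussiegrass')
# Select the node that holds the "last updated" date
results = webpage %>% html_nodes("#graph-out-last-updated")
webpage_date_chr = html_text(results)
webpage_date_chr ## This should print the date!!!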
The image shows the node and the date, but I cannot extract the date.
Any help would be amazing!!
Cheers
2 Answers
It looks like the page is a dynamic site: the date is filled in by JavaScript after the page loads, so the page needs to be rendered in a browser before the data can be scraped. You’ll need to use a headless browser, like the one used in the chromote package. I’ve edited your code to something that returned the date "21 June 2023" for me.
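The edited code itself is not shown here, so what follows is only a minimal sketch of the headless-browser approach, assuming the chromote and rvest packages and a local Chrome or Chromium install. The helper name `get_rendered_text` and the fixed `wait` delay are made up for illustration; a fixed sleep is a crude way to let the page's scripts finish and may need tuning.

```r
# Hypothetical helper: render a page in headless Chrome, then read one node.
get_rendered_text <- function(url, css, wait = 5) {
  # Start a headless Chrome session and make sure it is closed afterwards
  b <- chromote::ChromoteSession$new()
  on.exit(b$close(), add = TRUE)

  # Open the page, then pause so its JavaScript can fill in the content
  # (the delay is an assumption, not a guarantee that rendering is done)
  b$Page$navigate(url)
  Sys.sleep(wait)

  # Grab the rendered DOM as a string and hand it to rvest as usual
  html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
  page <- rvest::read_html(html)
  node <- rvest::html_element(page, css)
  rvest::html_text2(node)
}
```

Calling `get_rendered_text("https://www.longpaddock.qld.gov.au/aussiegrass", "#graph-out-last-updated")` should then return the text of that node, provided the element id from the question is still in use.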
For more info, read section 25.7 of R4DS here: https://r4ds.hadley.nz/webscraping.html
Those details are sourced from a single JSON file and, guessing from the URL, that endpoint should be fairly stable:
https://www.longpaddock.qld.gov.au/data/aussie-grass-graphs.json
Structure itself is nothing special:
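One way to look at the nesting yourself is sketched below, assuming the jsonlite package is installed. The helper name `peek_json` is made up for illustration; `jsonlite::fromJSON()` accepts a URL, a file path or a JSON string, so the same helper works on the endpoint above.

```r
# Hypothetical helper: parse JSON and print only its top levels.
peek_json <- function(src, depth = 2) {
  # simplifyVector = FALSE keeps plain nested lists instead of letting
  # jsonlite guess at vectors and data frames
  x <- jsonlite::fromJSON(src, simplifyVector = FALSE)
  # Show a truncated overview; the full structure may be long
  str(x, max.level = depth, list.len = 5)
  invisible(x)
}
```

For example, `graphs <- peek_json("https://www.longpaddock.qld.gov.au/data/aussie-grass-graphs.json")` prints the overview and keeps the parsed list for further work.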
Yet it is not ideal for automatic rectangling to a data.frame; here’s one example of how one might approach this with purrr, tibble and tidyr:
Extract file sizes and dates for a single Shire / LGA:
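As a rough illustration of the rectangling pattern, the sketch below uses a small invented list in place of the real JSON. The layout (a list keyed by LGA name, each holding file records) and the field names `file`, `size` and `date` are assumptions for the example and may not match the actual endpoint.

```r
library(purrr)
library(tibble)
library(tidyr)

# Invented stand-in for the parsed JSON; records and field names are hypothetical
graphs <- list(
  Balonne = list(
    list(file = "a.pdf", size = 1200L, date = "2023-06-21"),
    list(file = "b.pdf", size = 3400L, date = "2023-06-21")
  ),
  Barcoo = list(
    list(file = "c.pdf", size = 560L, date = "2023-06-20")
  )
)

# One row per LGA (list-column), then one row per file, then one column per field
files_tbl <- enframe(graphs, name = "lga", value = "files") %>%
  unnest_longer(files) %>%
  unnest_wider(files)

# File sizes and dates for a single LGA, built directly from the nested list
balonne <- pluck(graphs, "Balonne") %>%
  map(as_tibble) %>%
  list_rbind()
balonne[, c("size", "date")]
```

The `unnest_longer()` then `unnest_wider()` pair is the usual tidyr route for a list of records; with the real data, the column names would follow whatever fields the JSON actually carries.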
Created on 2023-06-28 with reprex v2.0.2