I am trying to extract the date from an HTML node using R. This script used to work fine, but I think the webpage has changed and it now returns N/A.

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

webpage = read_html('https://www.longpaddock.qld.gov.au/aussiegrass')
results = webpage %>% html_nodes("#graph-out-last-updated")
webpage_date_chr = html_text(results)
webpage_date_chr  ## This should print the date!!!

The image shows the node and the date, but I cannot extract the date.
Any help would be amazing!!

Cheers

2 Answers


  1. The page appears to be dynamic: the date is filled in by JavaScript after the initial HTML loads, so the page has to be rendered before the data can be scraped. You’ll need a headless browser, such as the one driven by the chromote package.

    I’ve edited your code to something that returned the date "21 June 2023" for me.

    library(tidyverse)
    library(chromote)
    library(rvest)
    
    # Render the page in headless Chrome, then pass the resulting HTML to rvest.
    # Relies on the session object `b` created below.
    chromote_scrape <- function(url) {
      b$Page$navigate(url)
      Sys.sleep(2)  # fixed pause so the page's JavaScript can fill in the date
      x <- b$DOM$getDocument()
      x <- b$DOM$querySelector(x$root$nodeId, "body")
      read_html(b$DOM$getOuterHTML(x$nodeId)$outerHTML)
    }
    
    b <- ChromoteSession$new()
    webpage <- chromote_scrape('https://www.longpaddock.qld.gov.au/aussiegrass')
    results <- webpage %>% html_nodes("#graph-out-last-updated")
    webpage_date_chr <- html_text(results)
    webpage_date_chr  ## This should print the date!!!
    

    For more info, read section 25.7 of R4DS here: https://r4ds.hadley.nz/webscraping.html
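
    The fixed `Sys.sleep(2)` above can be too short on a slow connection and needlessly long on a fast one. A minimal sketch of a polling alternative follows. `wait_for_text()` and `fake_check()` are hypothetical names invented for this illustration, and the check is simulated so the snippet runs without a browser:

    ```r
    # Hypothetical helper: call `check()` until it returns a non-empty string
    # or `timeout` seconds have passed; returns NA_character_ on timeout
    wait_for_text <- function(check, timeout = 10, interval = 0.5) {
      deadline <- Sys.time() + timeout
      repeat {
        value <- check()
        if (length(value) == 1 && !is.na(value) && nzchar(trimws(value))) {
          return(trimws(value))
        }
        if (Sys.time() >= deadline) {
          return(NA_character_)
        }
        Sys.sleep(interval)
      }
    }

    # Stand-in for the real page check (an assumption, not the site's behaviour):
    # empty on the first two calls, then the text appears
    calls <- 0
    fake_check <- function() {
      calls <<- calls + 1
      if (calls < 3) "" else " 21 June 2023 "
    }

    found <- wait_for_text(fake_check, timeout = 5, interval = 0.1)
    found
    ```

    With chromote, `check` could presumably wrap a call that reads the element's text from the live session; that wiring is left out here because it needs a running browser.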

  2. Those details are sourced from a single JSON file and, guessing from the URL, that endpoint should be fairly stable: https://www.longpaddock.qld.gov.au/data/aussie-grass-graphs.json

    The structure itself is nothing special, yet it is not ideal for automatic rectangling into a data.frame. Here is one way to approach it with purrr, tibble and tidyr:

    library(dplyr)
    library(tidyr)
    library(purrr)
    library(lubridate)
    
    grass_graphs <- jsonlite::fromJSON("https://www.longpaddock.qld.gov.au/data/aussie-grass-graphs.json")
    
    lga <- grass_graphs %>% 
      # turn list inside out, top level transforms from 
      # (ACT, NSW, NT, ...) to (lga, subibra)
      list_transpose() %>% 
      # work only with lga items
      pluck("lga") %>%
      # turn list into 2 column name-value tibble, values being lists
      tibble::enframe(name = "region", value = "lga_list") %>% 
      # unnest list columns to longer or wider, one level at a time
      unnest_longer(lga_list, indices_to = "lga") %>% 
      unnest_wider(lga_list) %>% 
      unnest_wider(gif:txt, names_sep = ".") %>% 
      # convert unix timestamps
      mutate(across(ends_with("date"), ~ as_datetime(.x) %>% as_date()))
    lga
    #> # A tibble: 555 × 8
    #>    region gif.size gif.date   pdf.size pdf.date   txt.size txt.date   lga       
    #>    <chr>     <int> <date>        <int> <date>        <int> <date>     <chr>     
    #>  1 ACT       44134 2018-07-31   174970 2023-06-20   286953 2023-06-20 Act       
    #>  2 NSW       43763 2018-07-31   176189 2023-06-20   286953 2023-06-20 AlburyCit…
    #>  3 NSW       44483 2018-07-31   176482 2023-06-20   286953 2023-06-20 ArmidaleR…
    #>  4 NSW       43769 2018-07-31   175293 2023-06-20   286953 2023-06-20 BallinaSh…
    #>  5 NSW       44956 2018-07-31   177756 2023-06-20   286953 2023-06-20 Balranald…
    #>  6 NSW       44267 2018-07-31   176878 2023-06-20   286953 2023-06-20 BathurstR…
    #>  7 NSW       43701 2018-07-31   173621 2023-06-20   286953 2023-06-20 BaysideCo…
    #>  8 NSW       44216 2018-07-31   176428 2023-06-20   286953 2023-06-20 BegaValle…
    #>  9 NSW       43996 2018-07-31   175817 2023-06-20   286953 2023-06-20 Bellingen…
    #> 10 NSW       43992 2018-07-31   176204 2023-06-20   286953 2023-06-20 BerriganS…
    #> # ℹ 545 more rows
    

    Extract file sizes and dates for a single Shire / LGA:

    lga %>% filter(stringr::str_detect(lga, "Aurukun"))
    #> # A tibble: 1 × 8
    #>   region gif.size gif.date   pdf.size pdf.date   txt.size txt.date   lga        
    #>   <chr>     <int> <date>        <int> <date>        <int> <date>     <chr>      
    #> 1 QLD       51200 2018-07-31   184589 2023-06-20   286953 2023-06-20 AurukunShi…
    
    

    Created on 2023-06-28 with reprex v2.0.2
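
    If only a single date is needed, the full rectangling may be more than necessary. Below is a minimal base R sketch that walks straight to one timestamp. It assumes the nesting implied by the pipeline above (region, then `lga`, then the area name, then `gif`/`pdf`/`txt`, each with `size` and `date`). The list `grass_graphs_mock`, the name `ExampleShire` and all values in it are made up for illustration:

    ```r
    # Hypothetical stand-in for the parsed JSON; the nesting is inferred from
    # the pipeline above and the values are invented
    grass_graphs_mock <- list(
      QLD = list(
        lga = list(
          ExampleShire = list(
            gif = list(size = 51200L,  date = 1533000000),
            pdf = list(size = 184589L, date = 1687298400),
            txt = list(size = 286953L, date = 1687298400)
          )
        )
      )
    )

    # Walk down to one Unix timestamp (seconds since 1970-01-01 UTC)
    pdf_stamp <- grass_graphs_mock[["QLD"]][["lga"]][["ExampleShire"]][["pdf"]][["date"]]

    # Convert to a calendar date; the result depends on the timezone chosen
    pdf_time     <- as.POSIXct(pdf_stamp, origin = "1970-01-01", tz = "UTC")
    pdf_date_utc <- as.Date(pdf_time, tz = "UTC")
    pdf_date_qld <- as.Date(pdf_time, tz = "Australia/Brisbane")

    format(pdf_date_utc)
    format(pdf_date_qld)
    ```

    The second conversion is only there to show that the same timestamp can land on different calendar days depending on the timezone. Whether that accounts for any difference between the date shown on the page and the dates in the table is not something confirmed here.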
