
I have a webpage (https://deos.udel.edu/data/daily_retrieval.php) I want to extract data from. However, the data is precipitation data tied to specific selections made within the webpage: station and date. I am using the R package rvest and I am not sure whether this data request can be done in R with rvest. Some of the relevant source code for the webpage is below.

        <label class="retsection" for="station">Station:</label><br>
        <select class="statlist" name="station" id="station" size="10">
<option class="select_input" value="DTBR" selected>Adamsville, DE-Taber</option>
<option class="select_input" value="DBUR">Angola, DE-Burton Pond</option>
<option class="select_input" value="DWHW">Atglen, PA-Wolfs Hollow</option>
<option class="select_input" value="DBBB">Bethany Beach, DE-Boardwalk</option>
<option class="select_input" value="DBNG">Bethany Beach, DE-NGTS</option>
<option class="select_input" value="DBKB">Blackbird, DE-NERR</option>
<option class="select_input" value="DBRG">Bridgeville, DE</option>

        <label class="retsection">Date:<br> </label>
        <select name='month' size='6' length='10'>
<option value='1'>January</option>
<option value='2'>February</option>
<option value='3'>March</option>
<option value='4'>April</option>
<option value='5'>May</option>
<option value='6'>June</option>
<option value='7'>July</option>
<option value='8' selected>August</option>
<option value='9'>September</option>
<option value='10'>October</option>
<option value='11'>November</option>
<option value='12'>December</option>
</select>
<select name='day' size='6' length='4'>
<option value='1'>1</option>
<option value='2' selected>2</option>
<option value='3'>3</option>
<option value='4'>4</option>
<option value='5'>5</option>
<option value='6'>6</option>

My initial thought is this task cannot be done, since the precipitation data is not actively displayed on the webpage… the data pops up in a separate window after the selection is made. I have an access key provided by the webpage but am not 100% sure whether it can be used to retrieve the large dataset I wish to pull.

  1. Is this type of data request feasible with rvest?
  2. What would be some suggested methods for extracting large amounts of data via R? For example, a year's worth of precipitation data for a specific station of interest.

Thanks.

2 Answers


  1. rvest is designed for static HTML scraping, but the precipitation data appears in a separate window and is loaded dynamically based on user selections, so you have to use a different approach:

    • Use the browser’s developer tools to inspect the network requests made when you select a station and date. Look for the request that retrieves the precipitation data. This may be an API endpoint or a form submission that you can replicate in R.
    • If you find a specific URL that retrieves the data, you can use the httr package in R to make GET requests. This allows you to programmatically specify the parameters that select the desired data.
    library(httr)
    library(jsonlite) # httr uses jsonlite to parse JSON responses
    
    url <- "https://example.com/api"                     # placeholder for the real endpoint
    params <- list(station = "DTBR", month = 8, day = 2) # query parameters
    response <- GET(url, query = params)
    data <- content(response, "parsed")                  # parses the response automatically
    

    To retrieve a range of precipitation data without calling each day individually, you can loop through the desired date range and make requests programmatically:

    library(httr)
    library(lubridate)
    library(dplyr)
    
    url <- "https://www.deos.udel.edu/odd-divas/station_daily.php?network=DEOS"
    station <- "DSMY"
    start_date <- as.Date("2024-07-01")
    end_date <- as.Date("2024-07-31")
    
    dates <- seq.Date(start_date, end_date, by = "day")
    all_data <- list()
    for (i in seq_along(dates)) { # looping over a Date vector directly drops the Date class
      date <- dates[i]
      # don't assign these to variables named `day`/`month`: that would
      # shadow the lubridate functions on later iterations
      params <- list(station = station,
                     month = month(date),
                     day = day(date),
                     year = year(date))
      response <- GET(url, query = params)
      if (status_code(response) == 200) {
        all_data[[as.character(date)]] <- content(response, as = "text")
      } else {
        message(paste("Failed to retrieve data for", date))
      }
      Sys.sleep(1) # leave time between requests
    }
    print(all_data)
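
    Since station_daily.php returns an HTML page rather than JSON, each stored response can then be parsed into data frames with rvest. A minimal sketch (the helper name `parse_day` is mine, not from the site):

```r
library(rvest)

# Parse one day's raw HTML response into a list of data frames,
# one per <table> on the page.
parse_day <- function(html_text) {
  html_text |>
    read_html() |>
    html_table()
}
```

    With the loop above, `parse_day(all_data[["2024-07-01"]])` would return that day's tables; which list element holds the precipitation values depends on the page layout, so inspect the result before indexing into it.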
    
    
  2. Disclaimer:

    You probably don’t need to use this. DEOS has ways of downloading historical data as CSVs. Beyond that, if you do scrape the site, leave some time between each request; otherwise you’ll annoy the owners, and they’re likely to block you or slow your responses down.

    Answer:

    The trick with this is that the parameters are included in the URL, so we only need to adjust those to get a new result, as below:

    pacman::p_load(glue, rvest) # glue makes adding parameters to a string easier/cleaner
    
    url_template <- "https://deos.udel.edu/odd-divas/station_daily.php?network={network}&station={station}&month={m}&day={d}&year={y}"
    
    network <- "DEOS"
    station <- "DTBR"
    m <- 8
    d <- 3
    y <- 2024
    
    glue(url_template) |> # interpolate the parameters into the URL once
      read_html() |>
      html_table()
    

    Output:

    [[1]]
    # A tibble: 2 × 4
      X1       X2        X3        X4       
      <chr>    <chr>     <chr>     <chr>    
    1 Station  ""        Network   ""       
    2 Latitude "0° 0' S" Longitude "0° 0' W"
    
    [[2]]
    # A tibble: 24 × 20
        Hour Temp                  …¹ Temp                …² Dew Point           …³
       <int> <chr>                     <chr>                  <chr>                 
     1     0 N/A                       N/A                    N/A                   
     2     1 N/A                       N/A                    N/A                   
     3     2 N/A                       N/A                    N/A                   
     4     3 N/A                       N/A                    N/A                   
     5     4 N/A                       N/A                    N/A                   
     6     5 N/A                       N/A                    N/A                   
     7     6 N/A                       N/A                    N/A                   
     8     7 N/A                       N/A                    N/A                   
     9     8 N/A                       N/A                    N/A                   
    10     9 N/A                       N/A                    N/A                   
    # ℹ 14 more rows
    # ℹ abbreviated names: ¹​`Temp                  n(°F)`,
    #   ²​`Temp                  n(°C)`, ³​`Dew Point                  n(°F)`
    # ℹ 16 more variables: `Dew Point                  n(°C)` <chr>,
    #   `Rel Hum.                  n(%)` <chr>,
    #   `Wind Spd.                  n(MPH)` <chr>,
    #   `Wind Spd.                  n(m/s)` <chr>, …
    # ℹ Use `print(n = ...)` to see more rows
    
    [[3]]
    # A tibble: 1 × 11
      High Temp.                  n…¹ Low Temp.           …² Avg. Temp.          …³
      <chr>                            <chr>                  <chr>                 
    1 N/A                              N/A                    N/A                   
    # ℹ abbreviated names: ¹​`High Temp.                  n(°F)`,
    #   ²​`Low Temp.                  n(°F)`, ³​`Avg. Temp.                  n(°F)`
    # ℹ 8 more variables: `Avg. Dew Point                  n(°F)` <chr>,
    #   `Avg. Rel Hum                  n(%)` <chr>,
    #   `Avg. Wind Spd                  n(MPH)` <chr>,
    #   `Avg. Wind Dir                  n(°)` <chr>,
    #   `Peak Gust                  n(MPH)` <chr>, …
    
    [[4]]
    # A tibble: 1 × 1
      X1   
      <lgl>
    1 NA   
    
    [[5]]
    # A tibble: 3 × 1
      X1                                                                          
      <chr>                                                                       
    1 "Copyright © 2004-2024 DEOS"                                                
    2 "Please read the                   Data Disclaimern before using any data."
    3 "Questions or comments about this page? Click                   heren."
    

    To extend this to multiple days, you could use a for loop, map(), or any number of other functions that do roughly the same thing. But without knowing exactly what information you want from the site, I would say it’s highly likely you can get it from DEOS in other ways.
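
    As a sketch of that extension, one URL per day can be built from the same template and then fetched in turn (the endpoint and parameters come from the answer above; the two-second pause is an assumption, chosen to stay polite to the server):

```r
library(glue)
library(rvest)

url_template <- "https://deos.udel.edu/odd-divas/station_daily.php?network={network}&station={station}&month={m}&day={d}&year={y}"
network <- "DEOS"
station <- "DTBR"

dates <- seq(as.Date("2024-08-01"), as.Date("2024-08-03"), by = "day")

# Build one URL per day by interpolating the date parts into the template
urls <- vapply(dates, function(date) {
  m <- as.integer(format(date, "%m"))
  d <- as.integer(format(date, "%d"))
  y <- as.integer(format(date, "%Y"))
  as.character(glue(url_template))
}, character(1))

# Fetching would then be (network call, so commented out here):
# results <- lapply(urls, function(u) {
#   Sys.sleep(2) # leave time between requests
#   read_html(u) |> html_table()
# })
```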
