skip to Main Content

I would like to extract data from the the table on this website using R for all available dates, times and states. I’m new to web scraping so I’m struggling to figure out what to extract, especially since there are dropdown menus involved in selecting the date/time/state.

So far I’ve managed to extract the dates, times and states, but I don’t know how to use this information to extract the actual data from the tables.

url <- "https://apims.doe.gov.my/api_table.html"
webpage <- read_html(url)

dates <- webpage %>%
  html_node("#pickdate") %>%
  html_nodes("option") %>%
  html_text()

times <- webpage %>%
  html_node("#picktime") %>%
  html_nodes("option") %>%
  html_text()

all_state <- webpage %>%
  html_node("#pickstate") %>%
  html_nodes("option[value='ALL']") %>%
  html_text()

2

Answers


  1. library(httr)
    library(jsonlite)
    library(tidyr)
    
    URL = "https://apims.doe.gov.my/data/public_v2/CAQM/last24hours.json"
    
    
    res <- httr::GET(url = URL)
    
    # Get table as matrix.
    table <- fromJSON(content(res, as = "text"))[["24hour_api_apims"]]
    # Convert to data frame.
    table <- data.frame(table[-1,]) |>
      setNames(table[1,])
    
    # Data should all be there... but it would be better to have it in long format.
    
    pivot_longer(table, c(-"State", -"Location"), names_to = "time")
    

    Results in long format:

    # A tibble: 1,560 × 4
       State  Location time    value
       <chr>  <chr>    <chr>   <chr>
     1 PERLIS Kangar   6:00PM  56** 
     2 PERLIS Kangar   7:00PM  57** 
     3 PERLIS Kangar   8:00PM  56** 
     4 PERLIS Kangar   9:00PM  56** 
     5 PERLIS Kangar   10:00PM 56** 
     6 PERLIS Kangar   11:00PM 56** 
     7 PERLIS Kangar   12:00AM 56** 
     8 PERLIS Kangar   1:00AM  55** 
     9 PERLIS Kangar   2:00AM  55** 
    10 PERLIS Kangar   3:00AM  55** 
    # ℹ 1,550 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    If you want the data in "wide" format then omit last step.

    Login or Signup to reply.
  2. Instead of using SelectorGadget or element inspector in your browser’s dev.tools, that operates on javascript-rendered DOM tree, you might want to start with checking the actual page source as this is what’s accessible through read_html(), it can be quite different from what you see in object inspector.

    E.g. open view-source:https://apims.doe.gov.my/api_table.html in your browser, in this case it’s quite compact and nicely formatted, perfect for learning.

    From there it should be clear that those tables are built dynamically and data is not part of the page source (i.e. not accessible through read_html(url) ). If you now switch to network tab of browser’s dev tools and fiddle with form controls to load measurements for different dates, times and states, you should see requests to API endpoints that actually serve that data. You should also notice that with each parameter change, 2 request are made, one for stationary and another for mobile stations (CAQM / MCAQM). And it always delivers data for 24 hours and all states.

    You also might want to check sourced JavaScript, in this particular case it’s not minified and is easy to read, providing more insights of what is going on behind the scenes: js/public_UI.js & js/data_table2.js

    Long story short, instead of scraping, just generate those request yourself and parse returned JSONs:

    library(jsonlite)
    library(stringr)
    library(dplyr)
    library(tidyr)
    library(purrr)
    
    # glue template
    apims_tmpl <- "https://apims.doe.gov.my/data/public_v2/{station_type_endpoint}/{meas_date}/0000.json"
    
    # get JSON, convert to tibble
    get_apims <- function(api_url){
      fromJSON(api_url)[[1]] %>% 
        # matrix, first row holds column names, 
        # set first row to colnames before converting to tibble
        { `colnames<-`(.[-1,], .[1,]) } |>
        as_tibble()
    }
    
    # add some rate limiting, pause for 1s between requests
    slowly_get_apims <- slowly(get_apims, rate = rate_delay(1))
    
    # generate a list of dates, let's pick a 2-day range that's not accessible
    # though the form
    apims <- 
      seq(as.Date("2024/05/01", "%Y/%m/%d"), 
          as.Date("2024/05/02", "%Y/%m/%d"), "days") |>
      strftime("%Y/%m/%d") |> 
      # a frame with combinations of dates and CAQM / MCAQM
      expand_grid(meas_date = _ ,station_type_endpoint = c("CAQM/hours24", "MCAQM/mcaqmhours24")) |>
      # call slowly_get_apims() on each url that is built from 
      # apims_tmpl & values in columns meas_date & station_type_endpoint;
      # returns nested tibble (response tibble for every row)
      mutate(apims = map(str_glue(apims_tmpl), slowly_get_apims, .progress = TRUE)) |>
      # unnest
      unnest(apims) 
    
    #> ■■■■■■■■■■■■■■■■■■■■■■■ 75% | ETA: 3s
    

    Result:

    apims
    #> # A tibble: 136 × 28
    #>    meas_date  station_type_endpoint State    Location `1:00AM` `2:00AM` `3:00AM`
    #>    <chr>      <chr>                 <chr>    <chr>    <chr>    <chr>    <chr>   
    #>  1 2024/05/01 CAQM/hours24          PERLIS   Kangar   52**     52**     53**    
    #>  2 2024/05/01 CAQM/hours24          KEDAH    Langkawi 49**     48**     47**    
    #>  3 2024/05/01 CAQM/hours24          KEDAH    Alor Se… 69**     69**     69**    
    #>  4 2024/05/01 CAQM/hours24          KEDAH    Sungai … 61**     62**     63**    
    #>  5 2024/05/01 CAQM/hours24          KEDAH    Kulim H… 64**     64**     64**    
    #>  6 2024/05/01 CAQM/hours24          PULAU P… Seberan… 65**     65**     65**    
    #>  7 2024/05/01 CAQM/hours24          PULAU P… Seberan… 63**     63**     62**    
    #>  8 2024/05/01 CAQM/hours24          PULAU P… Minden   65**     65**     65**    
    #>  9 2024/05/01 CAQM/hours24          PULAU P… Balik P… 68**     69**     69**    
    #> 10 2024/05/01 CAQM/hours24          PERAK    Taiping  69**     69**     68**    
    #> # ℹ 126 more rows
    #> # ℹ 21 more variables: `4:00AM` <chr>, `5:00AM` <chr>, `6:00AM` <chr>,
    #> #   `7:00AM` <chr>, `8:00AM` <chr>, `9:00AM` <chr>, `10:00AM` <chr>,
    #> #   `11:00AM` <chr>, `12:00PM` <chr>, `1:00PM` <chr>, `2:00PM` <chr>,
    #> #   `3:00PM` <chr>, `4:00PM` <chr>, `5:00PM` <chr>, `6:00PM` <chr>,
    #> #   `7:00PM` <chr>, `8:00PM` <chr>, `9:00PM` <chr>, `10:00PM` <chr>,
    #> #   `11:00PM` <chr>, `12:00AM` <chr>
    glimpse(apims)
    #> Rows: 136
    #> Columns: 28
    #> $ meas_date             <chr> "2024/05/01", "2024/05/01", "2024/05/01", "2024/…
    #> $ station_type_endpoint <chr> "CAQM/hours24", "CAQM/hours24", "CAQM/hours24", …
    #> $ State                 <chr> "PERLIS", "KEDAH", "KEDAH", "KEDAH", "KEDAH", "P…
    #> $ Location              <chr> "Kangar", "Langkawi", "Alor Setar", "Sungai Peta…
    #> $ `1:00AM`              <chr> "52**", "49**", "69**", "61**", "64**", "65**", …
    #> $ `2:00AM`              <chr> "52**", "48**", "69**", "62**", "64**", "65**", …
    #> $ `3:00AM`              <chr> "53**", "47**", "69**", "63**", "64**", "65**", …
    #> $ `4:00AM`              <chr> "54**", "47**", "69**", "63**", "64**", "65**", …
    #> $ `5:00AM`              <chr> "54**", "46**", "70**", "63**", "65**", "65**", …
    #> $ `6:00AM`              <chr> "54**", "46**", "70**", "63**", "65**", "65**", …
    #> $ `7:00AM`              <chr> "55**", "47**", "70**", "63**", "65**", "65**", …
    #> $ `8:00AM`              <chr> "55**", "46**", "70**", "63**", "65**", "65**", …
    #> $ `9:00AM`              <chr> "56**", "47**", "70**", "63**", "65**", "66**", …
    #> $ `10:00AM`             <chr> "56**", "47**", "71**", "63**", "65**", "67**", …
    #> $ `11:00AM`             <chr> "56**", "48**", "71**", "64**", "65**", "67**", …
    #> $ `12:00PM`             <chr> "56**", "50**", "72**", "65**", "64**", "68**", …
    #> $ `1:00PM`              <chr> "56**", "51**", "72**", "66**", "64**", "67**", …
    #> $ `2:00PM`              <chr> "56**", "51**", "72**", "66**", "65**", "67**", …
    #> $ `3:00PM`              <chr> "55**", "51**", "73**", "66**", "64**", "67**", …
    #> $ `4:00PM`              <chr> "56**", "51**", "73**", "66**", "64**", "66**", …
    #> $ `5:00PM`              <chr> "55**", "50**", "73**", "66**", "64**", "67**", …
    #> $ `6:00PM`              <chr> "55**", "51**", "73**", "65**", "63**", "67**", …
    #> $ `7:00PM`              <chr> "56**", "51**", "73**", "65**", "62**", "67**", …
    #> $ `8:00PM`              <chr> "55**", "51**", "73**", "65**", "62**", "68**", …
    #> $ `9:00PM`              <chr> "55**", "51**", "74**", "64**", "62**", "66**", …
    #> $ `10:00PM`             <chr> "56**", "51**", "74**", "62**", "61**", "65**", …
    #> $ `11:00PM`             <chr> "56**", "52**", "74**", "62**", "60**", "64**", …
    #> $ `12:00AM`             <chr> "56**", "52**", "74**", "61**", "60**", "63**", …
    

    Created on 2024-06-07 with reprex v2.1.0

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search