I would like to extract data from the table on this website using R, for all available dates, times and states. I'm new to web scraping, so I'm struggling to figure out what to extract, especially since dropdown menus are involved in selecting the date/time/state.
So far I’ve managed to extract the dates, times and states, but I don’t know how to use this information to extract the actual data from the tables.
library(rvest)  # read_html(), html_node(); also re-exports the %>% pipe

url <- "https://apims.doe.gov.my/api_table.html"
webpage <- read_html(url)

# option values of the date dropdown
dates <- webpage %>%
  html_node("#pickdate") %>%
  html_nodes("option") %>%
  html_text()

# option values of the time dropdown
times <- webpage %>%
  html_node("#picktime") %>%
  html_nodes("option") %>%
  html_text()

# the "ALL" entry of the state dropdown
all_state <- webpage %>%
  html_node("#pickstate") %>%
  html_nodes("option[value='ALL']") %>%
  html_text()
Answers
Results in long format:
If you want the data in "wide" format, omit the last step.
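The answer's reshaping code is not shown above, but the long-vs-wide step it refers to can be sketched with tidyr. This is a minimal illustration on toy data; the column names (`state`, `time`, `api`) are assumptions, not the answer's actual columns:

```r
library(tidyr)

# Toy long-format readings; column names are illustrative assumptions.
long <- data.frame(
  state = c("Johor", "Johor", "Kedah", "Kedah"),
  time  = c("08:00AM", "09:00AM", "08:00AM", "09:00AM"),
  api   = c(52, 54, 61, 60)
)

# The "last step" to omit for wide format would be the inverse
# (pivot_longer); here pivot_wider() spreads one column per time point.
wide <- pivot_wider(long, names_from = time, values_from = api)
```

`wide` then has one row per state and one column per time, which matches the on-page table layout.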
Instead of using SelectorGadget or the element inspector in your browser's dev tools, both of which operate on the JavaScript-rendered DOM tree, you might want to start by checking the actual page source, as that is what's accessible through read_html(); it can be quite different from what you see in the element inspector. E.g. open view-source:https://apims.doe.gov.my/api_table.html in your browser; in this case the source is quite compact and nicely formatted, perfect for learning.

From there it should be clear that those tables are built dynamically and the data is not part of the page source (i.e. not accessible through read_html(url)). If you now switch to the network tab of the browser's dev tools and fiddle with the form controls to load measurements for different dates, times and states, you should see requests to the API endpoints that actually serve that data. You should also notice that with each parameter change, 2 requests are made: one for stationary and one for mobile stations (CAQM / MCAQM). Each response always delivers data for 24 hours and all states.

You might also want to check the sourced JavaScript; in this particular case it's not minified and is easy to read, providing more insight into what is going on behind the scenes: js/public_UI.js & js/data_table2.js
Long story short: instead of scraping, just generate those requests yourself and parse the returned JSONs:
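A minimal sketch of that request-and-parse approach, using httr and jsonlite. The endpoint URL below is a placeholder, not the real path: copy the actual CAQM / MCAQM URLs from the network tab, one per station type:

```r
library(httr)      # GET(), stop_for_status(), content()
library(jsonlite)  # fromJSON()

# Placeholder: substitute the CAQM / MCAQM endpoint URL copied from
# the browser's network tab.
caqm_url <- "https://apims.doe.gov.my/data/..."

# Fetch one endpoint and parse the returned JSON into R structures.
fetch_json <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # fail loudly on HTTP errors
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage, once the real URL is filled in:
# caqm <- fetch_json(caqm_url)
# str(caqm)  # inspect the structure before flattening into a data frame
```

Repeating the call for the second (mobile-station) endpoint and binding the results covers both request types the page makes on each parameter change.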
Result:
Created on 2024-06-07 with reprex v2.1.0