Html - Web Scrape Numbers in R?

JamesRider
July 26, 2023
306 views
1 vote
2 Answers

In R, I am trying to scrape all the working paper numbers (e.g., 31424, 31481, etc.) from the following webpage:

https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date

I am running the following code to get them:

library(rvest)

url <- "https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date"
page <- read_html(url)
name <- page %>% html_nodes(".paper-card__paper_number") %>% html_text()

However, this code returns character(0) instead of the working paper numbers. Is there any way I can modify it to get them?

Tags: html r rvest web-scraping

Answers

- PrashantPatil
- July 25, 2023 at 5:04 am
- 0 votes
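Your code most likely returns character(0) because the paper cards are added by JavaScript after the initial HTML is delivered, so they are not in the source that rvest downloads. To scrape dynamically generated content like this, you can use a browser automation tool such as RSelenium, which lets you control a real web browser programmatically. Here’s how you can modify your code to achieve this: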

1. First, make sure you have RSelenium and rvest installed:
```
install.packages("RSelenium")
install.packages("rvest")
```
2. Load the required libraries:
```
library(RSelenium)
library(rvest)
```
3. Start a Selenium server and open a browser:
```
driver <- rsDriver(browser="chrome", chromever="latest", port=4567L)
remDr <- driver[["client"]]
```
4. Navigate to the desired URL:
```
url <- "https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date"
remDr$navigate(url)
```
5. Get the working paper numbers:
```
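# Give the JavaScript-rendered paper cards a moment to load (5 s is an arbitrary guess)
Sys.sleep(5)
page_source <- remDr$getPageSource()[[1]]
page <- read_html(page_source)
name <- page %>% html_nodes(".paper-card__paper_number") %>% html_text()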
```
6. Close the browser and stop the Selenium server:
```
remDr$close()
driver$server$stop()
```
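
The `name` vector from step 5 holds whatever text sits inside each matched node, which may include a label or whitespace around the number. Here is a minimal sketch of cleaning it, assuming (not verified against the live page) that each entry looks something like "Working Paper 31481". The sample strings and the helper name `extract_wp_number` are hypothetical:

```r
# Hypothetical helper: keep only the digits of each scraped string.
extract_wp_number <- function(x) {
  digits <- gsub("[^0-9]", "", x)   # drop every non-digit character
  as.integer(digits)                # entries with no digits become NA
}

# Illustrative input, standing in for the `name` vector from step 5.
sample_text <- c("Working Paper 31481", "  Working Paper 31424\n")
wp_numbers <- extract_wp_number(sample_text)
print(wp_numbers)
```

If the node text turns out to contain other digits as well (a date, for instance), this simple approach would merge them into one number, so a stricter pattern would be needed.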

Another alternative to Selenium is to query NBER’s REST API, which returns fairly simple JSON containing a data-frame-like object that holds not just the working paper number but a lot of other useful information, such as authors, title, and date.
Querying the API is much faster than resorting to Selenium because no browser has to be launched and the server returns far less data to the client.

The API allows you to paginate, with each query returning up to 100 results. You can find the API’s URL by inspecting the network traffic of your web browser session while the page loads.

library(dplyr)
library(jsonlite)
    
url_to_json <- "https://www.nber.org/api/v1/working_page_listing/contentType/working_paper/_/_/search?facet=topics%3AFinancial%20Economics&page=1&perPage=100&sortBy=public_date"
json_p01    <- fromJSON(txt = url_to_json) 

df_p01 <- as_tibble(json_p01$results) |> 
          mutate(wp_id = sub(pattern = "^.*papers[/]w", replacement = "", url))

df_p01 |> select(displaydate, title, wp_id, abstract)
# A tibble: 100 × 4
   displaydate title                                      wp_id abstract
   <chr>       <chr>                                      <chr> <chr>   
 1 July 2023   The Impact of Money in Politics on Labor … 31481 We exam…
 2 July 2023   Aggregate Lending and Modern Financial In… 31484 Existin…
 3 July 2023   Financial Machine Learning                 31502 We surv…
 4 July 2023   Housing, Household Debt, and the Business… 31489 China a…
 5 July 2023   Selection-Neglect in the NFT Bubble        31498 Using t…
 6 July 2023   Social Security Claiming Intentions: Psyc… 31499 For man…
 7 July 2023   Sparse Modeling Under Grouped Heterogenei… 31424 Sparse …
 8 July 2023   Firms with Benefits? Nonwage Compensation… 31463 Using a…
 9 July 2023   The Credit Supply Channel of Monetary Pol… 31464 This pa…
10 July 2023   Bank Branch Density and Bank Runs          31462 Bank br…
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows
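
The `wp_id` column above comes from the `sub()` call, which deletes everything up to and including `papers/w`. Here is a small offline illustration. The sample URLs are made up for the example and only assume the `.../papers/w<number>` shape implied by the pattern:

```r
# Made-up URL values with the ".../papers/w<number>" shape the pattern expects.
sample_urls <- c("/papers/w31481", "https://www.nber.org/papers/w31424")

# "^.*papers[/]w" matches from the start of the string through "papers/w",
# so replacing the match with "" leaves only the trailing number.
wp_id <- sub(pattern = "^.*papers[/]w", replacement = "", x = sample_urls)
print(wp_id)

# sub() returns character; convert if a numeric ID is needed.
wp_id_int <- as.integer(wp_id)
```

A string that does not contain `papers/w` is returned unchanged by `sub()`, so it is worth checking for unexpected values before converting to integer.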

If you need to capture a second page, then paginate by modifying the URL.

library(urltools)
url_to_json_page_2 <- urltools::param_set(urls = url_to_json,  key = "page", value = 2)
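
To collect several pages at once, the same idea can be wrapped in a loop. The sketch below is one possible approach, not part of the answer's tested code. `set_page()` and `fetch_pages()` are hypothetical helpers, the page parameter is swapped with base R `sub()` rather than `urltools`, and the code assumes every page returns a `results` element with compatible columns.

```r
# Hypothetical helper: replace the value of the "page" query parameter.
# The leading [?&] keeps "perPage=" from being matched.
set_page <- function(url, page) {
  sub("([?&])page=[0-9]+", paste0("\\1page=", page), url)
}

# Hypothetical helper: download the given pages and stack their results.
# Requires the jsonlite and dplyr packages; nothing is fetched until called.
fetch_pages <- function(url, pages) {
  results <- lapply(pages, function(p) {
    jsonlite::fromJSON(txt = set_page(url, p))$results
  })
  dplyr::bind_rows(results)
}

# Offline check of the URL rewriting on a placeholder address.
demo_url <- "https://example.org/search?facet=x&page=1&perPage=100"
page_2_url <- set_page(demo_url, 2)
print(page_2_url)
```

With the `url_to_json` defined earlier, `fetch_pages(url_to_json, pages = 1:3)` would request pages 1 to 3. Adding a short `Sys.sleep()` between requests is a reasonable courtesy to the server.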
