In R, I am trying to scrape all the working paper numbers (e.g., 31424, 31481, etc.) from the following webpage:
I am running the following code to do so:
library(rvest)

url <- "https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date"
page <- read_html(url)
name <- page %>% html_nodes(".paper-card__paper_number") %>% html_text()
However, this code returns character(0) instead of the working paper numbers. Is there a way I can modify it to get them?
2 Answers
To scrape dynamically generated content, you can use a headless browser automation tool like RSelenium, which allows you to control a real web browser programmatically. Here’s how you can modify your code to achieve this:
1. First, make sure you have RSelenium and rvest installed.
2. Load the required libraries.
3. Start a Selenium server and open a browser.
4. Navigate to the desired URL.
5. Get the working paper numbers.
6. Stop the Selenium server and close the browser.
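The steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes Firefox is available, that the default `rsDriver()` port is free, and that a fixed `Sys.sleep()` is enough for the page's JavaScript to render the paper cards.

```r
# install.packages(c("RSelenium", "rvest"))  # step 1: install if needed

library(RSelenium)  # step 2: load the required libraries
library(rvest)

# Step 3: start a Selenium server and open a browser (Firefox assumed)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

# Step 4: navigate to the desired URL
url <- "https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date"
remDr$navigate(url)
Sys.sleep(5)  # crude wait for the JavaScript-rendered content to appear

# Step 5: parse the *rendered* page source, then extract the paper numbers
page <- read_html(remDr$getPageSource()[[1]])
numbers <- page %>%
  html_nodes(".paper-card__paper_number") %>%
  html_text()

# Step 6: stop the Selenium server and close the browser
remDr$close()
rD$server$stop()
```

The key difference from the original code is that `read_html()` is fed the browser's rendered DOM via `getPageSource()`, rather than the raw server response, so the dynamically inserted `.paper-card__paper_number` nodes are present.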
Another alternative to Selenium is to query NBER's RESTful API, which returns fairly simple JSON containing a data.frame-like object that holds not just the working paper number but a lot of useful information: authors, title, date, etc.
Accessing the API is much faster than going through Selenium, because the server returns far less data to the client.
The API allows you to paginate, with each query returning up to 100 results. You can find the API's URL by inspecting the network traffic of your web browser session.
If you need to capture a second page, then paginate by modifying the URL.
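A sketch of the API approach is below. The endpoint and the JSON field names (`results`, `url`) are assumptions based on what the browser's network tab shows at the time of writing; NBER may change them, so verify them in your own session before relying on this.

```r
library(httr)
library(jsonlite)

# Endpoint observed by inspecting network traffic (assumption: subject to change).
# perPage can go up to 100; change the page parameter to paginate.
api_url <- paste0(
  "https://www.nber.org/api/v1/working_page_listing/contentType/working_paper/_/_/search",
  "?facet=topics%3AFinancial%20Economics&page=1&perPage=100&sortBy=public_date"
)

resp <- GET(api_url)
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# parsed$results is a data.frame-like object with title, authors, dates, etc.
# Assumption: each row's url field looks like "/papers/w31424",
# so the working paper number can be stripped out of it.
numbers <- sub(".*/w", "", parsed$results$url)

# For the second page, just modify the URL and repeat the request:
api_url_p2 <- sub("page=1", "page=2", api_url)
```

Looping `page=` from 1 upward until the API returns an empty `results` object is a simple way to collect the full listing.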