I’m doing some scraping, but as I’m parsing approximately 4000 URLs, the website eventually detects my IP and blocks me every 20 iterations.
I’ve added a bunch of Sys.sleep(5) calls and a tryCatch so I’m not blocked too soon.
I use a VPN, but I have to manually disconnect and reconnect it every now and then to change my IP. That’s not a workable solution for a scraper that’s supposed to run all night long.
I think rotating a proxy should do the job.
Here’s my current code (part of it, at least):
library(rvest)
library(dplyr)
scraped_data = data.frame()
for (i in urlsuffixes$suffix)
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/", i)
page = read_html(doctolib_url)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
name,
address,
job_title,
stringsAsFactors = FALSE))
}, error = function(e){cat("Houston, we have a problem !", "\n", conditionMessage(e), "\n")})
print(paste("Page : ", i))
}
2 Answers
Interesting question. I think the first thing to note is that, as mentioned on this GitHub issue, rvest and xml2 use httr for the connections. As such, I’m going to introduce httr into this answer.

Using a proxy with httr
The following code chunk shows how to use httr to query a url using a proxy and extract the html content.
If you are using IP authentication or don’t need a username and password, you can simply exclude those values from the call.
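A minimal sketch of that call, with placeholder proxy host, port, and credentials:

library(httr)
library(rvest)

response = GET(
  "https://www.website.com/test/some-suffix",      # any page URL
  use_proxy(url = "my.proxy.host", port = 8080,    # placeholder proxy details
            username = "user", password = "pass")
)

# content() parses the response body; for an HTML page this returns
# an xml_document that the usual rvest functions can work with
page = content(response)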
In short, you can replace the page = read_html(doctolib_url) line with the code chunk above.

Rotating the Proxies
One big problem with using proxies is getting reliable ones. For this, I’m just going to assume that you have a reliable source. Since you haven’t indicated otherwise, I’m going to assume that your proxies are stored in the following reasonable format, in an object named proxies:
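For example, something along these lines (all values made up):

proxies = data.frame(
  ip       = c("123.45.67.89", "98.76.54.32"),   # placeholder addresses
  port     = c(8080, 3128),
  username = c("user1", "user2"),
  password = c("pass1", "pass2"),
  stringsAsFactors = FALSE
)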
With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:
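A sketch of that rotation, assuming the proxies format above; the loop now runs over indices so each request can pick the next row of proxies, wrapping around when the list is exhausted:

proxy_count = nrow(proxies)

for (i in seq_along(urlsuffixes$suffix)) {
  # cycle through the proxy list, one row per request
  proxy = proxies[((i - 1) %% proxy_count) + 1, ]

  doctolib_url = paste0("https://www.website.com/test/",
                        urlsuffixes$suffix[i])
  response = GET(
    doctolib_url,
    use_proxy(url = proxy$ip, port = proxy$port,
              username = proxy$username, password = proxy$password)
  )
  page = content(response)

  # ...the html_nodes()/html_text() extraction and rbind() stay the same...
}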
This may not be enough
You might want to go a few steps further and add elements to the httr request, such as the user-agent etc.
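For instance, a user-agent header can be bundled into the same GET() call (the string below is purely illustrative):

response = GET(
  doctolib_url,
  use_proxy(url = proxy$ip, port = proxy$port,
            username = proxy$username, password = proxy$password),
  user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")   # illustrative UA string
)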
However, one of the big problems with a package like httr is that it can’t render dynamic html content, such as JavaScript-rendered html, and any website that really cares about blocking scrapers is going to detect this. To conquer this problem there are tools such as Headless Chrome that are meant to address specifically stuff like this. Here’s a package you might want to look into for headless Chrome in R (NOTE: still in development).
Disclaimer

Obviously, I think this code will work, but since there’s no reproducible data to test with, it may not.
As @Daniel-Molitor already said, headless Chrome gives stunning results.
Another cheap option in RStudio is looping over a list of proxies, though you then have to start a new R process for each one.
Sys.sleep(1) can even be omitted then 😉
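The answer doesn’t spell out the mechanics, but one way to sketch the idea is with the callr package (my addition, not mentioned above): each batch of URLs runs in a fresh R process, with its proxy set through the https_proxy environment variable that libcurl (and therefore httr) honours. The proxy strings and the batch split below are placeholders:

library(callr)

# placeholder proxies in user:pass@host:port form -- substitute real ones
proxy_list = c("http://user:pass@123.45.67.89:8080",
               "http://user:pass@98.76.54.32:3128")

# split the url suffixes into one batch per proxy
batches = split(urlsuffixes$suffix,
                cut(seq_along(urlsuffixes$suffix), length(proxy_list)))

scrape_batch = function(suffixes, proxy_url) {
  # runs in a brand-new R process, so the proxy setting is isolated per batch
  Sys.setenv(https_proxy = proxy_url)
  library(httr)
  library(rvest)
  do.call(rbind, lapply(suffixes, function(s) {
    page = content(GET(paste0("https://www.website.com/test/", s)))
    data.frame(
      name = html_text(html_nodes(page, ".seo-directory-doctor-link")),
      stringsAsFactors = FALSE
    )
  }))
}

results = lapply(seq_along(proxy_list), function(j) {
  # callr::r() starts a new R process for every batch
  callr::r(scrape_batch,
           args = list(suffixes = batches[[j]], proxy_url = proxy_list[j]))
})
scraped_data = do.call(rbind, results)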