
I’m doing some scraping, but as I’m parsing approximately 4,000 URLs, the website eventually detects my IP and blocks me every 20 iterations.

I’ve added a bunch of Sys.sleep(5) calls and a tryCatch so I’m not blocked too soon.

I use a VPN, but I have to disconnect and reconnect it manually every now and then to change my IP. That’s not a suitable solution for a scraper that’s supposed to run all night long.

I think rotating proxies should do the job.

Here’s my current code (part of it, at least):

library(rvest)
library(dplyr)

scraped_data = data.frame()

for (i in urlsuffixes$suffix) {
  
  tryCatch({
    message("Let's scrape that, Buddy !")
    
    Sys.sleep(5)
 
    doctolib_url = paste0("https://www.website.com/test/", i)

    page = read_html(doctolib_url)
    
    links = page %>%
      html_nodes(".seo-directory-doctor-link") %>%
      html_attr("href")
    
    Sys.sleep(5)
    
    name = page %>%
      html_nodes(".seo-directory-doctor-link") %>%
      html_text()
    
    Sys.sleep(5)
    
    job_title = page %>%
      html_nodes(".seo-directory-doctor-speciality") %>%
      html_text()
    
    Sys.sleep(5)
    
    address = page %>%
      html_nodes(".seo-directory-doctor-address") %>%
      html_text()
    
    Sys.sleep(5)
    
    scraped_data = rbind(scraped_data, data.frame(links,
                                                  name,
                                                  address,
                                                  job_title,
                                                  stringsAsFactors = FALSE))
    
  }, error = function(e){cat("Houston, we have a problem !", "\n", conditionMessage(e), "\n")})
  print(paste("Page : ", i))
}

2 Answers


  1. Interesting question. I think the first thing to note is that, as mentioned in this GitHub issue, rvest and xml2 use httr for the connections. As such, I’m going to introduce httr into this answer.

    Using a proxy with httr

    The following code chunk shows how to use httr to query a URL through a proxy and extract the HTML content.

    page <- httr::content(
        httr::GET(
            url, 
            httr::use_proxy(ip, port, username, password)
        )
    )
    

    If you are using IP authentication or don’t need a username and password, you can simply exclude those values from the call.
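    For example, with an IP-authenticated proxy (no credentials needed), the call could look like the sketch below, which reuses one of the example proxies listed further down:

    # minimal sketch: IP-authenticated proxy, so no username/password arguments
    page <- httr::content(
        httr::GET(
            url,
            httr::use_proxy("64.235.204.107", 8080)
        )
    )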

    In short, you can replace the page = read_html(doctolib_url) line with the code chunk above.

    Rotating the Proxies

    One big problem with using proxies is finding reliable ones. For this, I’m just going to assume that you have a reliable source. Since you haven’t indicated otherwise, I’m going to assume that your proxies are stored in a data frame named proxies in the following reasonable format:

    ip port
    64.235.204.107 8080
    167.71.190.253 80
    185.156.172.122 3128
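
    If it helps, that table can be created directly in R as a data frame; here is a minimal sketch using the example rows above:

    # build the example proxy table shown above
    proxies <- data.frame(
      ip   = c("64.235.204.107", "167.71.190.253", "185.156.172.122"),
      port = c(8080, 80, 3128),
      stringsAsFactors = FALSE
    )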

    With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:

    library(dplyr)
    library(httr)
    library(rvest)
    
    scraped_data = data.frame()
    
    for (i in seq_along(urlsuffixes$suffix)) {
      
      tryCatch({
        message("Let's scrape that, Buddy !")
        
        Sys.sleep(5)
     
        doctolib_url = paste0("https://www.website.com/test/", 
                              urlsuffixes$suffix[[i]])
       
        # There are more urls than proxies, so cycle through the proxy list:
        # map iteration i to a row of `proxies` (1-based modulo indexing)
        proxy_id <- ifelse(i %% nrow(proxies) == 0, nrow(proxies), i %% nrow(proxies))
    
        page <- httr::content(
            httr::GET(
                doctolib_url, 
                httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
            )
        )
        
        links = page %>%
          html_nodes(".seo-directory-doctor-link") %>%
          html_attr("href")
        
        Sys.sleep(5)
        
        name = page %>%
          html_nodes(".seo-directory-doctor-link") %>%
          html_text()
        
        Sys.sleep(5)
        
        job_title = page %>%
          html_nodes(".seo-directory-doctor-speciality") %>%
          html_text()
        
        Sys.sleep(5)
        
        address = page %>%
          html_nodes(".seo-directory-doctor-address") %>%
          html_text()
        
        Sys.sleep(5)
        
        scraped_data = rbind(scraped_data, data.frame(links,
                                                      name,
                                                      address,
                                                      job_title,
                                                      stringsAsFactors = FALSE))
        
      }, error = function(e){cat("Houston, we have a problem !", "\n", conditionMessage(e), "\n")})
      print(paste("Page : ", i))
    }
    

    This may not be enough

    You might want to go a few steps further and add elements to the httr request, such as a user-agent header. However, one of the big problems with a package like httr is that it can’t render dynamic HTML content, such as JavaScript-rendered HTML, and any website that really cares about blocking scrapers is going to detect this. To overcome that problem there are tools such as headless Chrome that are designed to address exactly this kind of thing. Here’s a package you might want to look into for headless Chrome in R (NOTE: still in development).
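
    As a sketch of the first point, extra request options like a user-agent can be passed to httr::GET alongside the proxy; the user-agent string and Accept-Language value below are just example values:

    page <- httr::content(
        httr::GET(
            doctolib_url,
            httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]]),
            httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
            httr::add_headers(`Accept-Language` = "fr-FR,fr;q=0.9")
        )
    )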

    Disclaimer

    Obviously, I think this code will work, but since there’s no reproducible data to test with, it may not.

  2. As already said by @Daniel-Molitor, headless Chrome gives stunning results.
    Another cheap option in RStudio is to loop over a list of proxies, although you have to start a new R process after each change:

    Sys.setenv(http_proxy=proxy)
    .rs.restartR()
    

    Sys.sleep(1) can even be omitted afterwards 😉
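
    To automate that pattern outside of RStudio, one option is to run each batch of urls in a fresh R process via the callr package, setting http_proxy for the child process. This is only a sketch under that assumption; url_batches is a placeholder for your urls split into chunks:

    library(callr)

    proxy_urls <- c("http://64.235.204.107:8080", "http://167.71.190.253:80")

    # scrape one batch of urls inside a clean child R session,
    # which picks up the proxy from the http_proxy environment variable
    scrape_batch <- function(urls) {
      lapply(urls, function(u) httr::content(httr::GET(u)))
    }

    results <- lapply(seq_along(proxy_urls), function(j) {
      callr::r(
        scrape_batch,
        args = list(urls = url_batches[[j]]),  # url_batches: assumed list of url chunks
        env  = c(callr::rcmd_safe_env(), http_proxy = proxy_urls[j])
      )
    })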
