I am trying to scrape a list of international locations from this website: https://restaurants.subway.com

The issue with my code is that the href from the "links" variable keeps getting appended to my URL by the paste() call. Is there a way to have each href added to the URL only once, so that the recursion reaches the address page for each individual location?

This is my current code:

library(dplyr)
library(rvest)

subway <- "https://restaurants.subway.com"

addressParser <- function(url) {
  links <- read_html(url) %>%
    html_elements(".Directory-listLink") %>%
    html_attr("href")
  link <- paste(url, links, sep = "/")
  if (url == "https://restaurants.subway.com/index.html") {
    dead_link <- paste(url, sep = "")
    NA <- dead_link

  } else if (length(link) == 0) {

    link %>%
      html_element("#address") %>%
      html_text2()

  } else {
    c(lapply(link, addressParser), recursive = TRUE)
  }
}

addressParser(subway)

This code results in "https://restaurants.subway.com/austria/austria/bu" but what I want is "https://restaurants.subway.com/austria/bu"

2 Answers


  1. Look at the value of links in the second level of recursion, using Austria for example:

    links <- read_html("https://restaurants.subway.com/austria") %>%                                                   
      html_elements(".Directory-listLink") %>%   
      html_attr("href")
    links
    
    #> [1] "austria/bu"                  
    #> [2] "austria/ka"                  
    #> [3] "austria/ni"                  
    #> [4] "austria/ob"                  
    #> [5] "austria/sa"                  
    #> [6] "austria/st"                  
    #> [7] "austria/ti"                  
    #> [8] "austria/vi/wien"             
    #> [9] "austria/additional-locations"
    

    These values already include "austria", so when you paste them onto "https://restaurants.subway.com/austria" you get "austria" twice.

    You could fix this in a few ways; here is one that splits the url argument into separate root and path arguments (an alternative using xml2::url_absolute() is sketched after the code):

    addressParser <- function(root, path = "") {
      # combine root and path into a URL
      url <- paste(root, path, sep = "/")
      print(url)
      links <- read_html(url) %>%
        html_elements(".Directory-listLink") %>%
        html_attr("href")
      # paste the links onto root (not url) to avoid duplication
      link <- paste(root, links, sep = "/")
      if (url == "https://restaurants.subway.com/index.html") {
        # dead link: return NA instead of recursing further
        NA_character_

      } else if (length(link) == 0) {

        # leaf page with no directory links: extract the address text
        read_html(url) %>%
          html_element("#address") %>%
          html_text2()

      } else {
        # recursive call has two arguments now
        c(lapply(links, function(p) addressParser(root, path = p)), recursive = TRUE)
      }
    }
    
    addressParser(subway)
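
    Alternatively, since the hrefs are relative, you could resolve them against the URL of the page they appear on with xml2::url_absolute() instead of pasting root and path together yourself. This is only a rough sketch of that idea (it keeps the same #address selector and drops the index.html check), not something I have run against every page of the site:

    addressParser2 <- function(url) {
      page <- read_html(url)
      # resolve relative hrefs against the current page URL
      links <- page %>%
        html_elements(".Directory-listLink") %>%
        html_attr("href") %>%
        xml2::url_absolute(url)
      if (length(links) == 0) {
        # leaf page with no directory links: extract the address text
        page %>%
          html_element("#address") %>%
          html_text2()
      } else {
        c(lapply(links, addressParser2), recursive = TRUE)
      }
    }

    addressParser2(subway)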
    
  2. If you can switch to rvest sessions, you don’t really have to bother with absolute and relative URLs, as session_jump_to() handles both.

    library(rvest)
    library(purrr)
    
    # rate-limited session_jump_to() with purrr::slowly()
    slowly_jump_to <- slowly(session_jump_to, rate_delay(.5))
    
    # use rvest session, create one if not provided
    addressParser <- function(rvest_session = NULL, url = "") {
      if (is.null(rvest_session)) {
        rvest_session <- session(url)
      } else {
        rvest_session <- slowly_jump_to(rvest_session, url)
      }
      
      message(sprintf("%d\t%s", rvest_session$response$status_code, rvest_session$response$url))
      
      # tryCatch to handle cases where the response was not OK (status code >= 400);
      # some urls (e.g. for Germany) include unicode characters and 
      # xml2::url_absolute() can't handle those unless percent-encoded with URLencode()
      links <- 
        tryCatch(
          rvest_session %>%                                                   
            html_elements(".Directory-listLink") %>%   
            html_attr("href") %>%
            URLencode(),
          error = \(e) NULL
        )
      
      if (length(links) == 0) { 
        tryCatch(
          rvest_session %>%
            html_element("address") %>%
            html_text2(),
          error = \(e) NULL
        )
      } else {
        c(lapply(links, \(url) addressParser(rvest_session, url)), recursive = TRUE)     
      }
    }
    
    res <- addressParser(url = "https://restaurants.subway.com/estonia")
    #> 200  https://restaurants.subway.com/estonia
    #> 200  https://restaurants.subway.com/estonia/tallinn/tartu-mtn-101
    #> 200  https://restaurants.subway.com/estonia/tartu/lounakeskus-ringtee-75
    res
    #> [1] "Tartu Mtn 101n10112 TallinnnEE"        
    #> [2] "Lounakeskus, Ringtee 75n51014 TartunEE"
    

    Memory usage did not explode. While that lapply() pattern is pretty neat, be aware that you’ll lose all results if something still crashes, so perhaps approach each country separately (a sketch of that follows).
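
    For example, here is a minimal sketch of that per-country approach, wrapping the parser in purrr::possibly() so that one failing country returns NULL instead of discarding everything collected so far. The country slugs below are just assumptions for illustration:

    # hypothetical country slugs; use whatever the top-level directory page actually lists
    countries <- c("estonia", "austria")

    # return NULL on error so one failure does not abort the whole run
    safe_parser <- possibly(
      \(slug) addressParser(url = paste0("https://restaurants.subway.com/", slug)),
      otherwise = NULL
    )

    res_by_country <- countries %>%
      set_names() %>%
      map(safe_parser)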
