Html - Troubleshooting recursive function dealing with RVEST

DawonKim
November 13, 2024
2 Answers

I am trying to do web scraping to acquire a list of international locations using this website: https://restaurants.subway.com

My code has a problem: the href stored in the "links" variable is repeatedly appended to my URL by my paste() call. Is there a way to have each href added only once, so that the URL reaches the address page for each individual location?

This is my current code:

library(dplyr)
library(rvest)

subway <- "https://restaurants.subway.com"

addressParser <- function(url) {                                   
  links <- read_html(url) %>%                                                   
    html_elements(".Directory-listLink") %>%   
    html_attr("href")
    link <- paste(url, links, sep = "/")     
  if (url == "https://restaurants.subway.com/index.html") {
    dead_link <- paste(url, sep = "")
    NA <- dead_link

  } else if (length(link) == 0) { 

    link %>%
      html_element("#address") %>%
      html_text2()

   } else {
     c(lapply(link, addressParser), recursive = TRUE)     
  }
 }

 addressParser(subway)

This code results in "https://restaurants.subway.com/austria/austria/bu" but what I want is "https://restaurants.subway.com/austria/bu"

Tags: html r recursion web-scraping

Answers

Look at the value of links in the second level of recursion, using Austria for example:

links <- read_html("https://restaurants.subway.com/austria") %>%                                                   
  html_elements(".Directory-listLink") %>%   
  html_attr("href")
links

#> [1] "austria/bu"                  
#> [2] "austria/ka"                  
#> [3] "austria/ni"                  
#> [4] "austria/ob"                  
#> [5] "austria/sa"                  
#> [6] "austria/st"                  
#> [7] "austria/ti"                  
#> [8] "austria/vi/wien"             
#> [9] "austria/additional-locations"

These values already include "austria", so when you paste them after "https://restaurants.subway.com/austria" you get double "austria".
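
The duplication can be reproduced without any scraping. A minimal sketch, using one of the href values shown above; the variable names base, page_url, href, doubled and wanted are illustrative only and do not appear in the question:

```r
base <- "https://restaurants.subway.com"

# URL of the page being parsed at the second level of recursion
page_url <- paste(base, "austria", sep = "/")

# href as returned by html_attr("href"); it already starts with "austria"
href <- "austria/bu"

# pasting onto the page URL repeats the country segment
doubled <- paste(page_url, href, sep = "/")
doubled
#> [1] "https://restaurants.subway.com/austria/austria/bu"

# pasting onto the site root gives the intended URL
wanted <- paste(base, href, sep = "/")
wanted
#> [1] "https://restaurants.subway.com/austria/bu"
```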

You could fix this in a few ways, but here’s one by splitting the url argument into a root and path argument:

addressParser <- function(root, path = "") {
  # combine root and path into a URL
  url <- paste(root, path, sep = "/")
  print(url)
  page <- read_html(url)
  links <- page %>%
    html_elements(".Directory-listLink") %>%
    html_attr("href")
  if (url == "https://restaurants.subway.com/index.html") {
    # dead link back to the index page: return NA
    NA_character_
  } else if (length(links) == 0) {
    # no directory links left, so this is a location page: extract its address
    page %>%
      html_element("#address") %>%
      html_text2()
  } else {
    # hrefs are relative to root, so pass each one as path (two arguments now)
    c(lapply(links, function(p) addressParser(root, path = p)), recursive = TRUE)
  }
}

addressParser(subway)
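
Another possible fix, sketched here as an assumption rather than as part of the answer above, is to resolve each href against the URL of the page it was found on with xml2::url_absolute(). That would let the function keep a single url argument. The names page_url, hrefs and resolved are illustrative, and the href values are taken from the Austria listing shown earlier:

```r
library(xml2)

# page the hrefs were scraped from
page_url <- "https://restaurants.subway.com/austria"

# relative hrefs as returned by html_attr("href")
hrefs <- c("austria/bu", "austria/vi/wien")

# resolve the relative hrefs against the page URL, the way a browser would
resolved <- url_absolute(hrefs, page_url)
resolved
```

Under standard relative-URL resolution the last path segment of page_url is replaced by the href, so "austria" should appear only once in each result. Inside the original function this would take the place of paste(url, links, sep = "/"). Hrefs containing non-ASCII characters would probably need percent-encoding first, for example with URLencode().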

If you can switch to rvest sessions, you don’t really have to bother with absolute and relative URLs, as session_jump_to() handles both.

library(rvest)
library(purrr)

# rate-limited session_jump_to() with purrr::slowly()
slowly_jump_to <- slowly(session_jump_to, rate_delay(.5))

# use rvest session, create one if not provided
addressParser <- function(rvest_session = NULL, url = "") {
  if (is.null(rvest_session)) {
    rvest_session <- session(url)
  } else {
    rvest_session <- slowly_jump_to(rvest_session, url)
  }
  
  message(sprintf("%d\t%s", rvest_session$response$status_code, rvest_session$response$url))
  
  # tryCatch to handle cases where response was not OK (status code >= 400);
  # some urls (e.g. for Germany) include unicode characters and 
  # xml2::url_absolute() can't handle those unless percent-encoded with URLencode()
  links <- 
    tryCatch(
      rvest_session %>%                                                   
        html_elements(".Directory-listLink") %>%   
        html_attr("href") %>%
        URLencode(),
      error = \(e) NULL
    )
  
  if (length(links) == 0) { 
    tryCatch(
      rvest_session %>%
        html_element("address") %>%
        html_text2(),
      error = \(e) NULL
    )
  } else {
    c(lapply(links, \(url) addressParser(rvest_session, url)), recursive = TRUE)
  }
}

res <- addressParser(url = "https://restaurants.subway.com/estonia")
#> 200  https://restaurants.subway.com/estonia
#> 200  https://restaurants.subway.com/estonia/tallinn/tartu-mtn-101
#> 200  https://restaurants.subway.com/estonia/tartu/lounakeskus-ringtee-75
res
#> [1] "Tartu Mtn 101\n10112 Tallinn\nEE"        
#> [2] "Lounakeskus, Ringtee 75\n51014 Tartu\nEE"

Memory usage did not explode. While that lapply() pattern is pretty neat, be aware that you’ll lose all results if something still crashes, so perhaps approach each country separately.
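
One way to approach each country separately is to loop over the countries and wrap every run in tryCatch(), so that a failure in one country leaves the results already collected intact. This is only a sketch: scrape_country() is a hypothetical stand-in that simulates a failure instead of touching the network, and the country names are placeholders.

```r
# hypothetical stand-in for addressParser(url = ...); it simulates a crash
# for one country so the sketch runs without network access
scrape_country <- function(country) {
  if (country == "germany") stop("simulated failure")
  paste("address in", country)
}

countries <- c("estonia", "germany", "austria")
results <- list()

for (country in countries) {
  # store each country's result on its own; a failed country is recorded as NA
  results[[country]] <- tryCatch(
    scrape_country(country),
    error = function(e) {
      message("failed: ", country, " (", conditionMessage(e), ")")
      NA_character_
    }
  )
}

str(results)
```

With the real scraper, the body of scrape_country() could be something like addressParser(url = paste("https://restaurants.subway.com", country, sep = "/")), and failed countries can be retried later by checking which entries are NA.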
