I am trying to scrape a list of international locations from this website: https://restaurants.subway.com
My code has an issue where the href is repeatedly added to my URL through the "links" variable by my paste function. Is there a way to have each href added to the URL only once, so that it reaches the address page for each individual location?
This is my current code:
library(dplyr)
library(rvest)

subway <- "https://restaurants.subway.com"

addressParser <- function(url) {
  links <- read_html(url) %>%
    html_elements(".Directory-listLink") %>%
    html_attr("href")
  link <- paste(url, links, sep = "/")
  if (url == "https://restaurants.subway.com/index.html") {
    dead_link <- paste(url, sep = "")
    NA <- dead_link
  } else if (length(link) == 0) {
    link %>%
      html_element("#address") %>%
      html_text2()
  } else {
    c(lapply(link, addressParser), recursive = TRUE)
  }
}

addressParser(subway)
This code results in "https://restaurants.subway.com/austria/austria/bu" but what I want is "https://restaurants.subway.com/austria/bu"
2 Answers
Look at the value of links in the second level of recursion, using Austria as an example. These values already include "austria", so when you paste them after "https://restaurants.subway.com/austria" you get a double "austria".

You could fix this in a few ways, but here’s one: split the url argument into a root argument and a path argument.

If you can switch to rvest sessions, you don’t really have to bother with absolute and relative URLs, as session_jump_to() handles both.

Memory usage did not explode. While that lapply() pattern is pretty neat, be aware that you’ll lose all results if something still crashes, so perhaps approach each country separately.
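As a rough illustration of the root/path split, here is a minimal sketch, not tested against the live site. It assumes that every href on the directory pages is relative to the site root (as the "austria/bu" value in the question suggests) and that a page with no .Directory-listLink elements is a location page with an #address element. The join_url() helper and the default path = "" are made up for this example; the selectors are taken from the question.

```r
# Hypothetical helper: join the site root and a root-relative href
# without doubling slashes. An empty path returns the root itself.
join_url <- function(root, path) {
  root <- sub("/+$", "", root)  # drop trailing slashes from the root
  path <- sub("^/+", "", path)  # drop leading slashes from the href
  ifelse(nzchar(path), paste(root, path, sep = "/"), root)
}

# Sketch of the recursion: `root` never changes, only `path` does,
# so each href is pasted onto the root exactly once.
addressParser <- function(root, path = "") {
  page <- rvest::read_html(join_url(root, path))
  links <- rvest::html_attr(
    rvest::html_elements(page, ".Directory-listLink"),
    "href"
  )
  if (length(links) == 0) {
    # Assumption: no directory links means this is a location page.
    return(rvest::html_text2(rvest::html_element(page, "#address")))
  }
  # Assumption: hrefs are root-relative, so pass them on unchanged.
  unlist(lapply(links, function(p) addressParser(root, p)))
}

root <- "https://restaurants.subway.com"
join_url(root, "austria")
join_url(root, "austria/bu")
```

With this shape, a single country can be scraped on its own with a call such as addressParser(root, "austria"), and wrapping each such call in tryCatch() would keep one failing country from discarding the results already collected for the others.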