I’m trying to download zip codes that are spread across different pages. I started by extracting a list of links, one for each municipality inside Mexico City.

library(httr)
library(XML)

# Grab the landing page and pull every link (href) out of it
url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
resource <- GET(url)
parse <- htmlParse(resource)
links <- as.character(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
print(links)

Then I’m trying to write a loop that visits each URL and grabs its table of zip codes, so that I can later bind the per-municipality matrices into one big data set:

scraper <- function(url) {
  html <- read_html(url)
  tabla <- html %>%
    html_elements("td , th") %>%
    html_text2()
  data <- matrix(ncol = 3, nrow = length(tabla))
  data <- data.frame(matrix(tabla, nrow = length(tabla), ncol = 3, byrow = TRUE)) %>%
    row_to_names(row_number = 1)
}

I will have "municipality", "locality", and "zp" columns, which is why the number of columns is 3, but I get the error "Error: x must be a string of length 1", and I also cannot bind all the matrices together.
Any ideas are greatly appreciated!
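
For context, the error is consistent with read_html() receiving the whole vector of links at once (it only accepts a single URL string), and the matrix call needs length(tabla) / 3 rows rather than length(tabla). A minimal sketch of that fix, assuming the links vector from the first snippet has been filtered down to the municipality pages:

library(rvest)
library(janitor)

# Sketch only: scrape one municipality page and return a 3-column data frame
scraper <- function(url) {
  tabla <- read_html(url) %>%        # one URL at a time
    html_elements("td , th") %>%
    html_text2()
  data.frame(matrix(tabla, ncol = 3, byrow = TRUE)) %>%  # nrow inferred as length(tabla) / 3
    row_to_names(row_number = 1)
}

# Loop over the individual links and bind the results, e.g.:
# tablas <- lapply(links, scraper)
# todas  <- do.call(rbind, tablas)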

2 Answers


  1. Here is a way to scrape the zip codes of Ciudad-de-Mexico.

    suppressPackageStartupMessages({
      library(rvest)
      library(magrittr)
    })
    
    # Read one municipality page and return its first (and only) HTML table
    scraper <- function(link) {
      link %>%
        read_html() %>%
        html_table() %>%
        `[[`(1)
    }
    
    url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
    
    page <- read_html(url)
    zip_codes_list <- page %>%
      html_elements("a") %>%
      html_attr("href") %>%                                    # all links on the landing page
      grep("mexico/Ciudad-de-Mexico/.+", ., value = TRUE) %>%  # keep only municipality pages
      lapply(scraper)                                          # one table per municipality
    

    Then rbind them all together.

    zip_codes <- do.call(rbind, zip_codes_list)
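
    The combined data frame can then be inspected or written out, for example (the file name is just an illustration):

    nrow(zip_codes)     # total number of rows scraped
    head(zip_codes)     # peek at the first few rows
    write.csv(zip_codes, "zip_codes_cdmx.csv", row.names = FALSE)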
    

    Edit

    In the original post I had loaded the package dplyr. On second thought I realized it was only loaded to make the magrittr pipe operator available, so I changed the code to load only the relevant package, magrittr.
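
    If attaching a package just for the pipe feels unnecessary, a sketch using the base R native pipe (available from R 4.1) avoids the dependency altogether:

    scraper <- function(link) {
      tables <- link |> read_html() |> html_table()  # native pipe instead of %>%
      tables[[1]]                                    # first (and only) table on the page
    }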

  2. library(tidyverse)
    library(rvest)
    
    "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%  
      read_html() %>%  
      html_elements(".ctrLink a") %>%  
      html_attr("href") %>% # Grab all links from the main URL 
      map_dfr(~ .x %>% # Map through municipalities, scrape tables and row bind
                read_html() %>% 
                html_table())
    
    # Consider janitor::clean_names() at the end
    
    
    # A tibble: 2,014 × 3
       Municipio      Localidad                           `Código Postal`
       <chr>          <chr>                                         <int>
     1 Alvaro Obregon 1a Ampliación Presidentes                      1299
     2 Alvaro Obregon 1a Sección Cañada                              1269
     3 Alvaro Obregon 1a Victoria                                    1160
     4 Alvaro Obregon 2a Ampliación Presidentes                      1299
     5 Alvaro Obregon 2a Del Moral del Pueblo de Tetelpan            1700
     6 Alvaro Obregon 2a Sección Cañada                              1269
     7 Alvaro Obregon 2o Reacomodo Tlacuitlapa                       1650
     8 Alvaro Obregon 8 de Agosto                                    1180
     9 Alvaro Obregon Abeto                                          1440
    10 Alvaro Obregon Abraham M. González                            1170
    # … with 2,004 more rows
    # ℹ Use `print(n = ...)` to see more rows
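
    As the comment above suggests, janitor::clean_names() can be chained on at the end to tidy the column names; a minimal sketch, assuming the janitor package is installed:

    library(tidyverse)
    library(rvest)
    library(janitor)

    zip_codes <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%
      read_html() %>%
      html_elements(".ctrLink a") %>%
      html_attr("href") %>%
      map_dfr(~ .x %>% read_html() %>% html_table()) %>%
      clean_names()   # e.g. `Código Postal` becomes codigo_postal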
    
    