I’m trying to download zip codes that are spread across different pages. I started by extracting a list of links, one for each municipality inside Mexico City.

library(httr)
library(XML)

# Grab the landing page and pull every link (href) out of it
url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
resource <- GET(url)
parse <- htmlParse(resource)
links <- as.character(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
print(links)

Then I’m trying to write a loop that visits each URL and grabs its table of zip codes, so that I can later bind the per-municipality matrices into one big data set:

scraper <- function(url) {
  html <- read_html(url)
  tabla <- html %>%
    html_elements("td , th") %>%
    html_text2()
  data <- matrix(ncol = 3, nrow = length(tabla))
  data <- data.frame(matrix(tabla, nrow = length(tabla), ncol = 3, byrow = TRUE)) %>%
    row_to_names(row_number = 1)
}

I will have "municipality", "locality", and "zp" columns, which is why the number of columns is 3, but I get the error "Error: x must be a string of length 1", and I also cannot bind all the matrices together.
Any ideas are greatly appreciated!
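
For context, the error is consistent with read_html() receiving the whole vector of links at once (it only accepts a single URL string), and the matrix call needs length(tabla) / 3 rows rather than length(tabla). A minimal sketch of that fix, assuming the links vector from the first snippet has been filtered down to the municipality pages:

library(rvest)
library(janitor)

# Sketch only: scrape one municipality page and return a 3-column data frame
scraper <- function(url) {
  tabla <- read_html(url) %>%        # one URL at a time
    html_elements("td , th") %>%
    html_text2()
  data.frame(matrix(tabla, ncol = 3, byrow = TRUE)) %>%  # nrow inferred as length(tabla) / 3
    row_to_names(row_number = 1)
}

# Loop over the individual links and bind the results, e.g.:
# tablas <- lapply(links, scraper)
# todas  <- do.call(rbind, tablas)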

2 Answers


  1. Here is a way to scrape the zip codes of Ciudad-de-Mexico.

    suppressPackageStartupMessages({
      library(rvest)
      library(magrittr)
    })
    
    # Read one municipality page and return its first (and only) HTML table
    scraper <- function(link) {
      link %>%
        read_html() %>%
        html_table() %>%
        `[[`(1)
    }
    
    url <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/"
    
    page <- read_html(url)
    zip_codes_list <- page %>%
      html_elements("a") %>%
      html_attr("href") %>%                                    # all links on the landing page
      grep("mexico/Ciudad-de-Mexico/.+", ., value = TRUE) %>%  # keep only municipality pages
      lapply(scraper)                                          # one table per municipality
    

    Then rbind them all together.

    zip_codes <- do.call(rbind, zip_codes_list)
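
    The combined data frame can then be inspected or written out, for example (the file name is just an illustration):

    nrow(zip_codes)     # total number of rows scraped
    head(zip_codes)     # peek at the first few rows
    write.csv(zip_codes, "zip_codes_cdmx.csv", row.names = FALSE)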
    

    Edit

    In the original post I had loaded the package dplyr. On second thought I realized it was only loaded to make the magrittr pipe operator available, so I changed the code to load only the relevant package, magrittr.
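
    If attaching a package just for the pipe feels unnecessary, a sketch using the base R native pipe (available from R 4.1) avoids the dependency altogether:

    scraper <- function(link) {
      tables <- link |> read_html() |> html_table()  # native pipe instead of %>%
      tables[[1]]                                    # first (and only) table on the page
    }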

  2. library(tidyverse)
    library(rvest)
    
    "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%  
      read_html() %>%  
      html_elements(".ctrLink a") %>%  
      html_attr("href") %>% # Grab all links from the main URL 
      map_dfr(~ .x %>% # Map through municipalities, scrape tables and row bind
                read_html() %>% 
                html_table())
    
    # Consider janitor::clean_names() at the end
    
    
    # A tibble: 2,014 × 3
       Municipio      Localidad                           `Código Postal`
       <chr>          <chr>                                         <int>
     1 Alvaro Obregon 1a Ampliación Presidentes                      1299
     2 Alvaro Obregon 1a Sección Cañada                              1269
     3 Alvaro Obregon 1a Victoria                                    1160
     4 Alvaro Obregon 2a Ampliación Presidentes                      1299
     5 Alvaro Obregon 2a Del Moral del Pueblo de Tetelpan            1700
     6 Alvaro Obregon 2a Sección Cañada                              1269
     7 Alvaro Obregon 2o Reacomodo Tlacuitlapa                       1650
     8 Alvaro Obregon 8 de Agosto                                    1180
     9 Alvaro Obregon Abeto                                          1440
    10 Alvaro Obregon Abraham M. González                            1170
    # … with 2,004 more rows
    # ℹ Use `print(n = ...)` to see more rows
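
    As the comment above suggests, janitor::clean_names() can be chained on at the end to tidy the column names; a minimal sketch, assuming the janitor package is installed:

    library(tidyverse)
    library(rvest)
    library(janitor)

    zip_codes <- "https://www.codigopostal.lat/mexico/Ciudad-de-Mexico/" %>%
      read_html() %>%
      html_elements(".ctrLink a") %>%
      html_attr("href") %>%
      map_dfr(~ .x %>% read_html() %>% html_table()) %>%
      clean_names()   # e.g. `Código Postal` becomes codigo_postal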
    
    