
I am working with the R programming language.

For the following website: https://covid.cdc.gov/covid-data-tracker/ – I am trying to get all versions of this page that are available on the Wayback Machine (along with the date and time). The final result should look something like this:

         date                                                                                links     time
1 Jan-01-2023 https://web.archive.org/web/20230101000547/https://covid.cdc.gov/covid-data-tracker/ 00:05:47
2 Jan-01-2023 https://web.archive.org/web/20230101000557/https://covid.cdc.gov/covid-data-tracker/ 00:05:57

Here is what I tried so far:

Step 1: First, I inspected the HTML source code (within the "Elements" tab of the browser's developer tools) and copied/pasted it into a notepad file:


Step 2: Then, I imported this into R and parsed the resulting html for the link structure:

file <- "cdc.txt"

text <- readLines(file)

html <- paste(text, collapse = "\n")

pattern1 <- '/web/\\d+/https://covid\\.cdc\\.gov/covid-data-tracker/[^"]+'

links <- regmatches(html, gregexpr(pattern1, html))[[1]]

But this is not working:

> links
character(0)

Can someone please show me if there is an easier way to do this?

Thanks!

Note:

  • I am trying to learn how to do this in general (i.e. for any website on the Wayback Machine – the Covid Data Tracker is just a placeholder example for this question)

  • I realize that there might be much more efficient ways to do this – I am open to learning about different approaches!

2 Answers


  1. This is really two questions. The html is generated client side rather than server side, which is why you cannot simply request the html from R to get what you need, and instead end up copying and pasting from Developer Tools. You can automate this step with RSelenium. The docs are extensive, so I won’t cover that in this answer.

    You should also use a parser like rvest to parse the html, rather than regular expressions. In this case, to get the output you want, that would look something like:

    library(rvest)
    
    url <- "wayback.html"
    page <- read_html(url)
    
    # Find correct links
    links <- page |>
        html_elements("a") |>
        html_attr("href") |>
        grep("/web/\\d.+/https://covid.cdc.gov/covid-data-tracker/$", x = _, value = TRUE)
    
    # Create dates
    dates <- as.Date(
        gsub("/web/(\\d{8}).+$", "\\1", links),
        format = "%Y%m%d"
    )
    
    # Prepend base URL
    links <- paste0("https://web.archive.org/", links)
    
    dat <- data.frame(dates, links)
    head(dat)
    
    #        dates                                                                                 links
    # 1 2020-08-24 https://web.archive.org//web/20200824224244/https://covid.cdc.gov/covid-data-tracker/
    # 2 2023-06-30 https://web.archive.org//web/20230630234650/https://covid.cdc.gov/covid-data-tracker/
    # 3 2023-06-29 https://web.archive.org//web/20230629011221/https://covid.cdc.gov/covid-data-tracker/
    # 4 2023-01-01       https://web.archive.org//web/20230101/https://covid.cdc.gov/covid-data-tracker/
    # 5 2023-01-02       https://web.archive.org//web/20230102/https://covid.cdc.gov/covid-data-tracker/
    # 6 2023-01-03       https://web.archive.org//web/20230103/https://covid.cdc.gov/covid-data-tracker/
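The question also asks for the capture time. For snapshots whose path carries a full 14-digit timestamp, the HH:MM:SS part can be pulled out with base R string handling; a small sketch on hypothetical links (in practice these come from the `html_attr("href")` step above):

```r
# Hypothetical snapshot paths; real ones come from the scraped hrefs
links <- c(
    "/web/20230101000547/https://covid.cdc.gov/covid-data-tracker/",
    "/web/20230101000557/https://covid.cdc.gov/covid-data-tracker/"
)

# Extract the 14-digit timestamp, then format its time-of-day portion
stamps <- gsub("^/web/(\\d{14}).*$", "\\1", links)
times  <- format(
    as.POSIXct(stamps, format = "%Y%m%d%H%M%S", tz = "UTC"),
    "%H:%M:%S"
)
times
# "00:05:47" "00:05:57"
```

Note that some captures (like the 8-digit /web/20230101/ paths in the output above) omit the time entirely; those would come through as NA here.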
    
  2. Archive.org provides the Wayback CDX API for looking up captures; it returns timestamps along with original URLs in tabular form or as JSON. Such queries can be made with read.table() alone, and links to specific captures can then be constructed from the timestamp and original columns plus the base URL.

    read.table("https://web.archive.org/cdx/search/cdx?url=covid.cdc.gov/covid-data-tracker/&limit=5", 
               col.names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
               colClasses = "character")
    #>                              urlkey      timestamp
    #> 1 gov,cdc,covid)/covid-data-tracker 20200824224244
    #> 2 gov,cdc,covid)/covid-data-tracker 20200825013347
    #> 3 gov,cdc,covid)/covid-data-tracker 20200825024622
    #> 4 gov,cdc,covid)/covid-data-tracker 20200825042657
    #> 5 gov,cdc,covid)/covid-data-tracker 20200825050018
    #>                                    original  mimetype statuscode
    #> 1 https://covid.cdc.gov/covid-data-tracker/ text/html        200
    #> 2 https://covid.cdc.gov/covid-data-tracker/ text/html        200
    #> 3 https://covid.cdc.gov/covid-data-tracker/ text/html        200
    #> 4 https://covid.cdc.gov/covid-data-tracker/ text/html        200
    #> 5 https://covid.cdc.gov/covid-data-tracker/ text/html        200
    #>                             digest length
    #> 1 APS6SXNXBXCJU3P4N23WH4XCVDVZQYAD   5342
    #> 2 XFEMFRGXIPWM4K5F6CBIYDSOFIGCUBQZ   5370
    #> 3 TVQKZHRM452CFX4RIORWGSMK5PG3PAPR   5343
    #> 4 XZDLPJ6EQIXEO4SUFQTFEX4S6SF7O4GT   5370
    #> 5 A4J63TFU7HMZQE5KFTSLBD6EFNZ4IBZ4   5373
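As described above, a link to a specific capture is just the base URL, the timestamp, and the original URL joined with slashes; a minimal sketch using the values from the first row of the output:

```r
# Values as returned in the CDX timestamp and original columns
timestamp <- "20200824224244"
original  <- "https://covid.cdc.gov/covid-data-tracker/"

# Join base URL, timestamp, and original URL into a snapshot link
link <- paste("https://web.archive.org/web", timestamp, original, sep = "/")
link
# "https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/"
```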
    

    To make it a bit more convenient to work with, we can customize the API request with httr / httr2, for example, and pass the response through a readr / dplyr / lubridate pipeline:

    library(dplyr)
    library(httr2)
    library(readr)
    
    archive_links <- request("https://web.archive.org/cdx/search/cdx") %>% 
      # set query parameters
      req_url_query(
        url      = "covid.cdc.gov/covid-data-tracker/",
        filter   = "statuscode:200", # include only successful captures where HTTP status code was 200
        collapse = "timestamp:8",    # limit to 1 capt. per day by comparing first 8 digits of timestamp: <20200824>224244
        limit    = 10,               # limit the number of returned values
        # output = "json"            # request json output, includes column names
      ) %>% 
      req_perform() %>%
      # pass http response string to read_table() for parsing
      resp_body_string() %>% 
      read_table(col_names = c("urlkey","timestamp","original","mimetype","statuscode","digest","length"),
                 col_types = cols_only(timestamp = "c",
                                       original  = "c",
                                       mimetype  = "c",
                                       length    = "i")) %>% 
      mutate(link = paste("https://web.archive.org/web", timestamp, original, sep = "/") %>% tibble::char(shorten = "front"),
             timestamp = lubridate::ymd_hms(timestamp)) %>% 
      select(timestamp, link, length)
    archive_links
    #> # A tibble: 10 × 3
    #>    timestamp           link                                               length
    #>    <dttm>              <char>                                              <int>
    #>  1 2020-08-24 22:42:44 …4224244/https://covid.cdc.gov/covid-data-tracker/   5342
    #>  2 2020-08-25 01:33:47 …5013347/https://covid.cdc.gov/covid-data-tracker/   5370
    #>  3 2020-08-26 02:37:09 …6023709/https://covid.cdc.gov/covid-data-tracker/   5371
    #>  4 2020-08-27 01:05:48 …7010548/https://covid.cdc.gov/covid-data-tracker/   5703
    #>  5 2020-08-28 02:23:26 …8022326/https://covid.cdc.gov/covid-data-tracker/  31177
    #>  6 2020-08-29 02:01:27 …9020127/https://covid.cdc.gov/covid-data-tracker/  31237
    #>  7 2020-08-30 00:06:31 …0000631/https://covid.cdc.gov/covid-data-tracker/  31218
    #>  8 2020-08-31 00:18:29 …1001829/https://covid.cdc.gov/covid-data-tracker/  31640
    #>  9 2020-09-01 02:30:30 …1023030/https://covid.cdc.gov/covid-data-tracker/  31257
    #> 10 2020-09-02 04:08:31 …2040831/https://covid.cdc.gov/covid-data-tracker/  31654
    
    # first capture:
    archive_links$link[1]
    #> <pillar_char<[1]>
    #> [1] https://web.archive.org/web/20200824224244/https://covid.cdc.gov/covid-data-tracker/
    

    Created on 2023-07-02 with reprex v2.0.2
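The CDX API also accepts from and to timestamp parameters to restrict the range of captures. A sketch of building such a query URL by hand with base R (the January 2023 range here is purely illustrative; with httr2 these would just be two more req_url_query() arguments):

```r
# Build a CDX query URL restricted to January 2023; from/to take
# timestamp prefixes (YYYYMMDD here)
cdx_url <- paste0(
    "https://web.archive.org/cdx/search/cdx",
    "?url=", utils::URLencode("covid.cdc.gov/covid-data-tracker/", reserved = TRUE),
    "&from=20230101", "&to=20230131",
    "&filter=statuscode:200"
)
cdx_url
```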

    There are also Archive.org client libraries for R, e.g. https://github.com/liserman/archiveRetriever and https://hrbrmstr.github.io/wayback/, though the query interface of the first is a bit odd, and the second is currently not available on CRAN.
