Long time no see, R…. Well, not really, but I’ve been heads-down working on some larger projects lately and didn’t have the time to blog. One of the projects should be published soon (plenty of hill-shading, maps & data viz), one is still in the making (something JS-intense with Google Apps Script & clasp), but one - which is my biggest gig so far - is online now with all features (incl. my first-ever scrollytelling implementation) and will serve as my use case in this post.

2 Theory: Crawl - Extract - Archive

In theory, the approach is a simple 3-step process:

  1. crawl all (or a subset of) pages of a single website / web project (let’s focus on HTML content first; JS requires more work, but solutions for scraping dynamic / JS-rendered content do exist)
  2. extract all (or a subset of) external / outbound URLs (http/https)
  3. archive each external URL with a POST request to the Wayback Machine API

Fortunately, #Rstats-Twitter is always there to help out, and Peter Meissner pointed me to Salim Khalil’s Rcrawler package (cf. Khalil/Fakir 2017).

3 “Minimum Viable Demo” - fetchURLs(target) & archiveURLs(URLs)

Rcrawler’s syntax might seem a bit unorthodox, but the package is really feature-rich and we can actually solve steps 1 & 2 with it (and purrr). Having crawled all external links, all we have to do is figure out how to send a POST request to the Wayback API; then we can work on refinement and parallelization.

3.1 Setup

library(tidyverse)
library(Rcrawler)

# use all CPU cores (~threads) minus 1 to parallelize scraping
cores <- parallel::detectCores() - 1
print(paste0("CPU threads available: ", cores))
## [1] "CPU threads available: 7"

3.2 fetchURLs(url) - Function to crawl & scrape a single Website

Rcrawler() implicitly returns an INDEX object (a data frame) to the global environment. INDEX$Url contains the URLs of all crawled pages.
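To see this side effect in action, here is a minimal sketch (the crawl target is just a placeholder; a real crawl takes a while):

Rcrawler(Website = "https://www.example.com", no_cores = 2, no_conn = 2,
         saveOnDisk = FALSE)
# Rcrawler() works via side effects: after the crawl,
# INDEX is available in the global environment
head(INDEX$Url) # the URLs of the crawled pages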

fetchURLs <- function(url, ignoreStrings = NULL) {
  
  # Crawl our target website's pages
  Rcrawler(
    Website = url,
    no_cores = cores,
    no_conn = 8, # don't abuse this ;)
    saveOnDisk = FALSE # we don't need the HTML files of our own website
  )
  
  # extract external URLs from each scraped page
  urls <- purrr::map(INDEX$Url, ~LinkExtractor(.x,
                                               ExternalLInks = TRUE, # sic - argument spelling as in Rcrawler
                                               removeAllparams = TRUE))
  
  #  only keep external links
  urls_df <- urls %>%
    rvest::pluck("ExternalLinks") %>% 
    map_df(tibble)
  
  # helper: optional vector of strings / regular expressions to filter out specific unwanted links
  if (!is.null(ignoreStrings)) {
    urls_df <- urls_df %>% 
      # collapse the patterns with "|" so that links matching any of them are dropped
      filter(!str_detect(`<chr>`, paste(ignoreStrings, collapse = "|")))
  }
  
  urls_df <- urls_df %>%
    select(`<chr>`) %>%  # the column which contains the links
    pull() %>% # we just need a single vector of URLs
    unique() %>% # keep only unique
    str_replace("http:", "https:") # make all links https
  
  return(urls_df)
}
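A call could then look like this (the target URL and the ignore patterns are placeholders):

my_urls <- fetchURLs("https://www.example.com",
                     ignoreStrings = c("twitter\\.com/intent", "facebook\\.com/sharer"))
head(my_urls)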

3.3 archiveURL() - Function to archive a single URL with a POST request

TODO: Check first with the internetarchive package whether a link has already been archived and, if so, only re-archive it if the existing snapshot is more than n days old.

TODO: return 2 items: originalURL, archivedURL

archiveURL <- function(target_url) {
  
  # Community-built Wayback Machine API / endpoint
  endpoint <- "https://pragma.archivelab.org"
  
  # An alternative approach might be to use a quick & dirty GET request:
  # https://github.com/motherboardgithub/mass_archive/blob/master/mass_archive.py
  
  # the POST request to the API
  response <- httr::POST(url = endpoint,
                         body = list(url = target_url),
                         encode = "json")
  
  status_code <- response$status_code
  
  # Only archive if the POST request was successful
  # TODO: needs more robustness, i.e. some returned paths are not valid
  # Hypothesis: Not valid because URL was archived recently
  # TODO: add check whether URL has been archived already
  
  if (status_code == 200) {
    wayback_path <- httr::content(response)$wayback_id # returns path
    wayback_url <- paste0("https://web.archive.org", wayback_path)
    print(paste0("Success - Archived: ", target_url))
  } else {
    wayback_url <- NA_character_ # NA instead of NULL so purrr::map_chr() in archiveURLs() below doesn't fail
    print(paste0("Error Code: ", status_code, " - couldn't archive ", target_url))
  }
  
  return(wayback_url)
  
}
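Regarding the first TODO above: the Internet Archive also offers a public availability endpoint (https://archive.org/wayback/available) which returns the closest existing snapshot for a given URL. A rough sketch of such a check (untested; the JSON field names are assumptions based on that endpoint’s response format):

# sketch: look up the closest existing Wayback snapshot for a URL
isArchived <- function(target_url) {
  res <- httr::GET("https://archive.org/wayback/available",
                   query = list(url = target_url))
  closest <- httr::content(res)$archived_snapshots$closest
  if (is.null(closest)) {
    return(NULL) # no snapshot yet
  }
  closest # list with url, timestamp ("YYYYMMDDhhmmss") and status
}

Comparing the returned timestamp against Sys.Date() would then give the “only re-archive if older than n days” check.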

3.4 archiveURLs() - Wrapper function to archive a vector of URLs

archiveURLs <- function(urls) {
  result <- purrr::map_chr(urls, archiveURL) %>% 
    tibble(wayback_url = .)
  return(result)
}

3.5 Proof-of-Concept - Execution
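Putting the pieces together, a proof-of-concept run could look like this (a minimal sketch - the target URL is a placeholder for your own project, and both the crawl and the archival loop will take a while):

# steps 1 & 2: crawl the project and collect its external links
external_urls <- fetchURLs("https://www.example.com")

# step 3: send every external link to the Wayback Machine
archived_df <- archiveURLs(external_urls)
archived_df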

4 Questions/TBD

  • How to parallelize archival? (async requests don’t seem to be supported by httr, only by curl - see the sketch below)
  • How to obey quota limitations (with something like ~setTimeout(), i.e. Sys.sleep())?
  • Maybe use internetarchive to check first whether a URL has already been archived
  • Or use generic API GET requests to retrieve URL, status and ID: https://archive.readme.io/docs/website-snapshots
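For the parallelization question, the curl package’s multi interface could be a starting point. A rough sketch, not a tested solution (it reuses the pragma.archivelab.org endpoint from archiveURL() above and simply collects the raw responses in the done callbacks):

library(curl)
library(jsonlite)

archiveAsync <- function(urls) {
  results <- list()
  for (u in urls) {
    h <- new_handle(url = "https://pragma.archivelab.org")
    handle_setopt(h,
                  post = TRUE,
                  postfields = as.character(toJSON(list(url = u), auto_unbox = TRUE)))
    handle_setheaders(h, "Content-Type" = "application/json")
    # queue the request; the callbacks fire once the response arrives
    multi_add(h,
              done = function(res) {
                results[[length(results) + 1]] <<- rawToChar(res$content)
              },
              fail = function(msg) message("Request failed: ", msg))
  }
  multi_run() # performs all queued requests concurrently
  results
}

Rate limiting (the quota question above) would still have to be handled on top of this, e.g. by queueing the URLs in smaller batches.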

That’s it for today. Hope this is useful to some of you! If you have a proof-of-concept for making parallel async GET/POST requests, hit me up!