
I wrote a script to automatically download a species' picture from the general info box on its Wikipedia page. I have a data frame containing the Latin names of all the species; for each name I want to download the Wikipedia species picture and place it on a map.

Wikipedia link example:
https://en.wikipedia.org/wiki/Eurasian_eagle-owl

However, my script downloads the low-quality thumbnail version of the picture. How can I modify it so that it downloads the original, full-resolution file?

Data frame example:

> bird_names
  [1] "Prunella modularis"            "Myiopsitta monachus"          
  [3] "Pyrrhura perlata"              "Tyto alba"                    
  [5] "Panurus biarmicus"             "Merops apiaster"  

Script:

# Load rvest for HTML scraping (its xml2 dependency provides read_html();
# rvest also re-exports the %>% pipe used below)
library(rvest)

# Function to download and save an image from Wikipedia
download_wikipedia_image <- function(bird_name) {
  # Construct the Wikipedia URL for the bird species
  wikipedia_url <- paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", bird_name))
  
  # Read the HTML content of the Wikipedia page
  page <- read_html(wikipedia_url)
  
  # Extract all image URLs from the page
  image_urls <- page %>%
    html_nodes("table.infobox img") %>%
    html_attr("src")
  
  # Download and save the first image (if available)
  if (length(image_urls) > 0) {
    download.file(paste0("https:", image_urls[1]),
                  paste0("BIRDPHOTO/", gsub(" ", "_", bird_name), ".jpg"),
                  mode = "wb")  # binary mode so the image isn't corrupted on Windows
    cat("Downloaded photo for", bird_name, "\n")
  } else {
    cat("No photo found for", bird_name, "\n")
  }
}

# Create BIRDPHOTO directory if it doesn't exist
dir.create("BIRDPHOTO", showWarnings = FALSE)

# Loop through each bird name and download the corresponding image
for (bird_name in bird_names) {
  download_wikipedia_image(bird_name)
}

# Optional: Print a message when all downloads are complete
cat("All downloads completed.\n")

2 Answers


  1. That’s because you have to follow the low-quality photo’s link to its wiki File: page (e.g. https://en.wikipedia.org/wiki/File:Baardman_-_Panurus_biarmicus_(15147085070).jpg) and look for the Original file link there, like:

    bird_name <- "Panurus biarmicus"  
    
      # Construct the Wikipedia URL for the bird species
      wikipedia_url <- paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", bird_name))
      
      # Read the HTML content of the Wikipedia page
      page <- xml2::read_html(wikipedia_url)
      
      # Extract all image URLs from the page
      urls <- page |>
        rvest::html_nodes("table.infobox") |>
        rvest::html_elements(css = "a.mw-file-description")  |>
        rvest::html_attr("href")
    
      urls
    #> [1] "/wiki/File:Baardman_-_Panurus_biarmicus_(15147085070).jpg"
    #> [2] "/wiki/File:PanurusBiarmicusIUCN2019-3.png"
      
      image_url <- xml2::read_html(paste0("https://en.wikipedia.org", urls[[1]])) |>
        rvest::html_nodes("div.fullMedia") |>
        rvest::html_element(css = "a") |>
        rvest::html_attr("href")
        
      image_url
    #> [1] "//upload.wikimedia.org/wikipedia/commons/f/fb/Baardman_-_Panurus_biarmicus_%2815147085070%29.jpg"
    
    

    Created on 2023-12-11 with reprex v2.0.2
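Putting both steps back into the asker's original function, a sketch (assuming `rvest` and `xml2` are installed; the CSS selectors are the ones shown above and may need adjusting for pages whose infobox markup differs):

```r
# Pure helper: species name -> Wikipedia article URL
species_page_url <- function(bird_name) {
  paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", bird_name))
}

# Download the full-resolution infobox photo for one species
download_wikipedia_image <- function(bird_name) {
  page <- xml2::read_html(species_page_url(bird_name))

  # Step 1: the infobox thumbnail links to its File: description page
  file_pages <- page |>
    rvest::html_elements("table.infobox a.mw-file-description") |>
    rvest::html_attr("href")

  if (length(file_pages) == 0) {
    cat("No photo found for", bird_name, "\n")
    return(invisible(NULL))
  }

  # Step 2: on the File: page, the "Original file" link sits in div.fullMedia
  original <- xml2::read_html(paste0("https://en.wikipedia.org", file_pages[[1]])) |>
    rvest::html_element("div.fullMedia a") |>
    rvest::html_attr("href")

  # The href is protocol-relative ("//upload.wikimedia.org/..."), so add "https:";
  # mode = "wb" keeps the JPEG intact on Windows
  download.file(paste0("https:", original),
                paste0("BIRDPHOTO/", gsub(" ", "_", bird_name), ".jpg"),
                mode = "wb")
  cat("Downloaded photo for", bird_name, "\n")
}
```

The `rvest::`-qualified calls inside the function mean the file can be sourced without attaching any packages first.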

  2. Query the MediaWiki API for the original file URL instead:

      # Download and save the first image (if available)
      if (length(image_urls) > 0) {
        #extract the original file name (9th path element of the thumbnail URL)
        file_name <- strsplit(image_urls[1], "/")[[1]][9]
        #query API
        req <- httr::GET(glue::glue("https://en.wikipedia.org/w/api.php?action=query&titles=File:{file_name}&prop=imageinfo&iiprop=url&format=json"))
        cont <- httr::content(req)
        #extract url from query
        full_image_url <- cont$query$pages$`-1`$imageinfo[[1]]$url
        download.file(full_image_url, paste0("BIRDPHOTO/", gsub(" ", "_", bird_name), ".jpg"))
        cat("Downloaded photo for", bird_name, "\n")
      } else {
        cat("No photo found for", bird_name, "\n")
      }
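One caveat on the snippet above: the hard-coded `[9]` index only matches thumbnail URLs of the usual `.../thumb/<hash>/<file>/220px-<file>` shape, and the `` `-1` `` page id only appears when the file is hosted on Commons rather than on en.wikipedia itself. Two small base-R helpers (hypothetical names, a sketch of the same idea) avoid both assumptions:

```r
# Extract the original file name from a Wikimedia image URL, e.g.
# "//upload.wikimedia.org/wikipedia/commons/thumb/f/fb/Foo.jpg/220px-Foo.jpg".
# For thumbnails the original name is the directory component just above the
# sized copy; direct (non-thumb) URLs already end in the original name.
thumb_to_file_name <- function(image_url) {
  if (grepl("/thumb/", image_url, fixed = TRUE)) {
    basename(dirname(image_url))  # ".../Foo.jpg/220px-Foo.jpg" -> "Foo.jpg"
  } else {
    basename(image_url)           # already the original file name
  }
}

# Take the first page entry the API returns instead of hard-coding `-1`
first_imageinfo_url <- function(cont) {
  pages <- cont$query$pages
  pages[[1]]$imageinfo[[1]]$url
}
```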
    