I’ve been scraping a file, but the page now has a new URL. I changed the URL and the CSS selector, but my link object no longer holds a path, only "character (empty)". What seems to be the problem?
Site: https://arbetsformedlingen.se/statistik/statistik-om-varsel
I want to grab the file "Tillfällig statistik per län och bransch, januari-april 2023" in the ‘box’ Antal varsel och berörda personer.
R-code:
library(tidyverse)
library(stringr)
library(rio) #import-function
library(rvest) #read_html()-function
# Link to target site
url <- "https://arbetsformedlingen.se/statistik/statistik-om-varsel"
## Parse the HTML content
doc <- read_html(url)
## Find the data you want to scrape
### Select CSS locator
link <- html_elements(doc, css = '#cardContainer > app-downloads:nth-child(3) > div > div:nth-child(3) > div > digi-link-internal > digi-link > a') %>%
html_attr("href")
# Create URL for file download
url2 <- "https://arbetsformedlingen.se"
full_link <- sprintf("%s%s", url2, link)
# Get and save file locally
td = tempdir() # create a temporary directory
varsel_fil <- tempfile(tmpdir=td, fileext = ".xlsx")
download.file(full_link, destfile = varsel_fil, mode = "wb")
# Read file into a df
df_imported <- import(varsel_fil, which=1) # which: choose the sheet number
Previously, the css argument in the html_elements() function was
#svid12_142311c317a09e842af1a94 > div.sv-text-portlet-content > p:nth-child(20) > strong > a
so the beginning is quite different. I don’t understand what that implies, though.
Thanks for any assistance!
2 Answers
Thanks to @margusl, a short piece of code suffices to do what my original code did.
It selects the third list from the JSON response; it's a good idea to first store the response in an object and inspect it. The first suggestion (with more code) is a bit stricter and selects the list based on criteria.
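As an illustration of selecting by position, here is a minimal sketch. It assumes the response parses with jsonlite into a list whose entries carry `title` and `url` fields; those field names, the mock JSON, and the file paths are hypothetical stand-ins, so inspect the real response before relying on any of them.

```r
library(jsonlite)

# Hypothetical stand-in for the JSON response. The real field names and
# nesting are NOT confirmed; inspect the actual response first.
mock_json <- '[
  {"title": "File A", "url": "/download/file-a.xlsx"},
  {"title": "File B", "url": "/download/file-b.xlsx"},
  {"title": "Statistics, januari-april 2023", "url": "/download/file-c.xlsx"}
]'

# simplifyVector = FALSE keeps the response as a plain nested list
resp <- fromJSON(mock_json, simplifyVector = FALSE)

# Inspect the structure before deciding which element to take
str(resp, max.level = 2)

# Select the third list by position and build the full download link
third <- resp[[3]]
full_link <- paste0("https://arbetsformedlingen.se", third$url)
```

Selecting by position is short, but it silently picks the wrong file if the order of the response changes, which is why inspecting the object first matters.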
That page is now mostly rendered by JavaScript, and most of that content is not included in the page source; you can check this by disabling JS for the site in your browser. Since read_html() only sees the static source, the selector matches nothing and html_attr() returns an empty character vector. The list of files in the described box is pulled from
https://arbetsformedlingen.se/rest/analysportalen/va/sitevision
A quick way to find this API endpoint is through the network tab of the browser’s developer tools: after launching dev tools, refresh the page to capture all requests and search for some phrase that can’t be found in the source of the main page, e.g. "januari-april". Once the API endpoint with the file list is identified, we can extract the file URL and proceed with the download.
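The extraction step could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the helper `find_file_url()` is hypothetical, and so are the `title` and `url` field names and the mock entries; the real response from the endpoint may be shaped differently and should be inspected first. The network calls are left as comments so the sketch runs offline.

```r
library(purrr)
library(stringr)

base_url <- "https://arbetsformedlingen.se"

# Hypothetical helper: return the full URL of the first entry whose title
# contains `pattern`. Assumes each entry is a list with `title` and `url`
# fields (field names are an assumption, not confirmed by the API).
find_file_url <- function(files, pattern, base = base_url) {
  hit <- detect(files, function(f) str_detect(f$title, fixed(pattern)))
  if (is.null(hit)) stop("No file title matches: ", pattern)
  paste0(base, hit$url)
}

# Mock entries standing in for the relevant part of the parsed response
files <- list(
  list(title = "File A", url = "/download/file-a.xlsx"),
  list(title = "Statistics, januari-april 2023", url = "/download/file-b.xlsx")
)

full_link <- find_file_url(files, "januari-april")

# Against the live endpoint (not run here), something along these lines:
# resp <- jsonlite::fromJSON(
#   "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision",
#   simplifyVector = FALSE
# )
# ...locate the list of files inside `resp`, then:
# varsel_fil <- tempfile(fileext = ".xlsx")
# download.file(full_link, destfile = varsel_fil, mode = "wb")
```

Matching on a phrase in the title rather than on position means the code fails loudly when the expected file is missing, instead of downloading a different one.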
Created on 2023-05-20 with reprex v2.0.2
A more base-like approach would perhaps be:
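A minimal sketch of what such a variant might look like using only base R for the selection step. The mock entries and the `title` and `url` field names are hypothetical; they stand in for whatever the parsed response actually contains.

```r
base_url <- "https://arbetsformedlingen.se"

# Mock entries standing in for the parsed file list (hypothetical fields)
files <- list(
  list(title = "File A", url = "/download/file-a.xlsx"),
  list(title = "Statistics, januari-april 2023", url = "/download/file-b.xlsx")
)

# Pull titles and URLs out of the nested list as character vectors
titles <- vapply(files, function(f) f$title, character(1))
urls   <- vapply(files, function(f) f$url, character(1))

# Index of the entries whose title contains the phrase (literal match)
idx <- grep("januari-april", titles, fixed = TRUE)
stopifnot(length(idx) >= 1)

# Use the first match to build the full download link
full_link <- paste0(base_url, urls[idx[1]])

# Download step as in the question (not run here):
# varsel_fil <- tempfile(fileext = ".xlsx")
# download.file(full_link, destfile = varsel_fil, mode = "wb")
```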