Using R, I am trying to get the text (ideally, with some formatting) of a pdf embedded in html. The url, as an example, is
"https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"
Using pdf_text doesn’t work:
> pdf_text <- pdf_text("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")
Error in open.connection(con, "rb") :
cannot open the connection to 'https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf'
In addition: Warning message:
In open.connection(con, "rb") :
cannot open URL 'https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf': HTTP status was '403 Forbidden'
I’ve also tried using RSelenium to navigate to the page and glean anything from the html, with no luck:
> remDr$navigate("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")
> pageHTML <- remDr$getPageSource()[[1]]
> pageHTML
[1] "<html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(82, 86, 89);"><embed name="843DE9299AC47C3596F8B8E1296AD1FC" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="843DE9299AC47C3596F8B8E1296AD1FC"></body></html>"
If it’s not possible to just get the text, I’d be happy to download the pdf automatically and then pdf_text the file, but I have not been able to get RSelenium to do that.
2 Answers
To display a remote PDF in any viewer, it first has to be fetched from the server and decompressed before it can be rendered as pixels on screen. That is what the browser is doing here: it downloads the PDF and then paints it into the viewer window (the embed element you saw in the page source).
Exactly the same fetch can be reproduced outside the browser with curl or any other libcurl-based client, as long as the request looks browser-like.
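A minimal sketch of that step from R, using the curl package (libcurl underneath); the user-agent string and output filename are only examples and may or may not be enough on their own:

library(curl)
library(pdftools)

pdf_url <- "https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"

# assumption: a browser-like User-Agent is enough to avoid the 403
h <- new_handle()
handle_setheaders(h, `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

# download the raw PDF bytes to a local file, then read it as usual
curl_download(pdf_url, "handdown_list.pdf", handle = h)
txt <- pdf_text("handdown_list.pdf")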
Once you have that binary data locally, you can export the hyperlinks it contains with any suitable shell tool; Coherent PDF's cpdf, for example, has a simple JSON output format that you can then filter for links (for instance with the Windows find filter).
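A sketch of that idea, assuming cpdf (Coherent PDF tools) is installed, on the PATH, and recent enough to support -output-json; the crude pattern match stands in for the Windows find filter:

# assumption: cpdf is on the PATH and supports -output-json
system2("cpdf", c("-output-json", "handdown_list.pdf", "-o", "handdown_list.json"))

# stand-in for the 'find' filter: keep anything that looks like a link to another PDF
json_txt <- readLines("handdown_list.json", warn = FALSE)
links <- unique(unlist(regmatches(json_txt, gregexpr("https?://[^\"]+\\.pdf", json_txt))))
links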
We can then loop back and let curl download each of those linked PDFs in exactly the same way as the first one.
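Continuing the sketch (links and the handle h come from the snippets above):

# download each linked decision with the same browser-like headers as the first PDF
dir.create("decisions", showWarnings = FALSE)
for (u in links) {
  curl_download(u, file.path("decisions", basename(u)), handle = h)
}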
Alternatively, you can try something similar with PDF.js-based web scrapers: search the PDF internals and extract the separate PDF references as a list from the decoded document, but that is a Mozilla ability, not a Chrome one.
They seem to be using some Cloudflare services, probably enforcing anti-bot measures that can go well beyond simple user-agent checking. So be prepared for varying success rates depending on your network, your system, the load balancer you happen to connect to, the time of day, moon phase and what not.
If you are lucky, you might get away with just setting the HTTPUserAgent option:
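For example (a sketch; the user-agent string is arbitrary, and pdf_text()'s URL download goes through base R connections, which normally honour this option):

library(pdftools)

# assumption: any browser-like user-agent satisfies the server
options(HTTPUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

txt <- pdf_text("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")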
Or try to pass an extra Upgrade-Insecure-Requests header when creating the url() connection:
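One way to wire that up (a sketch; the body is read into a raw vector, since pdf_text() also accepts raw PDF data):

pdf_url <- "https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"

# assumption: the extra header is what the server wants to see
con <- url(pdf_url, method = "libcurl",
           headers = c("Upgrade-Insecure-Requests" = "1"))
open(con, "rb")
pdf_raw <- raw()
while (length(chunk <- readBin(con, "raw", 64 * 1024))) {
  pdf_raw <- c(pdf_raw, chunk)
}
close(con)

txt <- pdf_text(pdf_raw)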
Or do the same with httr2:
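Roughly like this (a sketch; the same header as above is assumed to be the relevant one):

library(httr2)
library(pdftools)

resp <- request("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf") |>
  req_headers("Upgrade-Insecure-Requests" = "1") |>
  req_perform()

# the response body is the raw PDF, which pdf_text() accepts directly
txt <- pdf_text(resp_body_raw(resp))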
Those are all libcurl-based and during my brief testing, all 3 succeeded and failed at some point. Good luck!