Using R, I am trying to get the text (ideally, with some formatting) of a pdf embedded in html. The url, as an example, is
"https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"
Using pdf_text doesn’t work:
> pdf_text <- pdf_text("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")
Error in open.connection(con, "rb") :
cannot open the connection to 'https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf'
In addition: Warning message:
In open.connection(con, "rb") :
cannot open URL 'https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf': HTTP status was '403 Forbidden'
I’ve also tried using RSelenium to navigate to the page and glean anything from the html, with no luck:
> remDr$navigate("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")
> pageHTML <- remDr$getPageSource()[[1]]
> pageHTML
[1] "<html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(82, 86, 89);"><embed name="843DE9299AC47C3596F8B8E1296AD1FC" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="843DE9299AC47C3596F8B8E1296AD1FC"></body></html>"
If it’s not possible to just get the text, I’d be happy to download the pdf automatically and then pdf_text the file, but I have not been able to get RSelenium to do that.
2 Answers
To display a remote PDF in any viewer, it first has to be fetched from the server and decompressed before it can be rendered as pixels on screen. That is what the browser is doing here: it downloads the PDF and then paints it into the viewer window (the embed element you saw in the page source).
Exactly the same fetch can be reproduced outside the browser with curl or any other libcurl-based client, as long as the request looks browser-like.
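A minimal sketch of that step from R, using the curl package (libcurl underneath); the user-agent string and output filename are only examples and may or may not be enough on their own:

library(curl)
library(pdftools)

pdf_url <- "https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"

# assumption: a browser-like User-Agent is enough to avoid the 403
h <- new_handle()
handle_setheaders(h, `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

# download the raw PDF bytes to a local file, then read it as usual
curl_download(pdf_url, "handdown_list.pdf", handle = h)
txt <- pdf_text("handdown_list.pdf")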
Once you have that binary data locally, you can export the hyperlinks it contains with any suitable shell tool; Coherent PDF's cpdf, for example, has a simple JSON output format that you can then filter for links (for instance with the Windows find filter).
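A sketch of that idea, assuming cpdf (Coherent PDF tools) is installed, on the PATH, and recent enough to support -output-json; the crude pattern match stands in for the Windows find filter:

# assumption: cpdf is on the PATH and supports -output-json
system2("cpdf", c("-output-json", "handdown_list.pdf", "-o", "handdown_list.json"))

# stand-in for the 'find' filter: keep anything that looks like a link to another PDF
json_txt <- readLines("handdown_list.json", warn = FALSE)
links <- unique(unlist(regmatches(json_txt, gregexpr("https?://[^\"]+\\.pdf", json_txt))))
links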
We can then loop back and let curl download each of those linked PDFs in exactly the same way as the first one.
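Continuing the sketch (links and the handle h come from the snippets above):

# download each linked decision with the same browser-like headers as the first PDF
dir.create("decisions", showWarnings = FALSE)
for (u in links) {
  curl_download(u, file.path("decisions", basename(u)), handle = h)
}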
Alternatively, you can try something similar with PDF.js-based web scrapers: search the PDF internals and extract the separate PDF references as a list from the decoded document, but that is a Mozilla ability, not a Chrome one.
They seem to be using some Cloudflare services, probably enforcing anti-bot measures that can go well beyond simple user-agent checking. So be prepared for varying success rates depending on your network, your system, the load balancer you happen to connect to, the time of day, moon phase and what not.
If you are lucky, you might get away with just setting the HTTPUserAgent option:
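For example (a sketch; the user-agent string is arbitrary, and pdf_text()'s URL download goes through base R connections, which normally honour this option):

library(pdftools)

# assumption: any browser-like user-agent satisfies the server
options(HTTPUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

txt <- pdf_text("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")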
Or try to pass an extra Upgrade-Insecure-Requests header when creating the url() connection:
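One way to wire that up (a sketch; the body is read into a raw vector, since pdf_text() also accepts raw PDF data):

pdf_url <- "https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"

# assumption: the extra header is what the server wants to see
con <- url(pdf_url, method = "libcurl",
           headers = c("Upgrade-Insecure-Requests" = "1"))
open(con, "rb")
pdf_raw <- raw()
while (length(chunk <- readBin(con, "raw", 64 * 1024))) {
  pdf_raw <- c(pdf_raw, chunk)
}
close(con)

txt <- pdf_text(pdf_raw)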
Or do the same with httr2:
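Roughly like this (a sketch; the same header as above is assumed to be the relevant one):

library(httr2)
library(pdftools)

resp <- request("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf") |>
  req_headers("Upgrade-Insecure-Requests" = "1") |>
  req_perform()

# the response body is the raw PDF, which pdf_text() accepts directly
txt <- pdf_text(resp_body_raw(resp))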
Those are all libcurl-based and during my brief testing, all 3 succeeded and failed at some point. Good luck!