Html - Taking Screenshots from Stackoverflow using R

stats_noob
March 14, 2024
142 views
0 votes
2 Answers

I am working with the R programming language.

Suppose I take the answer from this Stack Exchange post (from u/Michael Hardy): https://math.stackexchange.com/a/62963/1296713

Using R and the hyperlink (i.e. https://math.stackexchange.com/a/62963/1296713 ), I want to take this answer (only the answer, not the whole post) and save it as a PDF document (with the LaTeX rendered) from R itself. A PNG image would also be fine.

The final output should look like this:

I tried the following code:

# https://cran.r-project.org/web/packages/webshot/readme/README.html
library(webshot)
webshot("https://math.stackexchange.com/a/62963/1296713", "r2.png")

This code ran, but the screenshot includes the whole webpage, not just the answer from u/Michael Hardy.

Can someone please show me how to do this correctly? Maybe this can be done more easily using rvest/httr webscraping?

Thanks!

Note:

I also tried the following based on CSS selectors:

library(webshot)

url <- "https://math.stackexchange.com/a/62963/1296713"
selector <- ".s-prose.js-post-body"

webshot(url, "r2.png", selector = selector)

Answers

Using {chromote} to save a PNG:

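# Start a headless Chrome session and raise the default timeout to 10 minutes
ses <- chromote::ChromoteSession$new()
ses$default_timeout <- 10*60
ses$Page$navigate(url = "https://math.stackexchange.com/a/62963")
# Optional: open a viewer on the session to check that the page has loaded
ses$view()

# Find the DOM node of the answer and read its box model
doc <- ses$DOM$getDocument()
sel <- "#answer-62963"

nid <- ses$DOM$querySelectorAll(doc$root$nodeId, sel)$nodeIds
box <- ses$DOM$getBoxModel(nid[[1]])

# Take the screenshot, passing a clip rectangle built from the box model
ses$screenshot(selector = sel,
               cliprect = c(box$model$content[[1]],
                            box$model$content[[3]]+box$model$content[[6]],
                            box$model$width,
                            box$model$content[[1]]+box$model$content[[3]]))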
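
If the clip rectangle is not needed, a shorter variant may be enough, since chromote's screenshot() accepts a CSS selector directly. The following is only a sketch under assumptions, not the code above: `answer_selector()` and `screenshot_answer()` are hypothetical helper names, the `#answer-<id>` pattern is inferred from the selector used above, and the 5-second pause is a guess at how long the page and its math take to render.

```r
# Hypothetical helper: derive the "#answer-<id>" selector from a
# Stack Exchange share link of the form .../a/<id> or .../a/<id>/<user>.
answer_selector <- function(url) {
  id <- sub(".*/a/([0-9]+).*", "\\1", url)
  paste0("#answer-", id)
}

# Hypothetical helper: screenshot only the element matched by the selector.
# Needs the chromote package and a local Chrome/Chromium installation.
screenshot_answer <- function(url, file = "answer.png", wait = 5) {
  ses <- chromote::ChromoteSession$new()
  on.exit(ses$close(), add = TRUE)
  ses$Page$navigate(url)
  Sys.sleep(wait)  # assumed delay so the page and its math can finish rendering
  ses$screenshot(filename = file, selector = answer_selector(url))
}
```

Calling `screenshot_answer("https://math.stackexchange.com/a/62963/1296713")` would then be expected to write `answer.png` in the working directory, provided the page loads within the pause.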

- KJ
- March 13, 2024 at 11:18 pm
- 0 votes
The simplest approach, though it has some limitations, is to run the browser in headless mode to produce a PDF or a PNG.

One problem with headless mode is that adding cookies or replying to a login prompt needs some JavaScript magic, which is why Puppeteer is the best option in those cases.

Anyway, here is the output from the pure command line; we can easily trim off the top.
```
browser.exe --headless --screenshot=%cd%\output.png --window-size="600,1000" --hide-scrollbars https://math.stackexchange.com/questions/62958/considering-brownian-bridge-as-conditioned-brownian-motion/62963#62963
```
In this case I used Opera manually [* see the comment below on why Opera in a case like this], but I normally use Edge or Chrome headless. This alternative pulls the whole set of pages as an editable PDF, from which you can cut and paste the searchable contents.

Here I am just searching for "Brownian" in the single page left after trimming.
```
browser.exe --headless --print-to-pdf=%cd%\output.pdf --no-pdf-header-footer https://math.stackexchange.com/questions/62958/considering-brownian-bridge-as-conditioned-brownian-motion/62963#62963
```
[*]
Opera, uniquely, can in manual mode save a whole HTML page as a single PDF page, without page breaks. This makes the PDF easier to work with, since you do not have to mess about with page divisions, and you can edit or delete the unwanted content at will with a good editor or PDF program.
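
Since the question asks for this to be done from R itself, the two commands above could probably be launched with base R's system2(). This is a rough sketch rather than a tested recipe: `headless_args()` and `capture_page()` are hypothetical helper names, and the default `browser = "chrome"` is an assumption that has to be replaced with the path to the Chrome, Edge, or other Chromium-based binary on your machine. The flags are the ones used in the commands above.

```r
# Hypothetical helper: build the arguments for a headless browser run.
# `type` is "pdf" or "png"; the flags mirror the commands shown above.
headless_args <- function(url, out_file, type = c("pdf", "png")) {
  type <- match.arg(type)
  out_flags <- if (type == "pdf") {
    c(paste0("--print-to-pdf=", out_file), "--no-pdf-header-footer")
  } else {
    c(paste0("--screenshot=", out_file),
      "--window-size=600,1000", "--hide-scrollbars")
  }
  c("--headless", out_flags, url)
}

# Hypothetical helper: run the browser and return its exit status invisibly.
# `browser` is an ASSUMED command name; adjust it to your installation.
capture_page <- function(url, out_file, type = "pdf", browser = "chrome") {
  out_path <- file.path(getwd(), out_file)  # write into the working directory
  args <- shQuote(headless_args(url, out_path, type))
  invisible(system2(browser, args = args))
}
```

For example, `capture_page("https://math.stackexchange.com/a/62963", "output.pdf")` would attempt to print the page to `output.pdf` in the current working directory. As with the plain command line, the result covers the whole page and still needs trimming.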