
I am working with the R programming language.

Suppose I take the answer from this Mathematics Stack Exchange post (from u/Michael Hardy): https://math.stackexchange.com/a/62963/1296713

Using R and the hyperlink (i.e. https://math.stackexchange.com/a/62963/1296713 ), I want to take this answer (only the answer, not the whole post) and save it as a PDF document (with the LaTeX rendered) from R itself. A PNG image would also be fine.

The final output should contain only that answer, with the LaTeX rendered.

I tried the following code:

# https://cran.r-project.org/web/packages/webshot/readme/README.html
library(webshot)
webshot("https://math.stackexchange.com/a/62963/1296713", "r2.png")

This code ran, but the resulting image includes the whole webpage, not just the answer from u/Michael Hardy.

Can someone please show me how to do this correctly? Maybe this can be done more easily using rvest/httr webscraping?

Thanks!

Note: I also tried the following, based on CSS selectors:

library(webshot)

url <- "https://math.stackexchange.com/a/62963/1296713"
selector <- ".s-prose.js-post-body"

webshot(url, "r2.png", selector = selector)

2 Answers


  1. Using {chromote} to produce a PNG:

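    # Start a Chrome session via {chromote}; the timeout is in seconds (10 minutes)
    ses <- chromote::ChromoteSession$new()
    ses$default_timeout <- 10*60
    ses$Page$navigate(url = "https://math.stackexchange.com/a/62963")
    # Optional: open the session in a local browser window to inspect the page
    ses$view()
    
    # The answer's container element is identified by its id
    doc <- ses$DOM$getDocument()
    sel <- "#answer-62963"
    
    # Look up the matching node and its box model (position and size)
    nid <- ses$DOM$querySelectorAll(doc$root$nodeId, sel)$nodeIds
    box <- ses$DOM$getBoxModel(nid[[1]])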
    
    ses$screenshot(selector = sel,
                   cliprect = c(box$model$content[[1]],
                                box$model$content[[3]]+box$model$content[[6]],
                                box$model$width,
                                box$model$content[[1]]+box$model$content[[3]]))
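
As a possible alternative to clipping by hand, the answer's element id can be derived from the share link and passed as a `selector`. This is only a minimal sketch: it assumes the {webshot2} package (a Chrome-based relative of {webshot}, not used in the original post) is installed, and that the answer is wrapped in an element with id `answer-<id>`, as the `#answer-62963` selector above suggests. `answer_selector()` and `shoot_answer()` are hypothetical helper names.

```r
# Derive the CSS selector of an answer from a Stack Exchange share link
# such as https://math.stackexchange.com/a/62963/1296713.
# Assumption: the answer is wrapped in an element with id "answer-<id>".
answer_selector <- function(url) {
  m <- regmatches(url, regexec("/a/([0-9]+)", url))[[1]]
  if (length(m) < 2) stop("no answer id found in: ", url)
  paste0("#answer-", m[2])
}

# Screenshot only the answer element. Requires {webshot2} and a local
# Chrome/Chromium; `delay` (seconds) gives MathJax time to typeset the LaTeX.
shoot_answer <- function(url, file = "answer.png", delay = 2) {
  webshot2::webshot(url, file = file,
                    selector = answer_selector(url), delay = delay)
}

# Not run here (needs network access and Chrome):
# shoot_answer("https://math.stackexchange.com/a/62963/1296713")
```

The selector logic is kept in its own function so it can be checked without a browser; the `delay` value is a guess and may need to be increased on slow connections.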
    
    

  2. The simplest approach, though it has some limitations, is to run the browser in headless mode to produce a PDF or a PNG.

    One problem with headless mode is that supplying cookies or responding to a login prompt requires some JavaScript, which is why Puppeteer is the better option in those cases.

    Anyway, here is the pure command line for a PNG (note --hide-scrollbars); the top of the resulting image is easy to trim off afterwards.

    browser.exe --headless --screenshot="%cd%\output.png" --window-size="600,1000" --hide-scrollbars https://math.stackexchange.com/questions/62958/considering-brownian-bridge-as-conditioned-brownian-motion/62963#62963
    

    In this case I used Opera manually [*] (see the note below for why Opera suits a case like this), but I normally use Edge or Chrome headless. The alternative is to save the whole page as an editable PDF and then cut out the searchable content you need.

    After trimming to a single page, the text remains searchable, for example for "Brownian".

    browser.exe --headless --print-to-pdf="%cd%\output.pdf" --no-pdf-header-footer https://math.stackexchange.com/questions/62958/considering-brownian-bridge-as-conditioned-brownian-motion/62963#62963
    

    [*] Opera, uniquely, can in manual mode save a whole HTML page as a single PDF page without page breaks. This avoids having to deal with page divisions in the PDF, so you can edit or delete the unwanted content at will with a good editor or PDF program.
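
Since the question asks for a solution driven from R, the same command lines could be launched with `system2()`. The following is a rough sketch under stated assumptions, not a tested recipe: the flags are copied from the commands above (whether a given browser version accepts `--no-pdf-header-footer` is not verified here), `headless_args()` and `run_headless()` are hypothetical helper names, and `chromote::find_chrome()` is used only as one convenient way to locate a Chrome executable, assuming {chromote} is installed.

```r
# Build the argument vector for a headless browser run.
# The flags are the ones used in the command lines above.
headless_args <- function(url, out, type = c("pdf", "png")) {
  type <- match.arg(type)
  out <- file.path(getwd(), out)  # absolute path in the working directory
  if (type == "pdf") {
    c("--headless", paste0("--print-to-pdf=", out),
      "--no-pdf-header-footer", url)
  } else {
    c("--headless", paste0("--screenshot=", out),
      "--window-size=600,1000", "--hide-scrollbars", url)
  }
}

# Run the browser. `browser` is the path to a Chrome/Edge executable;
# the default is an assumption and is only evaluated when the function runs.
run_headless <- function(url, out, type = "pdf",
                         browser = chromote::find_chrome()) {
  # shQuote() protects characters such as "#" in the URL from the shell
  system2(browser, args = shQuote(headless_args(url, out, type)))
}

# Not run here (needs a browser and network access):
# run_headless("https://math.stackexchange.com/a/62963", "output.pdf")
```

As with the plain command line, this captures the whole page, so the result still has to be trimmed down to the answer afterwards.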
