
I’m working on a project where I need to process HTML tables generated by the tinytable package. The HTML includes JavaScript that dynamically applies CSS styling to the table cells. My goal is to extract the modified plain HTML after all the styling has been applied.

This is an example table that I would like to process:

library(tinytable)
tt(mtcars[1:4, 1:4]) |> 
    style_tt(j = 1:2, background = "teal", color = "white") |>
    save_tt("example.html", overwrite = TRUE)

This code saves an example.html file in which the colors are applied by JavaScript when the page loads. I would like to convert it to plain HTML in which the styling classes are already attached to the cells.

I am very open to suggestions on alternatives. The one path I tried was to save the HTML to a temporary file, serve it with servr, then load the page headlessly with chromote and extract the rendered HTML. However, I keep running into timeout issues.

Again, I’m happy to try a different strategy if you can propose something more effective or direct.

Here’s what I tried so far:

library(servr)
library(chromote)
library(tinytable)

serve_and_strip <- function(filename) {
    fn <- file.path(tempdir(), "index.html")
    file.copy(filename, fn, overwrite = TRUE)
    srv <- servr::httd(tempdir())
    url <- file.path(srv$url, "index.html")
    b <- ChromoteSession$new()
    b$Page$navigate(url)
    tab <- b$Runtime$evaluate("document.querySelector('table').outerHTML")$result$value
    sty <- b$Runtime$evaluate("document.querySelector('style').outerHTML")$result$value
    out <- list(tab, sty)
    b$close()
    servr::daemon_stop(srv$daemon)
    return(out)
}

serve_and_strip("example.html")

Edit: If I just scrape the HTML file, the first cell shows up as <td>21.0</td>. However, if you load the page in Firefox or Chrome and right-click to "Inspect" the cell, you’ll see that it has become: <td class="tinytable_css_n9oxlmixvkthzx38wcrd">21.0</td>. This is because the browser ran the JavaScript functions, which added class information to the cell. What I want to retrieve is the HTML and CSS code of the page after the JS functions have run. This is why I suggested going through a headless browser.

The pre-transformation HTML code looks like this:

<!DOCTYPE html> 
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>tinytable_xg1s2bqenh3yuyr9x2mg</title>
    <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet">
    <style>
.table td.tinytable_css_u43qs0b5ucz8ik5jm341, .table th.tinytable_css_u43qs0b5ucz8ik5jm341 {    border-bottom: solid 0.1em #d3d8dc; }
.table td.tinytable_css_4rjvz3zmw0n1t4i0jlbe, .table th.tinytable_css_4rjvz3zmw0n1t4i0jlbe {    color: white; background-color: teal; }
    </style>
    <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
    <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
    <script>
    MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\(', '\)']]
      },
      svg: {
        fontCache: 'global'
      }
    };
    </script>
  </head>

  <body>
    <div class="container">
      <table class="table table-borderless" id="tinytable_xg1s2bqenh3yuyr9x2mg" style="width: auto; margin-left: auto; margin-right: auto;" data-quarto-disable-processing='true'>
        <thead>
        
              <tr>
                <th scope="col">mpg</th>
                <th scope="col">cyl</th>
                <th scope="col">disp</th>
                <th scope="col">hp</th>
              </tr>
        </thead>
        
        <tbody>
                <tr>
                  <td>21.0</td>
                  <td>6</td>
                  <td>160</td>
                  <td>110</td>
                </tr>
                <tr>
                  <td>21.0</td>
                  <td>6</td>
                  <td>160</td>
                  <td>110</td>
                </tr>
                <tr>
                  <td>22.8</td>
                  <td>4</td>
                  <td>108</td>
                  <td> 93</td>
                </tr>
                <tr>
                  <td>21.4</td>
                  <td>6</td>
                  <td>258</td>
                  <td>110</td>
                </tr>
        </tbody>
      </table>
    </div>

    <script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js"></script>
    <script>
      function styleCell_tinytable_pii6zht3qrjzjy0n9jwp(i, j, css_id) {
        var table = document.getElementById("tinytable_xg1s2bqenh3yuyr9x2mg");
        table.rows[i].cells[j].classList.add(css_id);
      }
      function insertSpanRow(i, colspan, content) {
        var table = document.getElementById('tinytable_xg1s2bqenh3yuyr9x2mg');
        var newRow = table.insertRow(i);
        var newCell = newRow.insertCell(0);
        newCell.setAttribute("colspan", colspan);
        // newCell.innerText = content;
        // this may be unsafe, but innerText does not interpret <br>
        newCell.innerHTML = content;
      }
      function spanCell_tinytable_pii6zht3qrjzjy0n9jwp(i, j, rowspan, colspan) {
        var table = document.getElementById("tinytable_xg1s2bqenh3yuyr9x2mg");
        const targetRow = table.rows[i];
        const targetCell = targetRow.cells[j];
        for (let r = 0; r < rowspan; r++) {
          // Only start deleting cells to the right for the first row (r == 0)
          if (r === 0) {
            // Delete cells to the right of the target cell in the first row
            for (let c = colspan - 1; c > 0; c--) {
              if (table.rows[i + r].cells[j + c]) {
                table.rows[i + r].deleteCell(j + c);
              }
            }
          }
          // For rows below the first, delete starting from the target column
          if (r > 0) {
            for (let c = colspan - 1; c >= 0; c--) {
              if (table.rows[i + r] && table.rows[i + r].cells[j]) {
                table.rows[i + r].deleteCell(j);
              }
            }
          }
        }
        // Set rowspan and colspan of the target cell
        targetCell.rowSpan = rowspan;
        targetCell.colSpan = colspan;
      }

window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 0, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 1, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 2, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 3, 'tinytable_css_u43qs0b5ucz8ik5jm341') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(0, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(1, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(1, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(2, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(2, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(3, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(3, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(4, 0, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
window.addEventListener('load', function () { styleCell_tinytable_pii6zht3qrjzjy0n9jwp(4, 1, 'tinytable_css_4rjvz3zmw0n1t4i0jlbe') })
    </script>

  </body>

</html>

2 Answers


  1. (Accepted answer) This solution seems much simpler and appears to work:

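    library(chromote)
    url <- "file:/home/username/example.html"
    b <- ChromoteSession$new()
    # Register for the load event before navigating, so the event cannot be missed
    p <- b$Page$loadEventFired(wait_ = FALSE)
    b$Page$navigate(url, wait_ = FALSE)
    # Block until the page's load event has fired
    b$wait_for(p)
    body <- b$Runtime$evaluate("document.querySelector('body').outerHTML")$result$value
    b$close()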
    

  2. Maybe I missed something about the output you want, but rvest::html_element() can extract the style and the table. Is this what you want?

    library(tinytable)
    library(rvest)
    
    tt(mtcars[1:4, 1:4]) |> 
      style_tt(j = 1:2, background = "teal", color = "white") |>
      save_tt("example.html", overwrite = TRUE)
    
    read_html("example.html") |> 
      html_element("style") |> 
      html_text() |> 
      cat()
    #> 
    #> .table td.tinytable_css_o5ybk95sz7sanrsb19vo, .table th.tinytable_css_o5ybk95sz7sanrsb19vo {    border-bottom: solid 0.1em #d3d8dc; }
    #> .table td.tinytable_css_1kde5wrgf8bxx69seu93, .table th.tinytable_css_1kde5wrgf8bxx69seu93 {    color: white; background-color: teal; }
    #> 
    
    read_html("example.html") |> 
      html_element("table")
    #> {html_node}
    #> <table class="table table-borderless" id="tinytable_q1wsr4r8dii7yxlfkvsa" style="width: auto; margin-left: auto; margin-right: auto;" data-quarto-disable-processing="true">
    #> [1] <thead><tr>\n<th scope="col">mpg</th>\r\n                <th scope="col"> ...
    #> [2] <tbody>\n<tr>\n<td>21.0</td>\r\n                  <td>6</td>\r\n          ...
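
Note that a static parse like this returns the cells as <td>21.0</td>, without the classes that the browser adds on load. If a headless browser is not an option, one possible alternative is to replay the styleCell calls directly on the parsed document. The following is only a minimal sketch: `apply_tinytable_classes()` is a hypothetical helper (not part of tinytable or rvest), and it assumes the calls in the saved file look like `styleCell_<id>(i, j, 'css_id')` as in the question, and that the row index `i` counts every `<tr>` in document order, as the JavaScript `table.rows` does.

```r
library(xml2)

# Hypothetical helper (not part of tinytable): replay the styleCell_*() calls
# found in a saved tinytable HTML document and return the table with the
# CSS classes attached. `html` is a file path or a literal HTML string.
apply_tinytable_classes <- function(html) {
  doc <- read_html(html)
  # All rows in document order; assumed to match the JavaScript `table.rows`
  rows <- xml_find_all(doc, "//table//tr")
  # Script text, where the calls look like: styleCell_<id>(i, j, 'css_id')
  js <- paste(xml_text(xml_find_all(doc, "//script")), collapse = "\n")
  pattern <- "styleCell_\\w+\\((\\d+),\\s*(\\d+),\\s*'([^']+)'\\)"
  calls <- regmatches(js, gregexpr(pattern, js, perl = TRUE))[[1]]
  for (call in calls) {
    m <- regmatches(call, regexec(pattern, call, perl = TRUE))[[1]]
    i <- as.integer(m[2]) + 1L  # JavaScript indices are 0-based
    j <- as.integer(m[3]) + 1L
    if (i > length(rows)) next
    cells <- xml_find_all(rows[[i]], "./th|./td")
    if (j > length(cells)) next
    cell <- cells[[j]]
    old <- xml_attr(cell, "class")
    # Append to any existing class rather than overwriting it
    xml_set_attr(cell, "class", if (is.na(old)) m[4] else paste(old, m[4]))
  }
  as.character(xml_find_first(doc, "//table"))
}

# Small inline document standing in for example.html
demo <- "<html><body><table id='t'>
<thead><tr><th>mpg</th><th>cyl</th></tr></thead>
<tbody><tr><td>21.0</td><td>6</td></tr></tbody>
</table>
<script>
function styleCell_t(i, j, css_id) {}
window.addEventListener('load', function () { styleCell_t(1, 0, 'css_a') })
</script></body></html>"

cat(apply_tinytable_classes(demo))
# For the file from the question: apply_tinytable_classes("example.html")
```

This sketch only handles class assignment. It does not replay the insertSpanRow or spanCell functions, so tables that use row or column spans would still need the headless-browser route. The matching CSS rules can be taken from the <style> element as shown above.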
    