I have this input file, temp.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div id="ext-comp-1725" class="x-window FM-Msg-cls utility-window q-fileExplorer-window q-window show-header-line x-window-noborder x-window-plain x-resizable-pinned q-modal-window" style="position: absolute; z-index: 8020; visibility: visible; left: 188px; top: 62px; width: 900px; display: block;">
<div class="x-window-tl"><div class="x-window-tr"><div class="x-window-tc"><div class="x-window-header x-window-header-noborder x-unselectable x-window-draggable" id="ext-gen1530" style="user-select: none;">
<div class="x-tool-ct x-tool x-tool-bg" id="ext-gen1536"><div class="x-tool x-tool-icon x-tool-close"> </div></div>
<span class="x-window-header-text" id="ext-gen1541">Hello</span>
</div></div></div></div>
</body></html>

I was hoping I could pretty-print and indent tags hierarchically by using xmlstarlet:

$ xmlstarlet fo --html --recover --indent-spaces 2 --omit-decl temp.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
<div id="ext-comp-1725" class="x-window FM-Msg-cls utility-window q-fileExplorer-window q-window show-header-line x-window-noborder x-window-plain x-resizable-pinned q-modal-window" style="position: absolute; z-index: 8020; visibility: visible; left: 188px; top: 62px; width: 900px; display: block;">
<div class="x-window-tl"><div class="x-window-tr"><div class="x-window-tc"><div class="x-window-header x-window-header-noborder x-unselectable x-window-draggable" id="ext-gen1530" style="user-select: none;">
<div class="x-tool-ct x-tool x-tool-bg" id="ext-gen1536"><div class="x-tool x-tool-icon x-tool-close"> </div></div>
<span class="x-window-header-text" id="ext-gen1541">Hello</span>
</div></div></div></div>
</div></body>
</html>

However, as the output above shows, it only indents some tags (e.g. it split <html><body> and indented those properly) but fails on others (e.g. it kept </div></div></div></div> on a single line).

Is it possible to configure xmlstarlet to split off and indent all tags, one tag per line, with proper indentation?

$ xmlstarlet --version
srcinfo-cache
compiled against libxml2 2.9.10, linked with 21209
compiled against libxslt 1.1.34, linked with 10142

2 Answers


  1. Chosen as BEST ANSWER

    It seems tidy works here (found via the question "A command-line HTML pretty-printer: Making messy HTML readable"):

    $ tidy --version
    HTML Tidy for Windows version 5.8.0
    
    $ tidy -indent -wrap 160 -ashtml -utf8 temp.html
    line 3 column 1 - Warning: missing </div>
    line 2 column 7 - Warning: inserting missing 'title' element
    Info: Doctype given is "-//W3C//DTD HTML 4.0 Transitional//EN"
    Info: Document content looks like HTML 4.01 Strict
    Tidy found 2 warnings and 0 errors!
    
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head>
      <meta name="generator" content="HTML Tidy for HTML5 for Windows version 5.8.0">
      <title></title>
    </head>
    <body>
      <div id="ext-comp-1725" class=
      "x-window FM-Msg-cls utility-window q-fileExplorer-window q-window show-header-line x-window-noborder x-window-plain x-resizable-pinned q-modal-window"
      style="position: absolute; z-index: 8020; visibility: visible; left: 188px; top: 62px; width: 900px; display: block;">
        <div class="x-window-tl">
          <div class="x-window-tr">
            <div class="x-window-tc">
              <div class="x-window-header x-window-header-noborder x-unselectable x-window-draggable" id="ext-gen1530" style="user-select: none;">
                <div class="x-tool-ct x-tool x-tool-bg" id="ext-gen1536">
                  <div class="x-tool x-tool-icon x-tool-close">
                    &nbsp;
                  </div>
                </div><span class="x-window-header-text" id="ext-gen1541">Hello</span>
              </div>
            </div>
          </div>
        </div>
      </div>
    </body>
    </html>
    
    About HTML Tidy: https://github.com/htacg/tidy-html5
    Bug reports and comments: https://github.com/htacg/tidy-html5/issues
    Official mailing list: https://lists.w3.org/Archives/Public/public-htacg/
    Latest HTML specification: http://dev.w3.org/html5/spec-author-view/
    Validate your HTML documents: http://validator.w3.org/nu/
    Lobby your company to join the W3C: http://www.w3.org/Consortium
    
    Do you speak a language other than English, or a different variant of
    English? Consider helping us to localize HTML Tidy. For details please see
    https://github.com/htacg/tidy-html5/blob/master/README/LOCALIZE.md
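
The "missing </div>" warning in the tidy output above points at the root cause: the input opens one more <div> than it closes. As a rough way to confirm that without tidy, here is a minimal Python sketch built on the standard library's html.parser. The names `UnclosedTagFinder` and `find_unclosed` are made up for this illustration, and this is not how tidy itself detects the problem. It also ignores HTML's optional end tags (such as </p> or </li>), so it can report false positives on such markup.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: tracks open elements and records unclosed ones."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, line) for elements that are currently open
        self.unclosed = []  # (tag, line) for elements skipped by a later end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        # Walk back to the nearest matching open tag; everything opened
        # after it was never closed.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                self.unclosed.extend(self.stack[i + 1:])
                del self.stack[i:]
                return
        # No matching open tag: a stray end tag, ignored here.


def find_unclosed(html):
    parser = UnclosedTagFinder()
    parser.feed(html)
    parser.close()
    # Elements still on the stack at end of input were never closed either.
    return parser.unclosed + parser.stack


# Shortened stand-in for temp.html: the outer <div> is never closed.
sample = ('<html><body><div id="outer"><div class="inner">'
          '<span>Hello</span></div></body></html>')
for tag, line in find_unclosed(sample):
    print(f"unclosed <{tag}> opened on line {line}")
```

Running it on the real temp.html (read the file and pass its contents to `find_unclosed`) should flag a single unclosed <div>, assuming the file is as shown in the question.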
    

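  2. First convert the input file to well-formed XML (the input is missing a closing </div>, which the --html --recover pass adds), then format the result in a second pass.
    By default, format indents with 2 spaces.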

    xmlstarlet -q format --html --recover --omit-decl temp.html |
    xmlstarlet format --omit-decl
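
The second pass works because its input is already well-formed XML, so a generic XML indenter can put one element per line. The same idea can be sketched in Python with the standard library (`xml.etree.ElementTree.indent` requires Python 3.9 or later). This is only a stand-in to illustrate the second step, not what xmlstarlet does internally. The helper name `indent_xml` and the shortened sample markup are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def indent_xml(markup: str, spaces: int = 2) -> str:
    """Re-indent well-formed XML, one element per line (hypothetical helper)."""
    root = ET.fromstring(markup)        # raises ParseError if not well-formed
    ET.indent(root, space=" " * spaces)
    return ET.tostring(root, encoding="unicode")


# Shortened, already-repaired stand-in for the output of the first pass.
sample = ('<html><body><div id="a"><div class="b">'
          '<span>Hello</span></div></div></body></html>')
print(indent_xml(sample))
```

Like the xmlstarlet pipeline, this only succeeds once the missing </div> has been repaired; fed the original temp.html, `ET.fromstring` would raise a parse error instead.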
    