I am processing XHTML documents into TEI. The XHTML docs were originally old .doc files. The people who created the .doc files used <> characters to indicate where a letter, word, or words were added to the original document. Here is a snippet of the document.

<p>
&lt;X&gt; presented &lt;<strike>it</strike>&gt;
to others &lt;X&gt; in the form of a babe, [ERASED] X [END ERASE] &lt;at
first — and according to their belief her&gt; <strike>her
mental conception and</strike>
         <strike>they called her</strike>
spiritual conception &lt;of man &lt;brought forth&gt; a material man
&lt;and was&gt; <strike>but</strike>
a&gt; <strike>matter, flesh and
bones</strike> &lt;miraculous
conception (<strike>and</strike>&lt;but&gt;
it was n<strike>either</strike>&lt;ot&gt;)
and&gt; and<sup>
            <a shape="rect" class="sdfootnoteanc" name="sdfootnote138anc"
               href="#sdfootnote138sym">
               <sup>138</sup>
            </a>
         </sup>
</p>

I’m using XSLT to process the XHTML, with this test:

<xsl:when test="matches($text, '&lt;.+&gt;', 's')">

This works as long as there aren’t any nested pairs. The text below is an example of nesting from the sample above.

&lt;of man &lt;brought forth&gt; a material man &lt;and was&gt; <strike>but</strike> a&gt;

The matches() test is no help here because the pattern cannot pair the markers correctly: matching a &lt; with the next &gt; pairs the first &lt; with the &gt; that actually closes the second &lt;, not the first. After that things go downhill.
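
To make the mispairing concrete, here is a minimal sketch in JavaScript (chosen only for illustration; the question itself uses XSLT). It assumes the markers appear as the literal entity text &lt; and &gt;, and it tries both a greedy and a lazy variant of the pattern on the nested sample. Neither variant knows which &gt; belongs to which &lt;.

```javascript
// The nested sample from the question, with the entities kept as literal text
// (the <strike> tags are left out to keep the string short).
const sample = "&lt;of man &lt;brought forth&gt; a material man &lt;and was&gt; but a&gt;";

// Greedy: runs from the first &lt; to the LAST &gt; in the input.
const greedy = sample.match(/&lt;.+&gt;/s)[0];

// Lazy: runs from the first &lt; to the FIRST &gt;, which closes the inner pair.
const lazy = sample.match(/&lt;.+?&gt;/s)[0];

console.log(greedy === sample); // true: the whole string is swallowed
console.log(lazy);              // &lt;of man &lt;brought forth&gt;
```

On a full paragraph the greedy form would run across unrelated insertions, and the lazy form cuts the outer insertion short, so some form of depth tracking or real parsing is needed.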

I was considering using sed and a regex to replace the &lt; &gt; pairs, but I don’t see how that wouldn’t fail for the same reason.

Looking for a good idea on how to tackle this.

thanks,

Scott

2 Answers


  1. The approach illustrated below uses text replacement to convert the <> angle brackets into opening and closing <ins> tags. It then parses the document and processes the <ins> elements in reverse document order, so that insertions nested inside other insertions are processed before the outer ones. This processing consists of text replacement on innerHTML, so that all children of an <ins> element can be treated together, whether they are text nodes or other elements.

    Which text replacement you use depends on what you want to achieve. In the example, insertions are converted into <span> elements with a border so that the effect of nesting can be seen.

    var p = document.querySelector("p");
    var doc = new DOMParser().parseFromString(p.innerHTML
      .replace(/&lt;/g, "<ins>")
      .replace(/&gt;/g, "</ins>"), "text/html");
    var nodes = [];
    for (var node, iter = doc.evaluate("//ins", doc.body, undefined, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
      (node = iter.iterateNext());)
      nodes.unshift(node);
    for (node of nodes)
      /* The next statement would destroy any nested <ins> nodes,
         but we have already processed them before. */
      node.outerHTML = '<span>' + node.innerHTML + '</span>';
    p.innerHTML = doc.body.innerHTML;

    span {
      display: inline-block;
      border: 1px solid black;
      margin: 1px;
    }

    <p>
      &lt;X&gt; presented &lt;<strike>it</strike>&gt; to others &lt;X&gt; in the form of a babe, [ERASED] X [END ERASE] &lt;at first &#8212; and according to their belief her&gt; <strike>her
    mental conception and</strike>
      <strike>they called her</strike> spiritual conception &lt;of man &lt;brought forth&gt; a material man &lt;and was&gt; <strike>but</strike> a&gt; <strike>matter, flesh and
    bones</strike> &lt;miraculous conception (<strike>and</strike>&lt;but&gt; it was n<strike>either</strike>&lt;ot&gt;) and&gt; and<sup>
                <a shape="rect" class="sdfootnoteanc" name="sdfootnote138anc"
                   href="#sdfootnote138sym">
                   <sup>138</sup>
      </a>
      </sup>
    </p>

    The processing above uses JavaScript's document.evaluate method to collect the <ins> elements.
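
    The replacement step only produces well-formed markup if every &lt; marker has a matching &gt;. The following is a minimal sketch of a check for that, not part of the snippet above: `markInsertions` is a hypothetical helper name, and it assumes the markers always appear as the literal entities &lt; and &gt;. It performs the same replacement while tracking nesting depth, so a stray marker is reported before the result is parsed.

    ```javascript
    // Hypothetical helper: replaces &lt;/&gt; markers with <ins>/</ins> and
    // tracks nesting depth so that unbalanced markers are reported early.
    function markInsertions(html) {
      let depth = 0;
      let maxDepth = 0;
      const result = html.replace(/&lt;|&gt;/g, (marker, offset) => {
        if (marker === "&lt;") {
          depth += 1;
          maxDepth = Math.max(maxDepth, depth);
          return "<ins>";
        }
        depth -= 1;
        if (depth < 0) {
          throw new Error("Closing marker without an opening marker at offset " + offset);
        }
        return "</ins>";
      });
      if (depth !== 0) {
        throw new Error(depth + " opening marker(s) were never closed");
      }
      return { html: result, maxDepth: maxDepth };
    }

    const nested = "&lt;of man &lt;brought forth&gt; a material man &lt;and was&gt; <strike>but</strike> a&gt;";
    console.log(markInsertions(nested).html);
    ```

    The depth count does not detect an insertion that starts inside one element and ends inside another; such cases would still need to be fixed by hand before parsing.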

    A solution with only an XSLT processor and text manipulation could look as follows:

    1. In the input file, replace &lt; with <ins> and &gt; with </ins> using a tool like sed.
    2. Transform the input file (with these replacements) using the following XSL transformation. Again, the template with match="ins" must be adapted according to what you want to achieve.
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
      <xsl:template match="ins">
        <span style="display: inline-block; border: 1px solid black; margin: 1px;">
          <xsl:apply-templates/>
        </span>
      </xsl:template>
    </xsl:stylesheet>
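
    If sed is not available, step 1 can be done with any tool that performs a global text replacement. As a rough sketch under that assumption, the Node.js function below reads a file, replaces the markers, and writes the result to a second file. The function name and the file names in the comment are hypothetical.

    ```javascript
    const fs = require("fs");

    // Hypothetical helper: rewrites the marker entities in one file and saves
    // the result under a new name, leaving the original file untouched.
    function replaceMarkersInFile(inputPath, outputPath) {
      const source = fs.readFileSync(inputPath, "utf8");
      const replaced = source
        .replace(/&lt;/g, "<ins>")
        .replace(/&gt;/g, "</ins>");
      fs.writeFileSync(outputPath, replaced, "utf8");
      return replaced;
    }

    // Example call with hypothetical file names:
    // replaceMarkersInFile("sample1.html", "sample1-ins.html");
    ```

    The output file is then the input for the XSL transformation in step 2.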
    
  2. Just as a note, in addition to Heiko’s answer: if you use a current version of Saxon that supports XSLT 3, you can do everything in one stylesheet. Start with the named template xsl:initial-template (the -it command-line option, with no -s source option), read the file with unparsed-text, do the regular-expression string replacement, then parse the markup with parse-xml or parse-xml-fragment, as needed, and push it through templates for further processing:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      version="3.0"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="#all">
      
      <xsl:param name="input-html" as="xs:string">sample1.html</xsl:param>
      
      <xsl:param name="regex1" as="xs:string">&amp;lt;</xsl:param>
      <xsl:param name="regex2" as="xs:string">&amp;gt;</xsl:param>
      
      <xsl:param name="element-name" as="xs:string" static="yes" select="'ins'"/>
      
      <xsl:param name="start-tag" as="xs:string" select="'&lt;' || $element-name || '>'"/>
      <xsl:param name="end-tag" as="xs:string" select="'&lt;/' || $element-name || '>'"/>
      
      <xsl:variable name="replaced-html"
        select="$input-html
                => unparsed-text()
                => replace($regex1, $start-tag)
                => replace($regex2, $end-tag)
                => parse-xml-fragment()"/>
    
      <xsl:mode on-no-match="shallow-copy"/>
    
      <xsl:template name="xsl:initial-template">
        <xsl:apply-templates select="$replaced-html"/>
      </xsl:template>
      
      <xsl:template _match="{$element-name}">
        <span>
          <xsl:apply-templates/>
        </span>
      </xsl:template>
      
    </xsl:stylesheet>
    