I am processing XHTML documents into TEI. The XHTML docs were originally old .doc files. The folks who created the .doc files used <> characters to indicate where a letter, word, or words were added to the original document. Here is a snippet of the document.
<p>
<X> presented <<strike>it</strike>>
to others <X> in the form of a babe, [ERASED] X [END ERASE] <at
first — and according to their belief her> <strike>her
mental conception and</strike>
<strike>they called her</strike>
spiritual conception <of man <brought forth> a material man
<and was> <strike>but</strike>
a> <strike>matter, flesh and
bones</strike> <miraculous
conception (<strike>and</strike><but>
it was n<strike>either</strike><ot>)
and> and<sup>
<a shape="rect" class="sdfootnoteanc" name="sdfootnote138anc"
href="#sdfootnote138sym">
<sup>138</sup>
</a>
</sup>
</p>
I’m using XSLT to process the XHTML, with the matches() function:
<xsl:when test="matches($text, '&lt;.+&gt;', 's')">
This works as long as there aren’t any nested pairs. The text below is an example of nesting from the sample above.
<of man <brought forth> a material man <and was> <strike>but</strike> a>
The matches() function fails because, for the first <, it grabs the first > it sees, which is the match for the second <, not the first. After that, things go downhill.
I was considering using sed and a regex to replace the < > pairs, but I don’t see how that wouldn’t fail for the same reason.
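A small JavaScript sketch of the problem described above, using a made-up sample string rather than the real document. It assumes the brackets are plain characters in a string; `pairBrackets` is a hypothetical helper, not part of any library. Neither a greedy nor a lazy quantifier pairs nested brackets correctly, whereas a simple stack does.

```javascript
// Hypothetical sample: one insertion nested in another, then a sibling insertion.
const sample = "<of man <brought forth> a material man> and <and was>";

// Greedy: runs from the first "<" to the LAST ">" in the text.
const greedy = sample.match(/<.+>/s)[0];
// Lazy: stops at the FIRST ">", which closes the inner bracket, not the outer one.
const lazy = sample.match(/<.+?>/s)[0];

// A stack pairs each ">" with the most recent unmatched "<".
function pairBrackets(text) {
  const open = [];
  const pairs = [];
  for (let i = 0; i < text.length; i++) {
    if (text[i] === "<") {
      open.push(i);
    } else if (text[i] === ">" && open.length > 0) {
      pairs.push(text.slice(open.pop(), i + 1));
    }
  }
  return pairs; // inner pairs come out before the pairs that enclose them
}

console.log(greedy);
console.log(lazy);
console.log(pairBrackets(sample));
```

The greedy match swallows the whole sample and the lazy match ends at the inner closing bracket, so any regex-only replacement (in XSLT or sed) needs some way to handle inner pairs before outer ones.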
Looking for a good idea on how to tackle this.
thanks,
Scott
2 Answers
The approach illustrated below uses text replacement to convert the <> angles into opening and closing ins tags. Then it parses the document and processes the <ins> elements in reverse document order, so that insertions nested inside other insertions are processed before the outer ones. This processing consists of text replacement on innerHTML, so that all children of an <ins> element can be treated together, whether they are text nodes or other elements.

Which text replacement you use depends on what you want to achieve. In the example, insertions are converted into <span> elements with a border so that the effect of nesting can be seen.

The processing above is done with the document.evaluate method of JavaScript.

A solution with only an XSLT processor and text manipulation could look as follows: replace < with <ins> and > with </ins> using a tool like sed, then run a stylesheet over the result. The template with match="ins" must be adapted according to what you want to achieve.

Just as a note, in addition to Heiko’s answer: if you use a current version of Saxon supporting XSLT 3, you can start with the named template xsl:initial-template (-it as the command-line option, and no -s source option), do the regular expression string manipulation after reading the file with unparsed-text, then parse the markup with parse-xml or parse-xml-fragment, as needed, and push it through templates for further processing.
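As a rough illustration of the idea both answers share (turn the angle characters into ins tags first, then handle inner insertions before outer ones), here is a string-level JavaScript sketch. It is not the DOM-based document.evaluate code or the XSLT stylesheet from the answers. It assumes the editorial brackets are escaped as `&lt;` and `&gt;` in the serialized XHTML while real tags use raw angle brackets. `markInsertions` and `convertInsertions` are hypothetical names, and `<add>` (the TEI element for additions) is only one possible target.

```javascript
// Step 1: turn the escaped editorial brackets into <ins> tags.
// Assumption: real markup uses raw < and >, so only the editorial
// brackets appear as &lt; and &gt; in the source text.
function markInsertions(xhtml) {
  return xhtml.replace(/&lt;/g, "<ins>").replace(/&gt;/g, "</ins>");
}

// Step 2: rewrite <ins> elements from the inside out.
// The pattern only matches an <ins> whose content holds no other
// <ins> or </ins> tag, i.e. an innermost insertion. Repeating the
// replacement until nothing changes works outward through the nesting.
// wrap() must not emit <ins> itself, or the loop would never finish.
function convertInsertions(markup, wrap) {
  const innermost = /<ins>((?:(?!<\/?ins>)[\s\S])*)<\/ins>/g;
  let previous;
  do {
    previous = markup;
    markup = markup.replace(innermost, (match, content) => wrap(content));
  } while (markup !== previous);
  return markup;
}

// Usage with a shortened, hypothetical fragment of the sample paragraph.
const input =
  "<p>spiritual conception &lt;of man &lt;brought forth&gt; a material man&gt;</p>";
const output = convertInsertions(
  markInsertions(input),
  (content) => `<add>${content}</add>`
);
console.log(output);
```

If the brackets in the source are not balanced, some `<ins>` or `</ins>` tags are left over in the result, so it is worth checking for leftovers before parsing the output as XML.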