I’m trying to extract a subset of some HTML from a larger file and then perform a few transformations of the result. I’ve made some progress but I’m missing a piece or two to make this work as desired.
Here is a greatly simplified version of the XHTML I wish to transform:
<html>
<head>
<!-- lots of stuff I don't care about -->
</head>
<body>
<div>
<!-- lots of stuff I don't care about -->
<div>
<!-- lots of stuff I don't care about -->
<div id="key_div">
<div id="ignore_this">
<!-- lots of stuff I don't care about -->
</div>
<p>More junk I don't want</p>
<p>Even more junk I don't want</p>
<h2><span class="someClass" id="someID">Header</span></h2>
<p>Stuff I want to keep</p>
<!-- A lot of stuff I want to keep -->
<p>More stuff I want to keep</p>
<ul>
<li><a href="/some/old/path">Fun Place</a></li>
<li><a href="/some/old/other">Better Place</a></li>
</ul>
</div>
<!-- lots of stuff I don't care about -->
</div>
<!-- lots of stuff I don't care about -->
</div>
</body>
</html>
I want to extract everything from the <h2> tag through the rest of the content inside the <div> with the id of "key_div". But I also want to transform the <h2> into a simpler <h1> and I need to modify the hrefs in the list. The final result should look like this:
<html>
<head>
<!-- My own header stuff -->
</head>
<body>
<h1>Header</h1>
<p>Stuff I want to keep</p>
<!-- A lot of stuff I want to keep -->
<p>More stuff I want to keep</p>
<ul>
<li><a href="/new/path">Fun Place</a></li>
<li><a href="/new/other">Better Place</a></li>
</ul>
</body>
</html>
I was able to do most of the basic extraction without any of the desired transformations by using the following XSL:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:x="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="x">
<xsl:output indent="yes" encoding="utf-8"/>
<xsl:template match="/">
<html>
<head>
<title>My Title</title>
</head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
<xsl:template match="div[@id='key_div']/*">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="div[@id='ignore_this']"/>
<xsl:template match="text()"/>
</xsl:stylesheet>
This results in:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>My Title</title>
</head>
<body>
<p>More junk I don't want</p>
<p>Even more junk I don't want</p>
<h2><span class="someClass" id="someID">Header</span></h2>
<p>Stuff I want to keep</p>
<p>More stuff I want to keep</p>
<ul>
<li><a href="/some/old/path">Fun Place</a></li>
<li><a href="/some/old/other">Better Place</a></li>
</ul>
</body>
</html>
I don’t know how to remove the stuff before the <h2>.
I don’t know how to transform <h2><span class="someClass" id="someID">Header</span></h2> into <h1>Header</h1> or how to transform the hrefs. My attempts to combine a transform with the extraction usually end up with no content.
There will be a few other transformations I need to perform but for now I’ll focus on this example to get me started. I mention it so any possible answers don’t prevent any other possible transformations.
2 Answers
Assuming the HTML input elements are in no namespace (as in your samples, although you speak of XHTML), it suffices to use the identity transformation (declared in XSLT 3 through <xsl:mode on-no-match="shallow-copy"/>) plus templates for the body, the h2, the h2/span and the href attribute:
That uses XSLT 2/3, with XPath 2/3 in some places; the XSLT 1/XPath 1 version is:
With the identity transformation added it would be:
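For readers on an XSLT 1.0 processor, here is a minimal sketch of that approach: an explicit identity template plus templates for head, body, h2, h2/span and href. It is not the answerer's original code. The head template and the rule that rewrites a leading /some/old/ to /new/ are assumptions inferred from the question's sample input and desired output, and the sketch assumes the input elements are in no namespace.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes" encoding="utf-8"/>

  <!-- Identity transformation: copy every node and attribute unchanged
       unless a more specific template below applies. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: replace the original head with our own. -->
  <xsl:template match="head">
    <head>
      <title>My Title</title>
    </head>
  </xsl:template>

  <!-- Keep only the h2 inside key_div and everything that follows it
       at the same level; nothing else in the body is processed. -->
  <xsl:template match="body">
    <body>
      <xsl:apply-templates
          select=".//div[@id='key_div']/h2
                | .//div[@id='key_div']/h2/following-sibling::node()"/>
    </body>
  </xsl:template>

  <!-- Turn the h2 into an h1. -->
  <xsl:template match="h2">
    <h1>
      <xsl:apply-templates/>
    </h1>
  </xsl:template>

  <!-- Drop the span wrapper inside the h2 but keep its content. -->
  <xsl:template match="h2/span">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Assumption inferred from the samples: /some/old/X becomes /new/X.
       Other href values fall through to the identity template. -->
  <xsl:template match="@href[starts-with(., '/some/old/')]">
    <xsl:attribute name="href">
      <xsl:value-of select="concat('/new/', substring-after(., '/some/old/'))"/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

Because the extraction happens in the body template and everything else goes through the identity template, further transformations can be added as extra templates without touching the existing ones. If the real input is in the XHTML namespace, the unprefixed names in these match patterns will not match anything; they would need a prefix bound to http://www.w3.org/1999/xhtml, such as the x prefix already declared in the question's stylesheet.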
Try something like:
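One way this could look in XSLT 1.0, offered as a sketch rather than the answerer's original code: build the output skeleton in the root template as the question's stylesheet already does, write the heading text with xsl:value-of, and push only the nodes after the h2 through an identity template with an override for href. The variable name start is purely illustrative. The sketch assumes no-namespace input, a single h2 directly inside key_div, and a /some/old/ to /new/ rewrite inferred from the samples.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes" encoding="utf-8"/>

  <!-- Build the new document skeleton, then pull in only the wanted nodes. -->
  <xsl:template match="/">
    <html>
      <head>
        <title>My Title</title>
      </head>
      <body>
        <!-- "start" is an illustrative name for the h2 that marks
             the beginning of the content to keep. -->
        <xsl:variable name="start" select="//div[@id='key_div']/h2[1]"/>
        <!-- The string value of the h2 is its text without the span markup. -->
        <h1>
          <xsl:value-of select="$start"/>
        </h1>
        <!-- Everything after the h2 inside key_div, including comments. -->
        <xsl:apply-templates select="$start/following-sibling::node()"/>
      </body>
    </html>
  </xsl:template>

  <!-- Identity transformation for the nodes that are kept. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption inferred from the samples: /some/old/X becomes /new/X. -->
  <xsl:template match="a/@href[starts-with(., '/some/old/')]">
    <xsl:attribute name="href">
      <xsl:value-of select="concat('/new/', substring-after(., '/some/old/'))"/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

Compared with the question's stylesheet, the main change is replacing xsl:copy-of with an identity template. xsl:copy-of copies a whole subtree in one step, so no template can intervene on the nested href attributes, whereas the identity template visits each node and lets more specific templates override it.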