I have an HTML table with possibly missing or malformed colspan values:
<table border="1">
<tbody>
<tr>
<th>A</th><th>B</th><th>C</th><th>D</th>
<th>E</th><th>F</th><th>G</th><th>H</th>
<th>I</th><th>J</th><th>K</th><th>L</th>
<th>M</th><th>N</th><th>O</th><th>P</th>
</tr>
<tr>
<td > 0</td>
<td colspan="0" > 1</td>
<td colspan="2" > 2</td>
<td colspan="-2" > 3</td>
<td colspan="+ 2" > 4</td>
<td colspan="+2" > 5</td>
<td colspan="*2#%@!" > 6</td>
<td colspan="2.7" > 7</td>
<td colspan="-2.3" > 8</td>
<td colspan="2e1" > 9</td>
<td colspan=" 2 " >10</td>
</tr>
</tbody>
</table>
I would like to get the colspan value for each td according to the HTML 4/5 specification (I'm currently trying to figure out what the W3C spec says about it). For now, let's say that the values rendered for the snippet above are my expected output:
colspan | rendered value |
---|---|
missing | 1 |
"0" | 1 |
"2" | 2 |
"-2" | 1 |
"+ 2" | 1 |
"+2" | 2 |
"*2#%@!" | 1 |
"2.7" | 2 |
"-2.3" | 1 |
"2e1" | 2 |
" 2 " | 2 |
How can I achieve it using XPath 3.1?
edit
I’ve written this XPath expression:
//td/( (1, @colspan[. castable as xs:double]) => max() => xs:integer() )
But it converts "2e1" to 20 instead of 2.
2 Answers
Under both the WHATWG and W3C standards, colspan values should satisfy the regex ^\s*\+?(\d+).* , with the part in parentheses being the effective value. Both standards fall back to 1 when colspan is zero, invalid, or missing.

@kjhughes' answer works well and is easy to understand, but here's another possible solution:
ASIDE
The relevant parts of the standards:
WHATWG
Attributes common to td and th elements:
Rules for parsing non-negative integers:
Rules for parsing integers:
Collect a sequence of code points
W3C
Algorithm for processing rows:
Rules for parsing non-negative integers:
Collect a sequence of characters
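The parsing rules named in this aside can be walked through step by step. The following is only a minimal sketch, written in Python rather than XPath so it is easy to run, and based on my reading of the "rules for parsing integers": skip ASCII whitespace, accept one optional sign, then collect the leading run of ASCII digits. The names parse_non_negative_integer and effective_colspan are made up for illustration and do not come from either standard.

```python
ASCII_WHITESPACE = "\t\n\f\r "
ASCII_DIGITS = "0123456789"


def parse_non_negative_integer(text):
    """Return the parsed value, or None where the rules would report an error."""
    pos = 0
    # Skip leading ASCII whitespace.
    while pos < len(text) and text[pos] in ASCII_WHITESPACE:
        pos += 1
    # Accept a single optional sign character.
    negative = False
    if pos < len(text) and text[pos] in "+-":
        negative = text[pos] == "-"
        pos += 1
    # Collect the leading run of ASCII digits; anything after it is ignored.
    start = pos
    while pos < len(text) and text[pos] in ASCII_DIGITS:
        pos += 1
    if pos == start:
        return None  # no digit where one was required
    value = int(text[start:pos])
    if negative and value != 0:
        return None  # a negative result is an error for non-negative parsing
    return value


def effective_colspan(attr):
    """Fall back to 1 when the attribute is missing, invalid, or zero."""
    if attr is None:
        return 1
    value = parse_non_negative_integer(attr)
    return value if value else 1


for raw in (None, "0", "2", "-2", "+ 2", "+2", "*2#%@!", "2.7", "-2.3", "2e1", " 2 "):
    print(repr(raw), effective_colspan(raw))
```

Under these assumptions, "2e1" stops at the "e" and yields 2, and "+ 2" fails because the character after the sign is not a digit.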
Consider using a regex to extract the leading digit characters from the value, ignoring everything from the first non-digit character onward. A successful match yields the leading integer; anything else yields 1:
Update: Now handles the question update that added the cases colspan="0" and colspan="+2".
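To sanity-check the leading-digit idea against the row from the question, here is a rough sketch in Python (not XPath, and not the expression from this answer). It assumes a pattern of optional ASCII whitespace, an optional "+", and a run of digits, and treats a zero result like a failed match. The names colspan_from_regex and ColspanCollector are hypothetical helpers for this illustration only.

```python
import re
from html.parser import HTMLParser

# Optional ASCII whitespace, an optional "+", then the leading run of digits.
LEADING_DIGITS = re.compile(r"[\t\n\f\r ]*\+?([0-9]+)")


def colspan_from_regex(attr):
    """Return the leading integer of attr, or 1 if there is none or it is 0."""
    match = LEADING_DIGITS.match(attr or "")
    return (int(match.group(1)) or 1) if match else 1


class ColspanCollector(HTMLParser):
    """Collect the effective colspan of every <td> start tag."""

    def __init__(self):
        super().__init__()
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.spans.append(colspan_from_regex(dict(attrs).get("colspan")))


# The data row from the question, with cell whitespace trimmed.
ROW = (
    '<tr><td>0</td><td colspan="0">1</td><td colspan="2">2</td>'
    '<td colspan="-2">3</td><td colspan="+ 2">4</td><td colspan="+2">5</td>'
    '<td colspan="*2#%@!">6</td><td colspan="2.7">7</td>'
    '<td colspan="-2.3">8</td><td colspan="2e1">9</td>'
    '<td colspan=" 2 ">10</td></tr>'
)

collector = ColspanCollector()
collector.feed(ROW)
print(collector.spans)
```

If the assumptions hold, the printed list should line up with the "rendered value" column of the table in the question, row by row.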