I have an HTML table with possibly missing or malformed colspan values:

<table border="1">
  <tbody>
    <tr>
      <th>A</th><th>B</th><th>C</th><th>D</th>
      <th>E</th><th>F</th><th>G</th><th>H</th>
      <th>I</th><th>J</th><th>K</th><th>L</th>
      <th>M</th><th>N</th><th>O</th><th>P</th>
    </tr>
    <tr>
      <td                  > 0</td>
      <td colspan="0"      > 1</td>
      <td colspan="2"      > 2</td>
      <td colspan="-2"     > 3</td>
      <td colspan="+ 2"    > 4</td>
      <td colspan="+2"     > 5</td>
      <td colspan="*2#%@!" > 6</td>
      <td colspan="2.7"    > 7</td>
      <td colspan="-2.3"   > 8</td>
      <td colspan="2e1"    > 9</td>
      <td colspan=" 2 "    >10</td>
    </tr>
  </tbody>
</table>

I would like to get the colspan value for each td using the HTML4~5 specification (I’m currently trying to figure out what the W3C spec tells us about it). For now let’s say that the results of the snippet above are my expected output:

colspan     rendered value
missing     1
"0"         1
"2"         2
"-2"        1
"+ 2"       1
"+2"        2
"*2#%@!"    1
"2.7"       2
"-2.3"      1
"2e1"       2
" 2 "       2

How can I achieve it using XPath 3.1?


edit

I’ve written this XPath expression:

//td/( (1, @colspan[. castable as xs:double]) => max() => xs:integer() )

But it converts "2e1" to 20 instead of 2.

2 Answers


  1. Chosen as BEST ANSWER

    Under the WHATWG and W3C standards, a colspan value is effectively parsed as if matched against the regex ^\s*\+?(\d+).*, with the part in parentheses being the effective value. Both standards fall back to 1 when colspan is zero, invalid, or missing.

    @kjhughes's answer works well and is easy to understand, but here's another possible solution:

    //td/( max((1, analyze-string(@colspan, "^\s*\+?\d+")/fn:match)) )
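As a cross-check outside XPath, the same "leading integer, else 1" rule can be sketched in Python and run against the sample values from the question. This is only an illustration: the names `LEADING_INT` and `effective_colspan` are made up here, the whitespace class is assumed to be the ASCII whitespace set, and the upper bound of 1000 mentioned in the WHATWG excerpt is not applied.

```python
import re

# Optional ASCII whitespace, an optional "+", then the leading run of digits.
LEADING_INT = re.compile(r"^[ \t\n\f\r]*\+?([0-9]+)")


def effective_colspan(raw):
    """Return the effective colspan for a raw attribute value (None if absent)."""
    if raw is None:
        return 1  # missing attribute
    m = LEADING_INT.match(raw)
    if m is None:
        return 1  # no leading integer: treated as invalid
    return max(1, int(m.group(1)))  # zero also falls back to 1


# The sample values from the question, in table order.
samples = [None, "0", "2", "-2", "+ 2", "+2", "*2#%@!", "2.7", "-2.3", "2e1", " 2 "]
results = [effective_colspan(s) for s in samples]
print(results)
```

With these inputs the sketch should reproduce the "rendered value" column from the question, including 2 (not 20) for "2e1".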
    

    ASIDE

    The relevant parts of the standards:

    WHATWG

    Attributes common to td and th elements:

    The td and th elements may have a colspan content attribute specified, whose value must be a valid non-negative integer greater than zero and less than or equal to 1000.

    Rules for parsing non-negative integers:

    1. Let value be the result of parsing input using the rules for parsing integers.

    Rules for parsing integers:

    A string is a valid integer if it consists of one or more ASCII digits, optionally prefixed with a U+002D HYPHEN-MINUS character (-).

    1. Skip ASCII whitespace within input given position.
    1. ... Otherwise, if the character indicated by position (the first character) is a U+002B PLUS SIGN character (+):
      1. Advance position to the next character. (The "+" is ignored, but it is not conforming.)
    1. Collect a sequence of code points that are ASCII digits from input given position, and interpret the resulting sequence as a base-ten integer. Let value be that integer.

    Collect a sequence of code points

    1. While position doesn’t point past the end of input and the code point at position within input meets the condition condition:

      1. Append that code point to the end of result.

    W3C

    Algorithm for processing rows:

    1. If the current cell has a colspan attribute, then parse that attribute's value, and let colspan be the result.

      If parsing that value failed, or returned zero, or if the attribute is absent, then let colspan be 1, instead.

    Rules for parsing non-negative integers:

    1. Skip whitespace.
    1. If the character indicated by position is a U+002B PLUS SIGN character (+), advance position to the next character. (The "+" is ignored, but it is not conforming.)

    8. Collect a sequence of characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), and interpret the resulting sequence as a base-ten integer. Let value be that integer.

    Collect a sequence of characters

    While position doesn't point past the end of input and the character at position is one of the characters, append that character to the end of result and advance position to the next character in input.
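The quoted parsing steps can also be followed one character at a time. The Python sketch below is a rough transcription, not the normative algorithm: the function names are hypothetical, and a leading "-" is simply treated as a parse failure, which is an assumption that yields the same colspan because negative values and zero both fall back to 1.

```python
ASCII_WHITESPACE = "\t\n\f\r "
ASCII_DIGITS = "0123456789"


def collect(input_str, position, allowed):
    """Collect consecutive characters found in `allowed`, starting at `position`.

    Returns the collected string and the position just past it.
    """
    result = []
    while position < len(input_str) and input_str[position] in allowed:
        result.append(input_str[position])
        position += 1
    return "".join(result), position


def parse_non_negative_integer(input_str):
    """Return the parsed integer, or None when parsing fails."""
    # Skip leading whitespace.
    _, position = collect(input_str, 0, ASCII_WHITESPACE)
    if position >= len(input_str):
        return None
    # Assumption: treat a leading "-" as a failure (see the note above).
    if input_str[position] == "-":
        return None
    # A leading "+" is ignored, although it is not conforming.
    if input_str[position] == "+":
        position += 1
    # Collect the digit run; anything after it is ignored.
    digits, position = collect(input_str, position, ASCII_DIGITS)
    if not digits:
        return None
    return int(digits)


def colspan_of(attribute):
    """Apply the fallback to 1 for absent, unparsable, or zero values."""
    if attribute is None:
        return 1
    value = parse_non_negative_integer(attribute)
    if value is None or value == 0:
        return 1
    return value


samples = [None, "0", "2", "-2", "+ 2", "+2", "*2#%@!", "2.7", "-2.3", "2e1", " 2 "]
print([colspan_of(s) for s in samples])
```

Because collection stops at the first non-digit, "2e1" and "2.7" both parse as 2, which is the behavior the regex-based expressions are meant to mimic.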


  2. Consider using a regex to extract the leading digits from the value, ignoring everything from the first non-digit character onward. A successful match yields the leading integer; anything else yields 1:

    //td/(if (matches(@colspan,'^\s*\+?[1-9]\d*')) 
              then replace(@colspan, '^\s*\+?(\d+).*$', '$1') 
              else '1')
    

    Update: Now handles question update that added cases, colspan="0" and colspan="+2".
