
I am trying to write an XPath expression that fetches the "td" elements between one "h2" tag and the next "h2" tag, or between an "h2" tag and the closing "table" tag, whichever comes first.

HTML Code

<html>
  <body>
    <table>
      <tbody>
        <tr>
          <td colspan="2" class="dc-section" >
             <h2>HEADING-1</h2>
          </td>
        </tr>
        <tr>
          <td class="dc-table-name" >Country</td>
          <td class="dc-table-value" >India</td>
        </tr>
        <tr>
          <td class="dc-table-name" >Country</td>
          <td class="dc-table-value" >Nepal</td>
        </tr>
        <tr>
          <td colspan="2" class="dc-section" >
            <h2>HEADING-2</h2>
          </td>
        </tr>
        <tr>
          <td class="dc-table-name" >Country</td>
          <td class="dc-table-value" >USA</td>
        </tr>
        <tr>
          <td class="dc-table-name" >Country</td>
          <td class="dc-table-value" >Canada</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

Given HEADING-1, I need the td elements with the values "Country"->"India" and "Country"->"Nepal". Given HEADING-2, I need the td elements with the values "Country"->"USA" and "Country"->"Canada".

I tried the XPath below, but for HEADING-1 it selects every "td" that follows the heading, including the ones under HEADING-2.

How can I write a common XPath expression that fetches the "td" elements for both "HEADING-1" and "HEADING-2"?

XPATH (Not working)

//h2[text()='HEADING-1']/following::td

2 Answers


  1. NB: This is a reply to the original (unedited) question.

    Using xmlstarlet here, but the expressions are plain XPath 1.0 and
    should be comprehensible on their own:

    (Edited 2023-04-26: missed the first part of the question)

    Given HEADING-1, need td elements with value "Country":

    # shellcheck shell=sh
    xmlstarlet select -R -I -t \
      -c '//tr[td/h2/text()="HEADING-1"]/following-sibling::tr[1]/td' \
      file.xml
    

    Output:

    <xsl-select>
      <td class="dc-table-name">Country</td>
      <td class="dc-table-value">India</td>
    </xsl-select>
    

    How to frame a common XPATH expression that works to fetch "td"
    elements for both "HEADING-1" and "HEADING-2"?

    # shellcheck shell=sh
    xmlstarlet select -R -I -t -c '//tr/td[not(h2)]' file.xml
    

    which selects td elements with a tr parent and no h2 children.

    Output:

    <xsl-select>
      <td class="dc-table-name">Country</td>
      <td class="dc-table-value">India</td>
      <td class="dc-table-name">Country</td>
      <td class="dc-table-value">USA</td>
    </xsl-select>
    

    As an alternative, if you want to process the blocks separately, you
    could use the EXSLT set:leading function:

    # shellcheck shell=sh
    xmlstarlet select \
      -R -I -t \
      -m '//tr/td[@class="dc-section"]' \
        -e 'div' \
          -c 'set:leading(following::td,following::td[@class="dc-section"][1])' \
      file.xml
    

    Output:

    <xsl-select>
      <div>
        <td class="dc-table-name">Country</td>
        <td class="dc-table-value">India</td>
      </div>
      <div>
        <td class="dc-table-name">Country</td>
        <td class="dc-table-value">USA</td>
      </div>
    </xsl-select>
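
For readers without xmlstarlet, the idea behind the set:leading call can be sketched in plain Python. This is only an illustration, not part of the answer above: it does not evaluate XPath, the helper name `section_blocks` and the compact `DOC` string are made up for the example, and it assumes the markup parses as well-formed XML. It walks all td elements in document order and cuts the list at each `dc-section` cell, which is roughly what "the following td elements that precede the next section td" amounts to.

```python
import xml.etree.ElementTree as ET

# Compact copy of the table from the question (two data rows per heading).
DOC = """
<table><tbody>
  <tr><td colspan="2" class="dc-section"><h2>HEADING-1</h2></td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">India</td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">Nepal</td></tr>
  <tr><td colspan="2" class="dc-section"><h2>HEADING-2</h2></td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">USA</td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">Canada</td></tr>
</tbody></table>
"""


def section_blocks(root):
    """Map each heading to the text of the td cells up to the next heading."""
    tds = list(root.iter("td"))  # all td elements, in document order
    # Positions of the section cells that open each block.
    starts = [i for i, td in enumerate(tds) if td.get("class") == "dc-section"]
    blocks = {}
    for n, start in enumerate(starts):
        # A block ends at the next section cell, or at the end of the table.
        end = starts[n + 1] if n + 1 < len(starts) else len(tds)
        heading = tds[start].findtext("h2").strip()
        blocks[heading] = [td.text for td in tds[start + 1:end]]
    return blocks


root = ET.fromstring(DOC)
for heading, cells in section_blocks(root).items():
    print(heading, cells)
```

With the two-row table from the question, each heading maps to four cells (name, value, name, value).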
    

    Add a -C option before -t in the last command to get a copy of
    the stylesheet:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets" xmlns:exslt="http://exslt.org/common" version="1.0" extension-element-prefixes="exslt set">
      <xsl:output omit-xml-declaration="yes" indent="yes"/>
      <xsl:template match="/">
        <xsl-select>
          <xsl:for-each select="//tr/td[@class=&quot;dc-section&quot;]">
            <xsl:element name="div">
              <xsl:copy-of select="set:leading(following::td,following::td[@class=&quot;dc-section&quot;][1])"/>
            </xsl:element>
          </xsl:for-each>
        </xsl-select>
      </xsl:template>
    </xsl:stylesheet>
    
  2. Instead of trying to match the td elements that follow the heading, you could select the rows that have no h2 and check whether the first preceding tr that does contain an h2 matches the heading you want:

    //tr[not(td/h2)][preceding-sibling::tr[td/h2][1][starts-with(td/h2,'HEADING-1')]]/td
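
The logic of that predicate can be sketched in plain Python as a rough cross-check. This is an assumption-laden illustration rather than an XPath evaluation: the helper name `cells_under` and the compact `DOC` string are hypothetical, the markup is assumed to parse as well-formed XML, and the heading is compared by exact equality where the expression above uses starts-with. The loop remembers the most recent heading row and keeps the cells of every non-heading row that falls under the requested heading.

```python
import xml.etree.ElementTree as ET

# Compact copy of the table from the question (two data rows per heading).
DOC = """
<table><tbody>
  <tr><td colspan="2" class="dc-section"><h2>HEADING-1</h2></td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">India</td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">Nepal</td></tr>
  <tr><td colspan="2" class="dc-section"><h2>HEADING-2</h2></td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">USA</td></tr>
  <tr><td class="dc-table-name">Country</td><td class="dc-table-value">Canada</td></tr>
</tbody></table>
"""


def cells_under(root, heading):
    """Return the td texts of rows whose nearest preceding heading row matches."""
    cells = []
    current = None  # text of the most recent h2 seen so far
    for tr in root.iter("tr"):
        h2 = tr.find("td/h2")
        if h2 is not None:
            # Heading row: it starts a new block and contributes no cells.
            current = (h2.text or "").strip()
        elif current == heading:
            # Data row under the requested heading.
            cells.extend(td.text for td in tr.findall("td"))
    return cells


root = ET.fromstring(DOC)
print(cells_under(root, "HEADING-1"))  # ['Country', 'India', 'Country', 'Nepal']
print(cells_under(root, "HEADING-2"))  # ['Country', 'USA', 'Country', 'Canada']
```

Because the check is made per row against the nearest preceding heading, the same function works for any heading, including the last one, which is closed by the end of the table instead of another h2.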
    