Unconditionally stop scraping HTML at specified element (or EOF)

DavidChung
October 18, 2024
91 views
1 vote
2 Answers

I’m using Python’s lxml.html package to scrape an HTML file. The HTML I’m trying to scrape reads, in part:

<h1>Description of DAB Ensemble 1</h1><table>Stuff I don't care about</table>
    <!-- Tags I don't care about -->
    <div id="announcement_data_block">
        <h3>Announcement information</h3>
        <p>No announcement information is broadcast</p>
    </div>
    <!-- More tags I don't care about -->
<h1>Description of DAB Ensemble 2</h1><table>Stuff I don't care about</table>
    <!-- Tags I don't care about -->
    <div id="announcement_data_block">
        <h3>Announcement information</h3>
            <h4>Announcement switching (FIG0/19)</h4>
                <table>Stuff I DO care about</table>
    </div>
    <!-- More tags I don't care about -->

I’m interested in the "Announcement switching" table, which may or may not be present for a given DAB ensemble. I pass the following expression to lxml.html's xpath() method:

f'//h1[text()="Description of DAB Ensemble {ens_idx}"]/following-sibling::table/following-sibling::div[@id="announcement_data_block"]/h4[starts-with(text(), "Announcement switching")]/following-sibling::table'

Per my understanding, this XPath statement is saying, for a given ens_idx value:

Start at the root and find an h1 tag with text matching "Description of DAB Ensemble {ens_idx}" (e.g. "Description of DAB Ensemble 1", "Description of DAB Ensemble 2"), then go to the first table you see after that. In the above example, it would be the table labelled "Stuff I don’t care about". Afterwards, go to the next div whose id is "announcement_data_block". Within that div, find an h4 tag whose text starts with "Announcement switching". Get the first table following that.
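One detail in this reading may be worth checking: in XPath 1.0, a `following-sibling::table` step without a positional predicate selects every later sibling table, not just the first one. The sketch below illustrates the difference on a cut-down document; the cell contents `first` and `second` are made up for illustration and are not part of the original page.

```python
from lxml import html

# Hypothetical two-ensemble document; only the headings mirror the question.
doc = html.fromstring(
    "<html><body>"
    "<h1>Description of DAB Ensemble 1</h1><table><tr><td>first</td></tr></table>"
    "<h1>Description of DAB Ensemble 2</h1><table><tr><td>second</td></tr></table>"
    "</body></html>"
)

start = '//h1[text()="Description of DAB Ensemble 1"]'

# Without [1]: every table that is a later sibling of the h1.
all_tables = doc.xpath(start + "/following-sibling::table")

# With [1]: only the nearest later sibling table.
first_only = doc.xpath(start + "/following-sibling::table[1]")

print(len(all_tables), len(first_only))
```

On this sample the unrestricted step also reaches the table belonging to Ensemble 2, which is one way a query can wander past the intended section.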

In the example above, DAB Ensemble 1 does not have such a table, so I would want the lookup to return None when attempting to get the table for DAB Ensemble 1. However, XPath doesn’t know to stop when it hits the h1 tag "Description of DAB Ensemble 2", so it keeps going until it finds DAB Ensemble 2’s h4 tag. I’m looking for an XPath expression that unconditionally stops at the next "Description of DAB Ensemble" h1 tag. Essentially, I wish to modify the directive to:

Start at the root and find an h1 tag with text matching "Description of DAB Ensemble {ens_idx}" (e.g. "Description of DAB Ensemble 1", "Description of DAB Ensemble 2"), then go to the first table you see after that. In the above example, it would be the table labelled "Stuff I don’t care about". Afterwards, go to the next div whose id is "announcement_data_block". Within that div, find an h4 tag whose text starts with "Announcement switching". Get the first table following that. If these criteria are not met before the h1 tag with text matching "Description of DAB Ensemble {ens_idx + 1}" or EOF, then return None.

The final sentence of that description is what is missing from my XPath expression. Does anyone know how to construct such an expression?
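For anyone wanting to reproduce the behaviour, here is a minimal sketch. It assumes the abbreviated sample from the question is expanded with real table rows (the `<tr><td>` wrappers and the `original_query` helper name are illustrative, not from the original page).

```python
from lxml import html

# Hypothetical sample: same layout as the question, with real table rows.
SAMPLE = """
<html><body>
<h1>Description of DAB Ensemble 1</h1>
<table><tr><td>Stuff I don't care about</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <p>No announcement information is broadcast</p>
</div>
<h1>Description of DAB Ensemble 2</h1>
<table><tr><td>Stuff I don't care about</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <h4>Announcement switching (FIG0/19)</h4>
  <table><tr><td>Stuff I DO care about</td></tr></table>
</div>
</body></html>
"""


def original_query(tree, ens_idx):
    """Run the expression from the question unchanged."""
    expr = (
        f'//h1[text()="Description of DAB Ensemble {ens_idx}"]'
        '/following-sibling::table'
        '/following-sibling::div[@id="announcement_data_block"]'
        '/h4[starts-with(text(), "Announcement switching")]'
        '/following-sibling::table'
    )
    return tree.xpath(expr)


tree = html.fromstring(SAMPLE)

# Ensemble 1 has no switching table, yet the query still finds one:
# the sibling axes reach past the second h1 into Ensemble 2's div.
leaked = original_query(tree, 1)
print([t.text_content().strip() for t in leaked])
```

Note that `xpath()` returns a list here, so "no match" shows up as an empty list rather than None.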

Answers

Chosen as BEST ANSWER
- DavidChung
- October 18, 2024 at 12:24 am
- 0 votes
While LMC's solution did work, it assumes that the div "announcement_data_block" is always present. This is not the case; announcement_data_block may or may not be there. The solution I ended up going with is to count the number of h1 headers containing the text "Description of DAB Ensemble " that precede the element I find. So if you’re looking for an element in Ensemble 1, "Description of DAB Ensemble " should have appeared once, as part of the header "Description of DAB Ensemble 1". If you’re looking for an element in Ensemble 2, "Description of DAB Ensemble " should have appeared twice, as part of the headers "Description of DAB Ensemble 1" and "Description of DAB Ensemble 2". And so on. My XPath query ended up looking as follows:
```
f'//h1[text()="Description of DAB Ensemble {ens_idx}"]/following-sibling::table[1]/following-sibling::div[@id="announcement_data_block" and count(preceding-sibling::h1[starts-with(text(), "Description of DAB Ensemble ")])={ens_idx}]/h4[starts-with(text(), "Announcement switching")]/following-sibling::table'
```
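A minimal sketch of how this expression might be wrapped so that a missing table comes back as None. The helper name `find_switching_table`, the `HEADING` constant, and the sample document (in which Ensemble 1 has no announcement_data_block at all) are assumptions for illustration. The count check also assumes ensembles are numbered consecutively from 1 in document order.

```python
from lxml import html

# Hypothetical sample: Ensemble 1 has no announcement_data_block at all.
SAMPLE = """
<html><body>
<h1>Description of DAB Ensemble 1</h1>
<table><tr><td>Stuff I don't care about</td></tr></table>
<h1>Description of DAB Ensemble 2</h1>
<table><tr><td>Stuff I don't care about</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <h4>Announcement switching (FIG0/19)</h4>
  <table><tr><td>Stuff I DO care about</td></tr></table>
</div>
</body></html>
"""

HEADING = "Description of DAB Ensemble "


def find_switching_table(tree, ens_idx):
    """Return the switching table for one ensemble, or None if absent."""
    expr = (
        f'//h1[text()="{HEADING}{ens_idx}"]'
        '/following-sibling::table[1]'
        '/following-sibling::div[@id="announcement_data_block"'
        # Keep the div only if exactly ens_idx ensemble headings precede it,
        # i.e. no later ensemble heading sits between the h1 and the div.
        f' and count(preceding-sibling::h1[starts-with(text(), "{HEADING}")])={ens_idx}]'
        '/h4[starts-with(text(), "Announcement switching")]'
        '/following-sibling::table'
    )
    matches = tree.xpath(expr)
    return matches[0] if matches else None


tree = html.fromstring(SAMPLE)
missing = find_switching_table(tree, 1)
found = find_switching_table(tree, 2)
print(missing, found.text_content().strip())
```

For Ensemble 1 the only candidate div has two ensemble headings before it, so the count predicate rejects it and the helper returns None.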


- LMC
- October 17, 2024 at 6:09 pm
- 0 votes
From the sample, it looks like the h1 elements and the divs with the id are all siblings, so the search should ask for only the first following table and the first following div with @id="announcement_data_block":

f'//h1[text()="Description of DAB Ensemble {ens_idx}"]/following-sibling::table[1]/following-sibling::div[@id="announcement_data_block"][1]/h4[starts-with(text(), "Announcement switching")]/following-sibling::table'

BTW: ids should NOT be duplicated.
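A rough sketch of how the positional `[1]` predicates behave, under two assumed layouts: one where every ensemble has the div, and one where Ensemble 1 lacks it (the case raised in the accepted answer). The `ensemble` helper, the block constants, and the `x` cell content are hypothetical.

```python
from lxml import html


def first_block_query(tree, ens_idx):
    """Run the expression above, restricted to the first table and first div."""
    expr = (
        f'//h1[text()="Description of DAB Ensemble {ens_idx}"]'
        '/following-sibling::table[1]'
        '/following-sibling::div[@id="announcement_data_block"][1]'
        '/h4[starts-with(text(), "Announcement switching")]'
        '/following-sibling::table'
    )
    return tree.xpath(expr)


def ensemble(idx, block):
    """Build one hypothetical ensemble section: heading, table, optional div."""
    return (
        f"<h1>Description of DAB Ensemble {idx}</h1>"
        "<table><tr><td>x</td></tr></table>"
        f"{block}"
    )


EMPTY_BLOCK = (
    '<div id="announcement_data_block"><h3>Announcement information</h3>'
    "<p>No announcement information is broadcast</p></div>"
)
FULL_BLOCK = (
    '<div id="announcement_data_block"><h3>Announcement information</h3>'
    "<h4>Announcement switching (FIG0/19)</h4>"
    "<table><tr><td>Stuff I DO care about</td></tr></table></div>"
)

# Layout A: both ensembles have the div. Layout B: Ensemble 1 has none.
with_block = html.fromstring(
    f"<html><body>{ensemble(1, EMPTY_BLOCK)}{ensemble(2, FULL_BLOCK)}</body></html>"
)
without_block = html.fromstring(
    f"<html><body>{ensemble(1, '')}{ensemble(2, FULL_BLOCK)}</body></html>"
)

print(len(first_block_query(with_block, 1)))     # div present but no h4
print(len(first_block_query(without_block, 1)))  # first div belongs to Ensemble 2
```

In layout A the first div after the heading is Ensemble 1's own, so nothing is returned. In layout B the first matching div is Ensemble 2's, so its table is picked up for Ensemble 1.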
