lxml doesn't extract the text of an HTML tag when another tag precedes the text

Minions
August 2, 2023
217 views
1 vote
2 Answers

I have the following HTML snippet:

```
<div>
    <p class="test1">
        <i class="empty"> </i>
        WANTED TEXT
    </p>
</div>
```

I want to extract the content of the p tag (WANTED TEXT). My code:

```
from io import StringIO
from lxml import etree

html = """<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>"""
html_parser = etree.HTMLParser()
tmp_xp = "//p[1]"
selected_tag = etree.parse(StringIO(html), html_parser).xpath(tmp_xp)
print(selected_tag[0].text)
```

The code prints nothing. If I move WANTED TEXT before the <i> tag, the code works as expected.

How can I solve this?

Answers

- AndrejKesely
- August 2, 2023 at 10:35 pm
- 0 votes
Call .itertext() on selected_tag[0] and join the pieces:
```
print("".join(selected_tag[0].itertext()).strip())
```
Prints:
```
WANTED TEXT
```
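One caveat worth keeping in mind: .itertext() walks the whole subtree, so it also returns any text inside nested tags. The sketch below assumes a modified version of the question's snippet in which the i element contains the word "icon" (that text is not in the original), and compares .itertext() with the XPath text() step, which selects only the text nodes that are direct children of p.

```python
from io import StringIO
from lxml import etree

# Assumption: "icon" inside <i> is added here purely for illustration.
html = '<div> <p class="test1"> <i class="empty">icon</i> WANTED TEXT </p> </div>'
p = etree.parse(StringIO(html), etree.HTMLParser()).xpath("//p[1]")[0]

# itertext() yields every text fragment in the subtree, including text inside <i>.
all_text = "".join(p.itertext()).strip()

# text() selects only the text nodes that are direct children of <p>.
direct_text = "".join(p.xpath("text()")).strip()

print(all_text)     # includes "icon"
print(direct_text)  # only the text that sits directly in <p>
```

With the question's original snippet the i element holds only whitespace, so both approaches should give the same result after stripping.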

- DanielHaley
- August 2, 2023 at 11:30 pm
- 0 votes
The reason .text doesn’t work is that the text WANTED TEXT (along with the surrounding whitespace) is actually the .tail of the i element.
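To see where the text actually lives, you can inspect .text and .tail on both elements. This is a small sketch using the question's snippet; the variable names p and i are just illustrative. The values are stripped before printing because the exact whitespace the HTML parser keeps can vary.

```python
from io import StringIO
from lxml import etree

html = '<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>'
p = etree.parse(StringIO(html), etree.HTMLParser()).xpath("//p[1]")[0]
i = p[0]  # first child element of <p>, i.e. the <i> tag

# .text is the text between an element's start tag and its first child.
print(repr((p.text or "").strip()))  # nothing but whitespace before <i>
print(repr((i.text or "").strip()))  # nothing but whitespace inside <i>

# .tail is the text that follows an element's end tag, up to the next tag.
print(repr((i.tail or "").strip()))  # the wanted text follows </i>
```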

In addition to the .join(), .itertext(), and .strip() shown in another answer, you can also use plain XPath (normalize-space()).

Just change:
```
print(selected_tag[0].text)
```
to:
```
print(selected_tag[0].xpath("normalize-space()"))
```
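One difference between the two approaches: .strip() only trims the ends, while normalize-space() also collapses runs of whitespace inside the string to a single space. The sketch below assumes a variant of the question's markup with extra spaces between the two words (that spacing is not in the original) to make the difference visible.

```python
from io import StringIO
from lxml import etree

# Assumption: the extra spaces between WANTED and TEXT are added for illustration.
html = """
<div>
    <p class="test1">
        <i class="empty"> </i>
        WANTED   TEXT
    </p>
</div>
"""
p = etree.parse(StringIO(html), etree.HTMLParser()).xpath("//p[1]")[0]

# strip() removes leading and trailing whitespace only.
stripped = "".join(p.itertext()).strip()

# normalize-space() trims the ends and collapses inner whitespace runs.
normalized = p.xpath("normalize-space()")

print(repr(stripped))
print(repr(normalized))
```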