I have the following HTML snippet:
<div>
<p class="test1">
<i class="empty"> </i>
WANTED TEXT
</p>
</div>
I want to extract the content of the p tag (WANTED TEXT). My code:
from lxml import etree
from io import StringIO
html_parser = etree.HTMLParser()
tmp_xp = "//p[1]"
selected_tag = etree.parse(StringIO("""<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>"""), html_parser).xpath(tmp_xp)
print(selected_tag[0].text)
The code prints nothing. If I move WANTED TEXT to before the <i> tag, the code starts to work fine.
How can I solve this?
2 Answers
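With selected_tag[0], use .itertext() to collect every text fragment inside the element, then join the fragments and strip the surrounding whitespace. This prints WANTED TEXT.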
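The original snippet for this answer did not survive, so the following is only a minimal sketch of the .itertext() approach, reusing the question's markup and XPath. The variable names html, tree and wanted are illustrative and not taken from the answer.

```python
from io import StringIO
from lxml import etree

# Same markup as in the question
html = """<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>"""

tree = etree.parse(StringIO(html), etree.HTMLParser())
selected_tag = tree.xpath("//p[1]")

# itertext() yields every text fragment inside <p>, including text that
# follows child elements; join the fragments and trim the whitespace
wanted = "".join(selected_tag[0].itertext()).strip()
print(wanted)  # prints: WANTED TEXT
```

Joining before stripping keeps the code short. If the paragraph held several inline children, the inner whitespace would be preserved as-is.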
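The reason .text doesn't work is that the text WANTED TEXT (along with the surrounding whitespace) is actually the .tail of the i element.

In addition to the .join(), .itertext(), and .strip() shown in another answer, you can also use plain XPath (normalize-space()). Just change the XPath expression so that it wraps the p selection in normalize-space(), and print the returned string directly instead of reading .text from the first result.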
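The before/after lines of this answer are missing, so what follows is a hedged reconstruction rather than the answerer's exact code. It first checks where the text actually lives (.text versus .tail), then applies normalize-space() to the same //p[1] selection. The names p, i_elem and wanted are assumptions introduced for illustration.

```python
from io import StringIO
from lxml import etree

# Same markup as in the question
html = """<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>"""
tree = etree.parse(StringIO(html), etree.HTMLParser())

p = tree.xpath("//p[1]")[0]
i_elem = p[0]  # the <i> child of <p>

# .text only covers what precedes the first child, which is just whitespace here;
# the text after </i> is stored as the tail of <i>
print(repr(p.text))
print(repr(i_elem.tail))

# normalize-space() makes the XPath call return a string, not a list of
# elements, with leading/trailing whitespace removed and inner runs collapsed
wanted = tree.xpath("normalize-space(//p[1])")
print(wanted)  # prints: WANTED TEXT
```

Because the XPath expression now evaluates to a string, there is no [0] indexing and no .text access on the result.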