I have the following HTML snippet:

<div>
    <p class="test1"> 
        <i class="empty"> </i> 
        WANTED TEXT 
    </p>
</div>

I want to extract the text content of the p tag (WANTED TEXT). My code:

from io import StringIO

from lxml import etree

html = """<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>"""
html_parser = etree.HTMLParser()
tmp_xp = "//p[1]"
selected_tag = etree.parse(StringIO(html), html_parser).xpath(tmp_xp)
print(selected_tag[0].text)

The code prints nothing. If I move WANTED TEXT to before the <i> tag, the code starts to work fine.

How can I solve this?

2 Answers


  1. Call .itertext() on selected_tag[0] and join the pieces:

    print("".join(selected_tag[0].itertext()).strip())
    

    Prints:

    WANTED TEXT
    
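  2. The reason .text doesn’t work is that, in lxml, an element's .text holds only the text that comes before its first child. The text WANTED TEXT (along with the surrounding whitespace) comes after the closing </i> tag, so it is stored as the .tail of the i element.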

    In addition to the .join(), .itertext(), and .strip() shown in another answer, you can also use plain XPath (normalize-space()).

    Just change:

    print(selected_tag[0].text)
    

    to:

    print(selected_tag[0].xpath("normalize-space()"))
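To see where the text actually lives, here is a minimal sketch that inspects the parsed tree from the question and then applies both fixes from the answers. The variable names (`p`, `i`, `tail_after_i`, and so on) are only illustrative and not from the original post.

```python
from io import StringIO

from lxml import etree

html = '<div> <p class="test1"> <i class="empty"> </i> WANTED TEXT </p> </div>'
tree = etree.parse(StringIO(html), etree.HTMLParser())

p = tree.xpath("//p[1]")[0]
i = p[0]  # the <i> child is the first (and only) child of <p>

# .text is only the text between <p> and its first child (whitespace or None here)
text_before_child = p.text
# The text after </i> is stored on the child as .tail
tail_after_i = i.tail

# Fix 1: walk every text node under <p>, then trim
via_itertext = "".join(p.itertext()).strip()
# Fix 2: let XPath compute the whitespace-normalized string value of <p>
via_xpath = p.xpath("normalize-space()")

print(repr(text_before_child))
print(repr(tail_after_i))
print(via_itertext)
print(via_xpath)
```

Both fixes return the same string for this input. One difference worth knowing: normalize-space() also collapses runs of internal whitespace to single spaces, while .strip() only trims the ends.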
    