I have an HTML file and want to get the list of text lines from it.

<html><head></head><body>subject marks
    <div dir="ltr">marks<br/><br/><div class="gmail_quote"><div class="gmail_attr" dir="ltr">  Forwarded message  <br/>From: <strong class="gmail_sendername" dir="auto"></strong> <span dir="auto">&lt;<a href="mailto:"></a>&gt;</span><br/>Date: Wed, Nov3 at 1:11 PM<br/>Subject:Marks<br/>To: sk &lt;<a href="sh.com">[email protected]</a>&gt;, &lt;<a href="mailto:"></a>&gt;<br/></div><br/><br/><div dir="ltr">1)Physics 50<div> 2)Maths 46</div><div> 3)Chemistrry 42</div><div> 4)geography 49</div><div> 5)History 40</div> 7)English - 42<div><br/></div><div><div><div data-smartmail="gmail_signature" dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div><font size="2"><span style="color:rgb(0,0,255)"><b>REGARDS,<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>AGR<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>ljs</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>988ss79808</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>9226djgf8468</b></span></font><br/></div></div></div></div></div></div></div>
    </div></div>
    
    <br/>
    <div></body></html>

Please note: I want the solution to be generic, since I also want to extract lines when the HTML contains a table. It should not split on certain tags such as <td> and <a>.

In this HTML, subjects 2 to 5 are in nested div tags, which are children of the div that contains subject 1 and subject 6 (numbered 7 in the sample).

Could anyone help me parse this into lines without using text.splitlines()? That splits on all tags.

Expected output: a list of lines.

My approach: I have used
"""

for element in soup.find_all():
    if element.name not in ['tr', 'a']:
        element.append('\n')

"""

This puts subject 1 and subject 2 on a single line.

2 Answers


  1. Try this

    import re
    
    def extract_lines(html):
    
      # Remove all whitespace between tags.
      html = re.sub(r">\s*<", "><", html)
    
      # Split the HTML string into lines.
      lines = html.splitlines()
    
      # Return a list of strings, each representing a line of text.
      return [line.strip() for line in lines]
    

    The above code can be used like this

    html = """
    <html><head></head><body>subject marks
    <div dir="ltr">marks<br/><br/><div class="gmail_quote"><div class="gmail_attr" dir="ltr">  Forwarded message  <br/>From: <strong class="gmail_sendername" dir="auto"></strong> <span dir="auto">&lt;<a href="mailto:"></a>&gt;</span><br/>Date: Wed, Nov3 at 1:11 PM<br/>Subject:Marks<br/>To: sk &lt;<a href="sh.com">[email protected]</a>&gt;, &lt;<a href="mailto:"></a>&gt;<br/></div><br/><br/><div dir="ltr">1)Physics 50<div> 2)Maths 46</div><div> 3)Chemistrry 42</div><div> 4)geography 49</div><div> 5)History 40</div> 7)English - 42<div><br/></div><div><div><div data-smartmail="gmail_signature" dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div><font size="2"><span style="color:rgb(0,0,255)"><b>REGARDS,<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>AGR<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>ljs</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>988ss79808</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>9226djgf8468</b></span></font><br/></div></div></div></div></div></div></div>
    </div></div>
    
    <br/>
    <div></body></html>
    """
    
    lines = extract_lines(html)
    
    print(lines)
    

    Hope this is what you are looking for.

  2. I tried BeautifulSoup's soup.get_text('\n') and got this text result:

    subject marks
    
    marks
      Forwarded message
    From:
    
    <
    >
    Date: Wed, Nov3 at 1:11 PM
    Subject:Marks
    To: sk <
    [email protected]
    >, <
    >
    1)Physics 50
     2)Maths 46
     3)Chemistrry 42
     4)geography 49
     5)History 40
     7)English - 42
    REGARDS,
    AGR
    ljs
    988ss79808
    9226djgf8468
    

    As you can see, it’s not what we would expect 🙁

    Then I had a go with html2text,
    running html2text.html2text(html), and wasn’t totally happy:

    subject marks
    
    marks
    
    
    Forwarded message
    From: **** <[](mailto:)>
    Date: Wed, Nov3 at 1:11 PM
    Subject:Marks
    To: sk <[[email protected]](sh.com)>, <[](mailto:)>
    
    
    
    
    1)Physics 50
    
    2)Maths 46
    
    3)Chemistrry 42
    
    4)geography 49
    
    5)History 40
    
    7)English - 42
    
    
    
     **REGARDS,
    **
    
     **AGR
    **
    
     **ljs**
    
     **988ss79808**
    
     **9226djgf8468**
    

    The output seems better and looks like Markdown:
    you can see the ** for bold and the links
    [](mailto:). You could perhaps play with the
    html2text options to get the desired output.

    If not, I would consider using a better rendering engine such as
    wkhtmltopdf to create a PDF.

    This PDF could then be converted back to text with
    pdftotext. I tried it out:

    wkhtmltopdf source.html rendered.pdf
    pdftotext -eol unix -layout -nopgbrk rendered.pdf rendered.txt
    

    And got this text output:

    subject marks
    marks
    
    Forwarded message
    From: <>
    Date: Wed, Nov3 at 1:11 PM
    Subject:Marks
    To: sk <[email protected]>, <>
    
    1)Physics 50
    2)Maths 46
    3)Chemistrry 42
    4)geography 49
    5)History 40
    7)English - 42
    
    REGARDS,
    AGR
    ljs
    988ss79808
    9226djgf8468
    

    In my opinion, it looks more like the real browser rendering.
    You’ll then be able to split lines.
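
For a dependency-free alternative to the tools tried above, here is a minimal sketch using only the standard library's html.parser. It is not taken from either answer. The names BLOCK_TAGS, CELL_TAGS, LineExtractor and extract_lines are hypothetical, and the choice of which tags break a line is an assumption that only roughly mimics browser rendering. Table cells are deliberately treated as inline so that each table row stays on one line.

```python
from html.parser import HTMLParser

# Assumption: these tags start a new rendered line, roughly like a browser.
BLOCK_TAGS = {
    "body", "div", "p", "br", "hr", "li", "ul", "ol", "pre", "blockquote",
    "table", "tr", "h1", "h2", "h3", "h4", "h5", "h6",
}
# Assumption: cells stay on the row's line, separated by a space.
CELL_TAGS = {"td", "th"}


class LineExtractor(HTMLParser):
    """Collect text into lines, breaking only at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self.current = []

    def _flush(self):
        # Collapse runs of whitespace and drop lines that end up empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag in CELL_TAGS:
            self.current.append(" ")

    def handle_data(self, data):
        # Note: text inside <script>/<style> is not filtered in this sketch.
        self.current.append(data)


def extract_lines(html):
    parser = LineExtractor()
    parser.feed(html)
    parser.close()    # deliver any text still buffered by the parser
    parser._flush()   # emit the final line, if any
    return parser.lines


sample = (
    '<div dir="ltr">1)Physics 50<div> 2)Maths 46</div>'
    '<div> 3)Chemistry 42</div> 7)English - 42<br/><b>REGARDS,</b></div>'
)
print(extract_lines(sample))
```

Because inline tags such as b, span, font and a never flush the current line, text like "To: sk <...>" stays together, while nested div tags each start a new line. Adjust BLOCK_TAGS if your documents need different break points.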
