I have an HTML file and want to get the list of lines from it.
<html><head></head><body>subject marks
<div dir="ltr">marks<br/><br/><div class="gmail_quote"><div class="gmail_attr" dir="ltr"> Forwarded message <br/>From: <strong class="gmail_sendername" dir="auto"></strong> <span dir="auto"><<a href="mailto:"></a>></span><br/>Date: Wed, Nov3 at 1:11 PM<br/>Subject:Marks<br/>To: sk <<a href="sh.com">[email protected]</a>>, <<a href="mailto:"></a>><br/></div><br/><br/><div dir="ltr">1)Physics 50<div> 2)Maths 46</div><div> 3)Chemistrry 42</div><div> 4)geography 49</div><div> 5)History 40</div> 7)English - 42<div><br/></div><div><div><div data-smartmail="gmail_signature" dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div><font size="2"><span style="color:rgb(0,0,255)"><b>REGARDS,<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>AGR<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>ljs</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>988ss79808</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>9226djgf8468</b></span></font><br/></div></div></div></div></div></div></div>
</div></div>
<br/>
<div></body></html>
I want to get the list of lines from it.
Please note: I want a generic solution, since I want to extract lines even if there is a table in it. It should not split on certain tags such as <td> or <a>.
In this example, subjects 2 to 5 are in nested div tags, which are children of the div that contains subject 1 and the last subject.
Could anyone help me parse this into lines without using text.splitlines(), because that splits on all tags?
The expected output is a list of lines.
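My approach: I have used
"""
from bs4 import BeautifulSoup

# Assumption: html holds the document shown above; the parser choice is illustrative.
soup = BeautifulSoup(html, 'html.parser')
for element in soup.find_all():
    if element.name not in ['tr', 'a']:
        element.append('\n')
"""
This gives subject 1 and subject 2 on a single line.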
2 Answers
Try this
The above code can be used like this
Hope this is what you are looking for.
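A minimal sketch of one generic approach, offered as an illustration rather than as the snippet this answer referred to: start a new line only at block-level tags and let inline tags such as <a> or <b> pass through. It swaps BeautifulSoup for the standard library's html.parser so it has no dependencies. The names BLOCK_TAGS, LineExtractor and html_to_lines are hypothetical, and the set of block tags is an assumption you may need to adjust.

```python
from html.parser import HTMLParser

# Assumption: tags treated as line boundaries. Inline tags such as <a>, <b>,
# <span> and <font> are deliberately left out so they never split a line.
BLOCK_TAGS = {"div", "p", "br", "tr", "li", "table", "ul", "ol",
              "h1", "h2", "h3", "h4", "h5", "h6"}
CELL_TAGS = {"td", "th"}


class LineExtractor(HTMLParser):
    """Collects text and starts a new line at every block-level boundary."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines
        self._parts = []   # text fragments of the line being built

    def flush_line(self):
        # Join the fragments, collapse whitespace, keep only non-empty lines.
        line = " ".join("".join(self._parts).split())
        if line:
            self.lines.append(line)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_line()
        elif tag in CELL_TAGS:
            # Keep table cells on one line, separated by a space.
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_line()

    def handle_data(self, data):
        self._parts.append(data)


def html_to_lines(html):
    parser = LineExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_line()  # emit any text left after the last block tag
    return parser.lines
```

Called as html_to_lines(html) on the document from the question, this keeps each subject on its own line even though some sit directly in the outer div and others in nested divs, and a table row comes out as a single line.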
I just had a try with BeautifulSoup
soup.get_text('\n')
and got this text result: As you can see, it’s not what we would expect 🙁
Then I had a go with html2text, running
html2text.html2text(html)
and wasn’t totally happy: The output seems better and looks like some kind of Markdown, as you can see from the ** for bold and the links such as [](mailto:). You could perhaps play with the html2text options to get the desired output.
If not, I would consider using a better rendering engine such as wkhtmltopdf to create a PDF. This PDF could then be converted back to text with pdftotext. I tried it out:
And got this text output:
In my opinion, it looks more like the real browser rendering.
You’ll then be able to split lines.
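The wkhtmltopdf and pdftotext round trip could be scripted roughly as follows. This is only a sketch: it assumes both command-line tools are installed and on the PATH, and the helper names build_commands and rendered_lines are made up for illustration. The -layout option asks pdftotext to preserve the physical layout of the page.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(html_path, pdf_path, txt_path):
    """Return the two command lines: HTML -> PDF, then PDF -> text."""
    return [
        ["wkhtmltopdf", str(html_path), str(pdf_path)],
        # -layout keeps the text positioned as it was rendered on the page.
        ["pdftotext", "-layout", str(pdf_path), str(txt_path)],
    ]


def rendered_lines(html_path):
    """Render the HTML through a PDF and return the non-empty text lines."""
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "page.pdf"
        txt_path = Path(tmp) / "page.txt"
        for command in build_commands(html_path, pdf_path, txt_path):
            # check=True raises CalledProcessError if a tool fails.
            subprocess.run(command, check=True)
        text = txt_path.read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling rendered_lines on the saved HTML file would then give the list of lines as the rendering engine laid them out, at the cost of two external processes per document.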