HTML parsing without text.splitlines(): custom split into lines using Python

Sameer
October 16, 2023
2 Answers

I have an HTML file and want to get the list of lines from it.

<html><head></head><body>subject marks
    <div dir="ltr">marks<br/><br/><div class="gmail_quote"><div class="gmail_attr" dir="ltr">  Forwarded message  <br/>From: <strong class="gmail_sendername" dir="auto"></strong> <span dir="auto">&lt;<a href="mailto:"></a>&gt;</span><br/>Date: Wed, Nov3 at 1:11 PM<br/>Subject:Marks<br/>To: sk &lt;<a href="sh.com">[email protected]</a>&gt;, &lt;<a href="mailto:"></a>&gt;<br/></div><br/><br/><div dir="ltr">1)Physics 50<div> 2)Maths 46</div><div> 3)Chemistrry 42</div><div> 4)geography 49</div><div> 5)History 40</div> 7)English - 42<div><br/></div><div><div><div data-smartmail="gmail_signature" dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div><font size="2"><span style="color:rgb(0,0,255)"><b>REGARDS,<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>AGR<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>ljs</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>988ss79808</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>9226djgf8468</b></span></font><br/></div></div></div></div></div></div></div>
    </div></div>
    
    <br/>
    <div></body></html>

Please note: I want a generic solution, since I want to extract lines even if the HTML contains a table. I do not want to split on certain tags such as <td> and <a>.

In this example, subjects 2 to 5 are in nested div tags, which are children of the div that contains subjects 1 and 6.

Could anyone help me parse this into lines without using text.splitlines(), because that splits on all tags?

The expected output is a list of lines.

My approach: I have used
"""

for element in soup.find_all():
    if element.name not in ['tr', 'a']:
        element.append('\n')

"""

This puts subjects 1 and 2 on a single line.

Answers

Try this

import re

def extract_lines(html):

  # Remove all whitespace between tags.
  html = re.sub(r">\s*<", "><", html)

  # Split the HTML string into lines.
  lines = html.splitlines()

  # Return a list of strings, each representing a line of text.
  return [line.strip() for line in lines]

The above code can be used like this

html = """
<html><head></head><body>subject marks
<div dir="ltr">marks<br/><br/><div class="gmail_quote"><div class="gmail_attr" dir="ltr">  Forwarded message  <br/>From: <strong class="gmail_sendername" dir="auto"></strong> <span dir="auto">&lt;<a href="mailto:"></a>&gt;</span><br/>Date: Wed, Nov3 at 1:11 PM<br/>Subject:Marks<br/>To: sk &lt;<a href="sh.com">[email protected]</a>&gt;, &lt;<a href="mailto:"></a>&gt;<br/></div><br/><br/><div dir="ltr">1)Physics 50<div> 2)Maths 46</div><div> 3)Chemistrry 42</div><div> 4)geography 49</div><div> 5)History 40</div> 7)English - 42<div><br/></div><div><div><div data-smartmail="gmail_signature" dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div><font size="2"><span style="color:rgb(0,0,255)"><b>REGARDS,<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>AGR<br/></b></span></font></div><font size="2"><span style="color:rgb(0,0,255)"><b>ljs</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>988ss79808</b></span></font></div><div dir="ltr"><font size="2"><span style="color:rgb(0,0,255)"><b>9226djgf8468</b></span></font><br/></div></div></div></div></div></div></div>
</div></div>

<br/>
<div></body></html>
"""

lines = extract_lines(html)

print(lines)

Hope this is what you are looking for.

- PatrickJanser
- October 16, 2023 at 2:13 pm
I just tried BeautifulSoup's soup.get_text('\n') and got this text result:
```
subject marks

marks
  Forwarded message
From:

<
>
Date: Wed, Nov3 at 1:11 PM
Subject:Marks
To: sk <
[email protected]
>, <
>
1)Physics 50
 2)Maths 46
 3)Chemistrry 42
 4)geography 49
 5)History 40
 7)English - 42
REGARDS,
AGR
ljs
988ss79808
9226djgf8468
```
As you can see, it’s not what we would expect 🙁
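
The result above breaks at every tag, including inline ones such as `<a>` and `<span>`. As a rough sketch of an alternative, the standard-library `html.parser` can be told to break lines only at block-level tags and `<br>`. The names `BLOCK_TAGS`, `CELL_TAGS`, `LineExtractor` and `html_to_lines` are made up for this illustration, and the tag sets are an assumption you would adjust to your own documents. Content inside `<script>` or `<style>` is not filtered out here.

```python
from html.parser import HTMLParser

# Assumption: these tags end the current line; anything else is treated as inline.
BLOCK_TAGS = {"div", "p", "br", "tr", "li", "table", "ul", "ol", "body",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
# Assumption: table cells stay on the row's line, separated by a space.
CELL_TAGS = {"td", "th"}


class LineExtractor(HTMLParser):
    """Collect text, breaking lines only at block-level tags and <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # finished lines
        self.current = []  # text fragments of the line being built

    def _flush(self):
        # Join fragments, collapse whitespace runs, keep only non-empty lines.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in CELL_TAGS:
            self.current.append(" ")  # keep cells of one row apart
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.current.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever text is still pending


def html_to_lines(html):
    parser = LineExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Calling `html_to_lines(html)` on the question's markup should keep each `To: ...` header on one line, because `<a>` and `<span>` no longer force a break, while the nested `<div>` elements still put each subject on its own line.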

Then I had a go with html2text,
running html2text.html2text(html), and wasn’t totally happy:
```
subject marks

marks


Forwarded message
From: **** <[](mailto:)>
Date: Wed, Nov3 at 1:11 PM
Subject:Marks
To: sk <[[email protected]](sh.com)>, <[](mailto:)>




1)Physics 50

2)Maths 46

3)Chemistrry 42

4)geography 49

5)History 40

7)English - 42



 **REGARDS,
**

 **AGR
**

 **ljs**

 **988ss79808**

 **9226djgf8468**
```
The output seems better and looks like Markdown,
as you can see from the ** for bold and the links such as
[](mailto:). You could perhaps play with the
html2text options to get the desired output.
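
If the options alone do not get you there, another possibility is a small post-pass over the Markdown-like text. The sketch below is not an html2text feature, just a regex cleanup written for this illustration (the function name `strip_markdown_residue` is made up). It assumes the only markers to remove are `**` and `[label](target)` links, as seen in the output above.

```python
import re


def strip_markdown_residue(text):
    """Remove bold and link markers, then return the non-empty lines."""
    # [label](target) -> label; an empty label such as [](mailto:) disappears.
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Drop ** markers, including pairs that straddle a line break.
    text = text.replace("**", "")
    # Collapse whitespace inside each line and skip lines that end up empty.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]
```

Labels that themselves contain square brackets are not matched by this pattern and would need extra handling.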

If not, I would consider using a better rendering engine such as
wkhtmltopdf to create a PDF.

This PDF could then be converted back to text with
pdftotext. I tried it out:
```
wkhtmltopdf source.html rendered.pdf
pdftotext -eol unix -layout -nopgbrk rendered.pdf rendered.txt
```
And got this text output:
```
subject marks
marks

Forwarded message
From: <>
Date: Wed, Nov3 at 1:11 PM
Subject:Marks
To: sk <[email protected]>, <>

1)Physics 50
2)Maths 46
3)Chemistrry 42
4)geography 49
5)History 40
7)English - 42

REGARDS,
AGR
ljs
988ss79808
9226djgf8468
```
In my opinion, it looks more like the real browser rendering.
You’ll then be able to split lines.
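
For completeness, here is a minimal sketch of driving the same two commands from Python and splitting the result into lines. It assumes both `wkhtmltopdf` and `pdftotext` are installed and on the PATH. The function names and the default file names are only illustrative.

```python
import subprocess


def render_to_text(html_path, pdf_path="rendered.pdf", txt_path="rendered.txt"):
    """Run the two commands shown above and return the extracted text."""
    subprocess.run(["wkhtmltopdf", html_path, pdf_path], check=True)
    subprocess.run(
        ["pdftotext", "-eol", "unix", "-layout", "-nopgbrk", pdf_path, txt_path],
        check=True,
    )
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()


def split_rendered_lines(text):
    """Return the non-empty lines of the rendered text, stripped of padding."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Usage would be `lines = split_rendered_lines(render_to_text("source.html"))`. Note that stripping each line discards the indentation that `-layout` preserves, so drop the `strip()` call if the column positions matter, for example for tables.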