I’ve been using BeautifulSoup to parse HTML documents, but I’m running into issues with HTML tags embedded inside a single word, e.g. <p> Hell <font> o </font> </p>. BeautifulSoup splits those up. I am trying to figure out how to keep words together. To be honest, I am not sure it’s possible, but when the HTML is rendered in a browser it all looks correct.
Things I’ve considered:
- Don’t use a separator. That breaks down because sometimes tags are used to separate paragraphs.
- Treat certain tags as special cases and don’t use a separator for them. This doesn’t work because different tags are used differently. Sometimes it’s one tag, other times it’s another; there isn’t much consistency.
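To make the trade-off between these two options concrete, here is a minimal sketch on a shortened, made-up snippet (not the real document): with a separator the split word stays broken, and without one the paragraphs run together.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened input: one word split across two <font> tags,
# followed by a second paragraph.
snippet = (
    "<p><font>Decem</font><font>ber 30, 2017.</font></p>"
    "<p><font>Next paragraph.</font></p>"
)
soup = BeautifulSoup(snippet, "html.parser")

# A separator is inserted between every text node, including mid-word.
with_sep = soup.get_text(separator=" ")
# No separator glues the word back together, but also glues the paragraphs.
without_sep = soup.get_text()

print(with_sep)     # Decem ber 30, 2017. Next paragraph.
print(without_sep)  # December 30, 2017.Next paragraph.
```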
Any ideas would be appreciated. Thanks!
from bs4 import BeautifulSoup
import html
data = """<p style='text-align:justify;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
<font style='font-family:Times New Roman;font-size:11pt;font-style:italic;margin-left:0pt;' >Accounting Pronouncements Adopted</font>
</p>
<p style='text-align:justify;line-height:12pt;' ></p>
<p style='text-align:left;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
<font style='font-family:Times New Roman;font-size:11pt;margin-left:18pt;' >In March 2016, the Financial Accounting Standard Board (&#8220;FASB&#8221;) issued Accounting Standards Update (&#8220;ASU&#8221;) No. 2016-09, &#8220;Stock Compensation&#8221; (Topic 718) (&#8220;ASU 2016-09&#8221;). ASU 2016-09 contains amended guidance for </font>
<font style='font-family:Times New Roman;font-size:11pt;' >share-based payment accounting. We adopted the provisions of this standard during the first quarter of 2017. </font>
</p>
<p style='text-align:left;line-height:12pt;' ></p>
<p style='text-align:left;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
<font style='font-family:Times New Roman;font-size:11pt;margin-left:18pt;' >Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the st</font>
<font style='font-family:Times New Roman;font-size:11pt;' >ock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017. Prior to the implementation of ASU 2016-09, excess tax benefits w</font>
<font style='font-family:Times New Roman;font-size:11pt;' >ere recorded as a component of A</font>
<font style='font-family:Times New Roman;font-size:11pt;' >dditional pai</font>
<font style='font-family:Times New Roman;font-size:11pt;' >d-</font>
<font style='font-family:Times New Roman;font-size:11pt;' >in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits. The adoption of ASU 2016-09 reduced income tax expense by approximately $</font>
<font style='font-family:Times New Roman;font-size:11pt;' >19.</font>
<font style='font-family:Times New Roman;font-size:11pt;' >6</font>
<font style='font-family:Times New Roman;font-size:11pt;' > million for the </font>
<font style='font-family:Times New Roman;font-size:11pt;' >year</font>
<font style='font-family:Times New Roman;font-size:11pt;' > ended </font>
<font style='font-family:Times New Roman;font-size:11pt;' >Decem</font>
<font style='font-family:Times New Roman;font-size:11pt;' >ber 30, 2017.</font>
</p>"""
soup = BeautifulSoup(data, features="html.parser")
result = soup.get_text(separator=" ").strip().strip('\n')
result = html.unescape(result)
print(result)
Result:
Accounting Pronouncements Adopted
In March 2016, the Financial Accounting Standard Board (“FASB”) issued Accounting Standards Update (“ASU”) No. 2016-09, “Stock Compensation” (Topic 718) (“ASU 2016-09”). ASU 2016-09 contains amended guidance for
share-based payment accounting. We adopted the provisions of this standard during the first quarter of 2017.
Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the st
ock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017. Prior to the implementation of ASU 2016-09, excess tax benefits w
ere recorded as a component of A
dditional pai
d-
in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits. The adoption of ASU 2016-09 reduced income tax expense by approximately $
19.
6
million for the
year
ended
Decem
ber 30, 2017.
2 Answers
Here is one possible solution for handling the text with beautifulsoup:
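A minimal sketch of one way this could look (not necessarily the code originally posted in this answer). It assumes that each `<p>` is a paragraph, that the newlines between the `<font>` tags are only source formatting, and that any real space between words lives inside a `<font>` run, as it does in the sample data. The function name `paragraph_texts` and the shortened sample are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def paragraph_texts(markup):
    """Return one string per non-empty <p>, gluing inline runs together."""
    soup = BeautifulSoup(markup, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        pieces = []
        for child in p.children:
            if isinstance(child, NavigableString):
                # Assumption: whitespace-only strings directly inside <p>
                # are source indentation, so they are skipped.
                if child.strip():
                    pieces.append(str(child))
            else:
                # Keep the text of inline tags as-is, including any
                # leading/trailing spaces such as " million for the ".
                pieces.append(child.get_text())
        # Join without a separator, then collapse runs of whitespace.
        text = " ".join("".join(pieces).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Shortened, hypothetical sample in the same shape as the question's data.
sample = """<p>
<font>Prior to 2017, excess tax benefits w</font>
<font>ere recorded as a component of A</font>
<font>dditional pai</font>
<font>d-</font>
<font>in capital, about $</font>
<font>19.</font>
<font>6</font>
<font> million for the </font>
<font>year</font>
<font> ended </font>
<font>Decem</font>
<font>ber 30, 2017.</font>
</p>
<p></p>"""

result = paragraph_texts(sample)
print("\n".join(result))
```

If a document relies on the whitespace between tags to separate words, this approach would glue those words together, so it is worth checking against a few real paragraphs first.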
Prints:
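Try soup.stripped_strings. It yields every piece of text in the document with leading and trailing whitespace stripped (it is a generator, so wrap it in list() if you need a list). If you do ''.join(soup.stripped_strings) you’ll get all the text in a single line.
Output: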
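A small sketch of what this suggestion does on a shortened, made-up snippet in the same shape as the question's data. One thing to watch for: because every string is stripped before joining, the spaces that belong between words disappear along with the breaks that split words.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened input modeled on the last paragraph of the question.
snippet = """<p>
<font>$</font>
<font>19.</font>
<font>6</font>
<font> million for the </font>
<font>year</font>
<font> ended </font>
<font>Decem</font>
<font>ber 30, 2017.</font>
</p>"""
soup = BeautifulSoup(snippet, "html.parser")

# stripped_strings is a generator; whitespace-only strings are skipped.
pieces = list(soup.stripped_strings)
joined = "".join(pieces)

print(pieces)
print(joined)  # $19.6million for theyearendedDecember 30, 2017.
```

So "Decem" and "ber" are rejoined, but " million for the " loses its surrounding spaces as well, which may or may not be acceptable depending on the document.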