
I’ve been using BeautifulSoup to parse HTML documents, but I’m running into issues when HTML tags are embedded inside a single word, e.g. <p>Hell<font>o</font></p>. BeautifulSoup splits such words up. I am trying to figure out how I can keep the words together. To be honest, I am not sure it’s possible, but when the HTML is rendered it obviously looks correct to the eye.

Things I’ve considered:

  1. Don’t use a separator. That breaks down, since sometimes tags are used to separate paragraphs.
  2. Treat all tags as special cases and don’t use a separator for them. This doesn’t work, since different tags are used differently: sometimes it’s one tag, other times it’s another, and there isn’t much consistency.
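
A minimal sketch of the trade-off in option 1, using a made-up two-paragraph snippet rather than the real data: with a separator the split word gets a stray space, and without one the paragraphs run together.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> contains a word split by a <font> tag.
snippet = "<p>Hell<font>o</font></p><p>World</p>"
soup = BeautifulSoup(snippet, "html.parser")

# A separator is inserted between every text node, including inside the word.
with_sep = soup.get_text(separator=" ")

# No separator keeps the word intact but also glues the paragraphs together.
without_sep = soup.get_text()

print(repr(with_sep))     # 'Hell o World'
print(repr(without_sep))  # 'HelloWorld'
```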

Any ideas would be appreciated. Thanks!

from bs4 import BeautifulSoup
import html

data = """<p style='text-align:justify;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
      <font style='font-family:Times New Roman;font-size:11pt;font-style:italic;margin-left:0pt;' >Accounting Pronouncements Adopted</font>
    </p>
    <p style='text-align:justify;line-height:12pt;' ></p>
    <p style='text-align:left;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
      <font style='font-family:Times New Roman;font-size:11pt;margin-left:18pt;' >In March 2016, the Financial Accounting Standard Board (&amp;#8220;FASB&amp;#8221;) issued Accounting Standards Update (&amp;#8220;ASU&amp;#8221;) No. 2016-09, &amp;#8220;Stock Compensation&amp;#8221; (Topic 718) (&amp;#8220;ASU 2016-09&amp;#8221;).  ASU 2016-09 contains amended guidance for </font>
      <font style='font-family:Times New Roman;font-size:11pt;' >share-based payment accounting.  We adopted the provisions of this standard during the first quarter of 2017.  </font>
    </p>
    <p style='text-align:left;line-height:12pt;' ></p>
    <p style='text-align:left;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
      <font style='font-family:Times New Roman;font-size:11pt;margin-left:18pt;' >Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the st</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >ock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017.  Prior to the implementation of ASU 2016-09, excess tax benefits w</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >ere recorded as a component of A</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >dditional pai</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >d-</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits.  The adoption of ASU 2016-09 reduced income tax expense by approximately $</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >19.</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >6</font>
      <font style='font-family:Times New Roman;font-size:11pt;' > million for the </font>
      <font style='font-family:Times New Roman;font-size:11pt;' >year</font>
      <font style='font-family:Times New Roman;font-size:11pt;' > ended </font>
      <font style='font-family:Times New Roman;font-size:11pt;' >Decem</font>
      <font style='font-family:Times New Roman;font-size:11pt;' >ber 30, 2017.</font>
    </p>"""



soup = BeautifulSoup(data, features="html.parser")

result = soup.get_text(separator=" ").strip().strip('\n')
result = html.unescape(result)

print(result)

Result:

Accounting Pronouncements Adopted 
 
 
 
 In March 2016, the Financial Accounting Standard Board (“FASB”) issued Accounting Standards Update (“ASU”) No. 2016-09, “Stock Compensation” (Topic 718) (“ASU 2016-09”).  ASU 2016-09 contains amended guidance for  
 share-based payment accounting.  We adopted the provisions of this standard during the first quarter of 2017.   
 
 
 
 Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the st 
 ock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017.  Prior to the implementation of ASU 2016-09, excess tax benefits w 
 ere recorded as a component of A 
 dditional pai 
 d- 
 in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits.  The adoption of ASU 2016-09 reduced income tax expense by approximately $ 
 19. 
 6 
  million for the  
 year 
  ended  
 Decem 
 ber 30, 2017.

2 Answers


  1. Here is one possible way to handle the text with BeautifulSoup:

    import html
    import re
    
    from bs4 import BeautifulSoup, NavigableString
    
    data = """<p style='text-align:justify;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
          <font style='font-family:Times New Roman;font-size:11pt;font-style:italic;margin-left:0pt;' >Accounting Pronouncements Adopted</font>
        </p>
        <p style='text-align:justify;line-height:12pt;' ></p>
        <p style='text-align:left;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
          <font style='font-family:Times New Roman;font-size:11pt;margin-left:18pt;' >In March 2016, the Financial Accounting Standard Board (&amp;#8220;FASB&amp;#8221;) issued Accounting Standards Update (&amp;#8220;ASU&amp;#8221;) No. 2016-09, &amp;#8220;Stock Compensation&amp;#8221; (Topic 718) (&amp;#8220;ASU 2016-09&amp;#8221;).  ASU 2016-09 contains amended guidance for </font>
          <font style='font-family:Times New Roman;font-size:11pt;' >share-based payment accounting.  We adopted the provisions of this standard during the first quarter of 2017.  </font>
        </p>
        <p style='text-align:left;line-height:12pt;' ></p>
        <p style='text-align:left;margin-top:0pt;margin-bottom:0pt;line-height:12pt;' >
          <font style='font-family:Times New Roman;font-size:11pt;margin-left:18pt;' >Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the st</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >ock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017.  Prior to the implementation of ASU 2016-09, excess tax benefits w</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >ere recorded as a component of A</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >dditional pai</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >d-</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits.  The adoption of ASU 2016-09 reduced income tax expense by approximately $</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >19.</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >6</font>
          <font style='font-family:Times New Roman;font-size:11pt;' > million for the </font>
          <font style='font-family:Times New Roman;font-size:11pt;' >year</font>
          <font style='font-family:Times New Roman;font-size:11pt;' > ended </font>
          <font style='font-family:Times New Roman;font-size:11pt;' >Decem</font>
          <font style='font-family:Times New Roman;font-size:11pt;' >ber 30, 2017.</font>
        </p>"""
    
    
    soup = BeautifulSoup(data, features="html.parser")
    
    # unwrap tags that we don't need
    for tag in soup.select("font, b, i, span"):
        tag.unwrap()
    
    # "join" adjacent NavigableStrings in tag.contents together
    soup.smooth()
    
    # in <p> tags, remove \n to "handle" them like in a browser
    for p in soup.select("p"):
        for i, c in enumerate(p.contents):
            if isinstance(c, NavigableString):
                p.contents[i].replace_with(re.sub(r"\n+", "", c))
    
    result = soup.get_text(strip=True, separator="\n\n")
    result = html.unescape(result)
    
    print(result)
    

    Prints:

    Accounting Pronouncements Adopted
    
    In March 2016, the Financial Accounting Standard Board (“FASB”) issued Accounting Standards Update (“ASU”) No. 2016-09, “Stock Compensation” (Topic 718) (“ASU 2016-09”).  ASU 2016-09 contains amended guidance for share-based payment accounting.  We adopted the provisions of this standard during the first quarter of 2017.
    
    Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the stock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017.  Prior to the implementation of ASU 2016-09, excess tax benefits were recorded as a component of Additional paid-in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits.  The adoption of ASU 2016-09 reduced income tax expense by approximately $19.6 million for the year ended December 30, 2017.
    
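  2. Try soup.stripped_strings. It yields every piece of text inside the tags with surrounding whitespace stripped (it is a generator rather than a list).

    If you do ''.join(soup.stripped_strings) you’ll get all the text on a single line. Note that spaces at the edges of each fragment are stripped as well, so some words end up glued together (e.g. "forshare-based" in the output below).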

    Output:

    Accounting Pronouncements AdoptedIn March 2016, the Financial Accounting Standard Board (&#8220;FASB&#8221;) issued Accounting Standards Update (&#8220;ASU&#8221;) No. 2016-09, &#8220;Stock Compensation&#8221; (Topic 718) (&#8220;ASU 2016-09&#8221;).  ASU 2016-09 contains amended guidance forshare-based payment accounting.  We adopted the provisions of this standard during the first quarter of 2017.Under ASU 2016-09, all excess tax benefits and tax deficiencies resulting from the difference between the deduction for tax purposes and the stock-based compensation cost recognized for financial reporting purposes are included as a component of income tax expense as of January 1, 2017.  Prior to the implementation of ASU 2016-09, excess tax benefits were recorded as a component of Additional paid-in capital and tax deficiencies were recognized either as an offset to accumulated excess tax benefits or in the income statement if there were no accumulated excess tax benefits.  The adoption of ASU 2016-09 reduced income tax expense by approximately $19.6million for theyearendedDecember 30, 2017.
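
To see why joining `stripped_strings` glues neighbouring words together, and one possible way around it, here is a minimal sketch on a made-up fragment that is indented like the data in the question. The helper name `paragraph_text` and the rule it applies (a whitespace-only text node that contains a newline is treated as source formatting, not content) are assumptions for illustration and are not part of either answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, split across <font> tags like the data in the question.
snippet = """<p>
  <font>for the </font>
  <font>year</font>
  <font> ended </font>
  <font>Decem</font>
  <font>ber 30, 2017.</font>
</p>
<p>
  <font>Next paragraph.</font>
</p>"""

soup = BeautifulSoup(snippet, "html.parser")

# stripped_strings trims every fragment, so the spaces around "ended" vanish.
glued = "".join(soup.stripped_strings)


def paragraph_text(p):
    """Join the text nodes of one <p>, skipping formatting-only whitespace."""
    # Assumption: a whitespace-only node containing a newline sits between
    # tags purely because of how the source was laid out.
    parts = [s for s in p.strings if not (s.isspace() and "\n" in s)]
    return "".join(parts).strip()


paragraphs = [paragraph_text(p) for p in soup.find_all("p")]
# Drop empty paragraphs and separate the rest with a blank line.
kept = "\n\n".join(t for t in paragraphs if t)

print(glued)
print(kept)
```

The same rule would also glue two words whose only separation is a line break between tags, so it is only suitable when, as in this data, the real spaces are kept inside the tags.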
    