Some HTML authors use <br> inside a single <p> to separate blocks of text, which makes the pieces hard to split apart when scraping with bs4.

Example:

<p>
 part1
 <br> 
 part2
 <br>
 part3
 <br>
 part4
</p>

How can I transform it into "real" paragraphs, like this?

<p>
 <p>part1</p>
 <p>part2</p>
 <p>part3</p>
 <p>part4</p>
</p>

3 Answers


  1. Chosen as BEST ANSWER

    Since every <br> here sits inside a <p>, a simple string hack works: replace each <br> with a paragraph break before parsing.

    from bs4 import BeautifulSoup

    # html holds the page source as a string
    html = html.replace("<br>", "</p><p>")
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    

    Note that the replacement string is "</p><p>" (close the current paragraph, then open a new one), not "<p></p>" (an empty paragraph). The markup becomes:

    <p>
     part1
     </p><p>
     part2
     </p><p>
     part3
     </p><p>
     part4
    </p>
    

    Now every <p> has a matching </p>. The output is still a bit messy because of the leftover line breaks, which soup.prettify() can tidy up.
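
The plain str.replace call above only matches the exact spelling <br>. As a rough sketch of the same idea, assuming the page may also contain <br/>, <br /> or upper-case variants, a regular expression can do the substitution instead. The helper name split_paragraphs_on_br and the sample string are made up for illustration and are not part of the original answer:

```python
import re

from bs4 import BeautifulSoup

# Matches <br>, <br/>, <br /> and their upper-case spellings.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def split_paragraphs_on_br(html: str) -> list[str]:
    """Turn every <br> into a paragraph break and return each paragraph's text."""
    soup = BeautifulSoup(BR_TAG.sub("</p><p>", html), "html.parser")
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    # Drop empty paragraphs, e.g. the one a trailing <br> would produce.
    return [text for text in texts if text]


# Hypothetical sample mirroring the markup in the question.
sample = "<p>\n part1\n <br> \n part2\n <br/>\n part3\n <BR />\n part4\n</p>"
print(split_paragraphs_on_br(sample))  # ['part1', 'part2', 'part3', 'part4']
```

Like the original hack, this assumes every <br> really is inside a <p>; a <br> elsewhere in the page would be turned into stray paragraph tags.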


  2. You can’t put paragraphs inside paragraphs in HTML. It doesn’t make semantic sense either: a physical book can’t have paragraphs inside paragraphs. This markup:

    <p>
     <p>part1</p>
     <p>part2</p>
     <p>part3</p>
     <p>part4</p>
    </p>
    

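    The browser will parse it as follows, because an opening <p> implicitly closes any paragraph that is still open, and the stray </p> at the end produces an empty paragraph: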

    <p></p>
    
     <p>part1</p>
     <p>part2</p>
     <p>part3</p>
     <p>part4</p>
    
    <p></p>
    

  3. You should not wrap <p> tags in another <p> tag; each <p> (paragraph) element should stand on its own.
    Use one of the following structures instead:

    <div>part1</div>
    <div>part2</div>
    <div>part3</div>
    <div>part4</div>
    

    or

    <p>part1</p>
    <p>part2</p>
    <p>part3</p>
    <p>part4</p>
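
For completeness, here is one possible way to get from the original markup to the flat <p> structure above by working on the parsed tree rather than on the raw string. It is only a sketch: the function name br_to_paragraphs is invented for this example, and it assumes each paragraph holds only text and simple inline tags, whose markup is flattened to plain text.

```python
from bs4 import BeautifulSoup, Tag


def br_to_paragraphs(html: str) -> str:
    """Replace each <p> that contains <br> with one sibling <p> per segment."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        if p.find("br", recursive=False) is None:
            continue  # nothing to split in this paragraph
        # Collect the text of each run of children between <br> tags.
        segments = [""]
        for child in p.children:
            if isinstance(child, Tag) and child.name == "br":
                segments.append("")
            elif isinstance(child, Tag):
                segments[-1] += child.get_text()
            else:
                segments[-1] += str(child)
        # Insert one new <p> per non-empty segment, then drop the original.
        for text in segments:
            if text.strip():
                new_p = soup.new_tag("p")
                new_p.string = text.strip()
                p.insert_before(new_p)
        p.decompose()
    return str(soup)


# Hypothetical sample mirroring the markup in the question.
sample = "<p>\n part1\n <br> \n part2\n <br>\n part3\n <br>\n part4\n</p>"
print(br_to_paragraphs(sample))
# <p>part1</p><p>part2</p><p>part3</p><p>part4</p>
```

Compared with the string replacement, this only touches <br> tags that are direct children of a <p>, so a <br> elsewhere in the page is left alone.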
    