Some HTML writers use <br> inside <p>, which makes the parts hard to separate when scraping them with bs4.
Example:
<p>
part1
<br>
part2
<br>
part3
<br>
part4
</p>
How to transform it into "real" paragraphs, like:
<p>
<p>part1</p>
<p>part2</p>
<p>part3</p>
<p>part4</p>
</p>
3 Answers
Recall that <br> is used inside <p>, so here is a string hack: replace each <br> with </p><p>. Note that this is not an "open and closing paragraph" tag but rather "closing then opening".
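The string hack above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's original snippet: the variable names, the regular expression (which also accepts <br/> and <br />), and the choice of the built-in html.parser are assumptions.

```python
import re
from bs4 import BeautifulSoup

html = "<p>\npart1\n<br>\npart2\n<br>\npart3\n<br>\npart4\n</p>"

# "Closing then opening": every <br> ends the current paragraph and starts a new one.
# The pattern also matches <br/> and <br />, an assumption beyond the plain <br> in the question.
fixed = re.sub(r"<br\s*/?>", "</p><p>", html, flags=re.IGNORECASE)

soup = BeautifulSoup(fixed, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs)  # ['part1', 'part2', 'part3', 'part4']
```

The replacement is only safe when every <br> sits directly inside a <p>; a <br> elsewhere in the document would end up with unbalanced tags.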
Now all p tags match each other, though the result is a bit messy with line breaks, which can be fixed by soup.prettify().
You can't put paragraphs inside paragraphs. That also doesn't make any semantic sense. Physical books can't have paragraphs inside paragraphs either:
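Markup with a <p> nested inside another <p> will be rendered by the browser as separate sibling paragraphs, since an opening <p> implicitly closes the paragraph before it: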
See for yourself:
You should not have a <p> tag wrapping other <p> tags. Each <p> (paragraph) tag should be used individually.
Use the following method:
or
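As a minimal sketch of a tree-based alternative to string replacement, the following splits each <p> at its <br> children and emits one sibling <p> per part, without any outer wrapper. The variable names and the use of html.parser are assumptions made for illustration, not taken from the answers above.

```python
from bs4 import BeautifulSoup, Tag

html = "<p>\npart1\n<br>\npart2\n<br>\npart3\n<br>\npart4\n</p>"
soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all("p"):
    if p.find("br") is None:
        continue  # nothing to split in this paragraph

    # Group the children of <p> into chunks separated by <br>.
    chunks, current = [], []
    for child in list(p.children):
        if isinstance(child, Tag) and child.name == "br":
            chunks.append(current)
            current = []
        else:
            current.append(child)
    chunks.append(current)

    # Emit one sibling <p> per non-empty chunk, placed before the original.
    for chunk in chunks:
        new_p = soup.new_tag("p")
        for node in chunk:
            new_p.append(node.extract())
        if new_p.get_text(strip=True):
            p.insert_before(new_p)

    # Only the <br> tags remain in the original paragraph; drop it.
    p.decompose()

result = [p.get_text(strip=True) for p in soup.find_all("p")]
print(result)  # ['part1', 'part2', 'part3', 'part4']
```

Working on the parsed tree keeps any inline markup inside each part (links, emphasis) attached to the right paragraph, at the cost of a few more lines than the string replacement.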