How can I check if a BeautifulSoup Tag is a block-level element (e.g. <p>, <div>, <h2>), or a "phrase content" element like <span>, <strong>?
Basically I want a function that returns True for any Tag that is allowed inside of a <p> tag according to the HTML spec, and False for any Tag that is not allowed inside of a <p> tag.
I’m asking this question because I don’t want to hardcode the list of allowed tags myself, but I can’t find anything in the bs4 or html docs about judging whether a Tag is phrasing content or not.
BeautifulSoup already knows which elements are allowed inside of <p> and which are not:
>>> BeautifulSoup('<p><h2>')
<html><body><p></p><h2></h2></body></html>
>>> BeautifulSoup('<p><em>')
<html><body><p><em></em></p></body></html>
I would also be happy to use Python’s html module if it can give me the answer.
3 Answers
I’m not sure that Beautiful Soup actually knows this.
It’s more like it uses some engine to parse and fix the HTML.
It has the method soup.get_text(), which returns all the text in the HTML.
Maybe this is what you are looking for.
If not, it would help to understand why you need such a function.
You can try this.
I hope this will help you a little.
Since BS doesn’t appear to provide a hard-coded list of elements in the phrasing category, you’ll have to resort to the definition in the HTML standard you’re going to target. For the WHATWG HTML review draft (January 2022), the list of phrasing content is
a, abbr, area, audio, b, bdi, bdo, br, button, canvas, cite, code, data, datalist, del, dfn, em, embed, i, iframe, img, input, ins, kbd, label, link, map, mark, math, meta, meter, noscript, object, output, picture, progress, q, ruby, s, samp, select, slot, small, span, strong, sub, sup, svg, template, textarea, time, u, var, video, wbr, keygen
(but check chapter 3.2.5.2.5 at https://html.spec.whatwg.org/multipage/dom.html#phrasing-content-2 for an up-to-date list).
But: even though the spec says phrasing content is accepted as content of <p> elements, it also says that a <p> element’s end tag can be omitted (i.e. the <p> element is terminated) on any address, article, aside, blockquote, details, dialog, div, dl, fieldset, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, nav, ol, p, pre, section, style, table, ul, or menu element (again, you need to check your target HTML spec; e.g. the new search element isn’t included in this list), which may or may not be relevant to your application.
You can read a bit more on the interpretation of that part of the spec in the context of the older SGML-based HTML specs at https://sgmljs.net/docs/html5.html.