
How can I check if a BeautifulSoup Tag is a block-level element (e.g. <p>, <div>, <h2>), or a "phrase content" element like <span>, <strong>?

Basically, I want a function that returns True for any Tag that is allowed inside a <p> tag according to the HTML spec, and False for any Tag that is not.

I’m asking because I don’t want to hardcode the list of allowed tags myself, but I can’t find anything in the bs4 or html docs about deciding whether a Tag is phrasing content.

BeautifulSoup already knows which elements are allowed inside of <p> and which are not:

>>> BeautifulSoup('<p><h2>')
<html><body><p></p><h2></h2></body></html>
>>> BeautifulSoup('<p><em>')
<html><body><p><em></em></p></body></html>

I would also be happy to use Python’s html module if it can give me the answer.

3 Answers


  1. I’m not sure Beautiful Soup itself knows which elements are allowed inside <p>.
    It’s more likely that the underlying parser fixes up the HTML while parsing it.
    Beautiful Soup does have the method soup.get_text(),
    which returns all the text in the HTML.
    Maybe that is what you are looking for.
    If not, it would help to understand why you need such a function.

  2. You can try this.

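    from bs4 import BeautifulSoup

    def is_phrasing_content(tag_name, parser="html5lib"):
        # Needs a parser that applies HTML nesting rules, such as html5lib
        # (pip install html5lib). html.parser keeps the nesting exactly as
        # written, so every tag would appear to be allowed inside <p>.
        snippet = f"<p><{tag_name}></{tag_name}></p>"
        soup = BeautifulSoup(snippet, parser)

        # The first <p> in the tree is the one opened by the snippet.
        p_tag = soup.find("p")
        if p_tag is None:
            return False

        # If the parser moved the tag out of <p>, it is not phrasing content.
        return p_tag.find(tag_name) is not None

    print(is_phrasing_content("em"))    # True
    print(is_phrasing_content("span"))  # True
    print(is_phrasing_content("div"))   # False
    print(is_phrasing_content("h2"))    # False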
    

    I hope this will help you a little.

  3. Since Beautiful Soup doesn’t appear to provide a hard-coded list of elements in the phrasing category, you’ll have to resort to the definition in the HTML standard you’re going to target. For the WHATWG HTML review draft (January 2022), the list of phrasing content is a, abbr, area, audio, b, bdi, bdo, br, button, canvas, cite, code, data, datalist, del, dfn, em, embed, i, iframe, img, input, ins, kbd, label, link, map, mark, math, meta, meter, noscript, object, output, picture, progress, q, ruby, s, samp, script, select, slot, small, span, strong, sub, sup, svg, template, textarea, time, u, var, video, wbr, keygen (but check chapter 3.2.5.2.5 at https://html.spec.whatwg.org/multipage/dom.html#phrasing-content-2 for an up-to-date list).

    But: even though the spec says phrasing content is accepted as content of <p> elements, it also says that a <p> element’s end tag can be omitted (i.e. the <p> element is terminated implicitly) when it is immediately followed by an address, article, aside, blockquote, details, dialog, div, dl, fieldset, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, nav, ol, p, pre, section, style, table, ul, or menu element (again, you need to check your target HTML spec; e.g. the new search element isn’t included in this list), which may or may not be relevant to your application.

    You can read a bit more on the interpretation of that part of the spec in the context of the older SGML-based HTML specs at https://sgmljs.net/docs/html5.html.
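
As a rough sketch of the hard-coded approach described in this answer, the two lists can be kept as plain Python sets and looked up by tag name. The names PHRASING_CONTENT, CLOSES_P, is_phrasing_content and closes_open_p are made up for illustration; nothing here comes from Beautiful Soup, and the sets are only a snapshot that should be checked against whichever HTML spec you target.

```python
# Snapshot of the phrasing-content list quoted in this answer (WHATWG review
# draft, January 2022). wbr and script are included because the spec lists
# them; keygen is kept from the answer even though it is obsolete.
PHRASING_CONTENT = frozenset("""
    a abbr area audio b bdi bdo br button canvas cite code data datalist del
    dfn em embed i iframe img input ins kbd label link map mark math meta
    meter noscript object output picture progress q ruby s samp script select
    slot small span strong sub sup svg template textarea time u var video wbr
    keygen
""".split())

# Elements that implicitly terminate an open <p>, copied from the second
# list in this answer. Also a snapshot: verify it against your target spec.
CLOSES_P = frozenset("""
    address article aside blockquote details dialog div dl fieldset figure
    footer form h1 h2 h3 h4 h5 h6 header hgroup hr main nav ol p pre section
    style table ul menu
""".split())


def is_phrasing_content(tag_name: str) -> bool:
    """Return True if tag_name is in the hard-coded phrasing snapshot."""
    return tag_name.lower() in PHRASING_CONTENT


def closes_open_p(tag_name: str) -> bool:
    """Return True if a start tag with this name ends an open <p>."""
    return tag_name.lower() in CLOSES_P


print(is_phrasing_content("em"))   # True
print(is_phrasing_content("div"))  # False
print(closes_open_p("h2"))         # True
```

With a Beautiful Soup Tag, the lookup would be is_phrasing_content(tag.name). Some entries (for example area, link and meta) are phrasing content only under extra conditions in the spec, so a pure name lookup is an approximation.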
