Html - Detect if Tag is a block-level element?

Nils
January 7, 2025
199 views
3 votes
3 Answers

How can I check whether a BeautifulSoup Tag is a block-level element (e.g. <p>, <div>, <h2>) or a "phrasing content" element like <span> or <strong>?

Basically, I want a function that returns True for any Tag that is allowed inside a <p> tag according to the HTML spec, and False for any Tag that is not.

I’m asking this question because I don’t want to hardcode the list of allowed tags myself, but I can’t find anything from bs4 or html docs about judging whether a Tag is phrasing content or not.

BeautifulSoup already knows which elements are allowed inside of <p> and which are not:

>>> BeautifulSoup('<p><h2>')
<html><body><p></p><h2></h2></body></html>
>>> BeautifulSoup('<p><em>')
<html><body><p><em></em></p></body></html>

I would also be happy to use Python’s html module if it can give me the answer.

Answers

- PavloKvas
- January 7, 2025 at 9:03 pm
- 0 votes
I'm not sure Beautiful Soup knows this in the way you describe. It's more that it relies on an underlying parser to parse and repair the HTML.
It does have the method soup.get_text(), which returns all the text in the document; maybe that is what you are looking for.
If not, it would help to understand why you need such a function.


You can try this.

from bs4 import BeautifulSoup

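# The parser must restructure invalid nesting for this check to work.
# "html.parser" keeps <div> inside <p> exactly as written, so every tag would
# pass; "html5lib" (pip install html5lib) follows the HTML parsing algorithm.
def is_phrasing_content(tag_name, parser="html5lib"):
    # Nest the candidate inside <p> and see whether the parser leaves it there.
    snippet = f"<p><{tag_name}></{tag_name}></p>"
    soup = BeautifulSoup(snippet, parser)

    p_tag = soup.find("p")
    if p_tag is None:
        return False

    # If the parser closed <p> early, the candidate is no longer a descendant.
    return p_tag.find(tag_name) is not None

print(is_phrasing_content("em"))    # True
print(is_phrasing_content("span"))  # True
print(is_phrasing_content("div"))   # False
print(is_phrasing_content("h2"))    # False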

I hope this will help you a little.

- imhotap
- January 7, 2025 at 9:57 pm
- 0 votes
Since BS doesn't appear to provide a hard-coded list of elements in the phrasing category, you'll have to resort to the definition in the HTML standard you're going to target. For the WHATWG HTML review draft (January 2022), the list of phrasing content is a, abbr, area, audio, b, bdi, bdo, br, button, canvas, cite, code, data, datalist, del, dfn, em, embed, i, iframe, img, input, ins, kbd, label, link, map, mark, math, meta, meter, noscript, object, output, picture, progress, q, ruby, s, samp, select, slot, small, span, strong, sub, sup, svg, template, textarea, time, u, var, video, wbr, keygen (but check chapter 3.2.5.2.5 at https://html.spec.whatwg.org/multipage/dom.html#phrasing-content-2 for an up-to-date list).
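If you do end up hard-coding the list, a minimal sketch could look like the following. The names PHRASING_CONTENT and is_phrasing_content are illustrative, not part of bs4, and the set is transcribed from the list in this answer rather than generated from the spec, so it needs checking against whichever HTML version you target.

```python
# Element names transcribed from the list in this answer (wbr is the
# line-break-opportunity element). keygen is kept because the answer lists it.
# Assumption beyond the quoted list: "script" is added, since the spec also
# classifies it as phrasing content.
# Caveat: area, link and meta only count as phrasing content under extra
# conditions in the spec; this sketch ignores those conditions.
PHRASING_CONTENT = frozenset("""
    a abbr area audio b bdi bdo br button canvas cite code data datalist del
    dfn em embed i iframe img input ins kbd keygen label link map mark math
    meta meter noscript object output picture progress q ruby s samp script
    select slot small span strong sub sup svg template textarea time u var
    video wbr
""".split())


def is_phrasing_content(tag_name):
    """Return True if tag_name is in the hard-coded phrasing-content set."""
    return tag_name.lower() in PHRASING_CONTENT


if __name__ == "__main__":
    for name in ("em", "span", "div", "h2"):
        print(name, is_phrasing_content(name))
```

With a BeautifulSoup Tag you would pass its name, e.g. is_phrasing_content(tag.name).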

But: even though the spec says phrasing content is accepted as content of <p> elements, it also says that a <p> element's end tag can be omitted (i.e. the <p> element is implicitly terminated) when it is immediately followed by any address, article, aside, blockquote, details, dialog, div, dl, fieldset, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, nav, ol, p, pre, section, style, table, ul, or menu element (again, you need to check your target HTML spec; e.g. the new search element isn't included in this list), which may or may not be relevant to your application.
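The complementary check, whether a start tag implicitly ends an open <p>, can be sketched the same way. P_CLOSING_TAGS and closes_open_p are hypothetical names, and the set simply copies the list in this answer, so it inherits whatever that list gets wrong or leaves out for your target spec.

```python
# Start tags that, per the list in this answer, allow the preceding <p>'s end
# tag to be omitted. Copied as written there; verify against your target spec
# (the answer notes that the newer search element is not included).
P_CLOSING_TAGS = frozenset("""
    address article aside blockquote details dialog div dl fieldset figure
    footer form h1 h2 h3 h4 h5 h6 header hgroup hr main nav ol p pre section
    style table ul menu
""".split())


def closes_open_p(tag_name):
    """Return True if a start tag with this name implicitly ends an open <p>."""
    return tag_name.lower() in P_CLOSING_TAGS


if __name__ == "__main__":
    for name in ("div", "h2", "em", "span"):
        print(name, closes_open_p(name))
```

Note that the two sets are not complements of each other: an element such as li or td is in neither, so "not phrasing content" and "terminates a <p>" are separate questions.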

You can read a bit more on the interpretation of that part of the spec in the context of the older SGML-based HTML specs at https://sgmljs.net/docs/html5.html.
