
I have this html:

<html lang="en" class="no-js">
    <div>
        <p class="price ">
            3.75
        </p>
        <p>21</p>
    </div>
</html>

I want to get the class of the first <p> element exactly as written, including the trailing space.

The problem is that whatever I try, the value always comes back without the space.

current_element.get('class')…

Even str(current_element) comes out like this:

'<p class="price">3.75</p>'

How can I get the raw text of the class attribute, or something like that?
Running a regex over the whole HTML is not an option because I can have HTML files with 11k lines or more.

Thanks!

2 Answers


  1. Class names in HTML cannot contain spaces. Spaces in the class attribute separate class names when more than one is assigned to an element. Here, the trailing space with no further class after it is treated as a single class assignment.

    Any HTML parser, whether a browser or a library, must interpret it that way. Because the space isn't part of the name, APIs that return the parsed list of classes (such as Beautiful Soup's default handling or the DOM's classList) won't include it. This is expected behavior.

    If you really want that space, you need to parse the HTML by other means, for example with a library that does not apply HTML's class semantics and therefore doesn't interpret the attribute.

  2. If you pass the keyword argument multi_valued_attributes=None to the BeautifulSoup constructor, you will get the class string with the space.
    (Source: https://beautiful-soup-4.readthedocs.io/en/latest/#multi-valued-attributes )

    You will, however, lose the ability to access multi-valued attributes (such as class) as lists:

    from bs4 import BeautifulSoup
    html = """<html lang="en" class="no-js">
        <div>
            <p class="price ">
                3.75
            </p>
            <p>21</p>
        </div>
    </html>"""
    
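    # Name the parser explicitly to avoid the "no parser specified" warning;
    # multi_valued_attributes=None keeps class as one raw string, not a list.
    soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)
    soup.html.div.p["class"]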
    

    Result:

    'price '
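
For the "other means of parsing" route mentioned in the first answer, one option is Python's standard-library html.parser, which hands attribute values to the handler unquoted but otherwise as written. This is only a minimal sketch, not either answerer's method; the class name RawClassCollector and its raw_classes list are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class RawClassCollector(HTMLParser):
    """Hypothetical helper: records each class attribute exactly as written."""

    def __init__(self):
        super().__init__()
        self.raw_classes = []  # (tag, raw class string) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value has its quotes
        # removed but is not split on whitespace or stripped.
        for name, value in attrs:
            if name == "class":
                self.raw_classes.append((tag, value))


html = """<html lang="en" class="no-js">
    <div>
        <p class="price ">
            3.75
        </p>
        <p>21</p>
    </div>
</html>"""

collector = RawClassCollector()
collector.feed(html)
collector.close()
print(collector.raw_classes)  # [('html', 'no-js'), ('p', 'price ')]
```

Because the parser streams through the document once, this should stay practical for files with many thousands of lines. It gives you no tree to navigate, though, so it fits best when you only need the raw attribute values.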
    