
Is it possible to remove all text from HTML nodes with a regex? This very simple case seems to work just fine:

import re

import htmlmin

html = """
<li class="menu-item">
  <p class="menu-item__heading">Totopos</p>
  <p>Chips and molcajete salsa</p>
  <p class="menu-item__details menu-item__details--price">
    <strong>
      <span class="menu-item__currency"> $ </span>
      4
    </strong>
  </p>
</li>
"""

print(re.sub(">(.*?)<", ">1<", htmlmin.minify(html)))

I tried to use BeautifulSoup but cannot figure out how to make it work. The following code is not quite correct, since it leaves "4" in as text.

from bs4 import BeautifulSoup
from htmlmin import minify

soup = BeautifulSoup(html, "html.parser")
for n in soup.find_all(recursive=True):
    print(n.name, n.string)
    if n.string:
        n.string = ""
print(minify(str(soup)))

2 Answers


  1. Try passing text=True to find_all (the same filter is spelled string=True in Beautiful Soup 4.4.0 and later) and call extract() on each element it returns to remove it:

    from bs4 import BeautifulSoup
    
    html = '''
    <li class="menu-item">
      <p class="menu-item__heading">Totopos</p>
      <p>Chips and molcajete salsa</p>
      <p class="menu-item__details menu-item__details--price">
        <strong>
          <span class="menu-item__currency"> $ </span>
          4
        </strong>
      </p>
    </li>
    '''
    
    soup = BeautifulSoup(html, 'html.parser')
    for element in soup.find_all(text=True):
        element.extract()
    
    print(soup.prettify())
    

    In this case the output will be:

    <li class="menu-item">
     <p class="menu-item__heading">
     </p>
     <p>
     </p>
     <p class="menu-item__details menu-item__details--price">
      <strong>
       <span class="menu-item__currency">
       </span>
      </strong>
     </p>
    </li>
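
For context on why the loop in the question leaves "4" behind: `.string` is `None` for any tag that has more than one child, so `<strong>` (which holds a `<span>` plus the bare text "4") is skipped. The sketch below illustrates that and wraps the approach from this answer in a function. It assumes Beautiful Soup 4.4.0 or later, where the filter is spelled `string=True`; `strip_text` and the shortened sample markup are hypothetical names and data chosen for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def strip_text(html: str) -> str:
    """Return html with every text node removed, keeping tags and attributes."""
    soup = BeautifulSoup(html, "html.parser")
    # string=True matches the text nodes themselves, wherever they sit,
    # so it does not depend on a tag having exactly one child.
    for node in soup.find_all(string=True):
        node.extract()
    return str(soup)


# Shortened, hypothetical version of the markup from the question.
html = '<p class="price"><strong><span class="cur"> $ </span> 4 </strong></p>'

# <strong> has two children (the <span> and the text " 4 "), so .string is None.
print(BeautifulSoup(html, "html.parser").strong.string)

print(strip_text(html))
```

Returning `str(soup)` rather than `soup.prettify()` avoids the extra whitespace that `prettify()` adds, which may be preferable if the result is minified afterwards.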
    
  2. Manipulating HTML with regular expressions is almost never the best idea, but this regex should do the trick for you:

    print(re.sub(r">[^<]+<", "><", htmlmin.minify(html)))
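
To make the caveat about regular expressions concrete, here is a small self-contained sketch of the same pattern using only the standard library. `strip_text_regex` and both sample strings are hypothetical, chosen for illustration. Because `[^<]` also matches newlines, the pattern does not strictly need the `htmlmin.minify` step to span multi-line text.

```python
import re


def strip_text_regex(html: str) -> str:
    """Drop every run of non-'<' characters that sits between '>' and '<'."""
    return re.sub(r">[^<]+<", "><", html)


# Simple markup: the text between tags is removed as intended.
sample = '<p class="a">Hi</p><b> x </b>'
print(strip_text_regex(sample))

# Weak spot: a '>' inside an attribute value is mistaken for the end of a tag,
# so part of the attribute is swallowed along with the text.
tricky = '<a title="x > y">link</a>'
print(strip_text_regex(tricky))
```

The second call shows the kind of input where a parser-based approach is likely the safer choice.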
    