I think I basically want the reverse of what the prettify() function does.

Given an HTML excerpt like:

      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>

How can one get the text without the line breaks and indentation, while looping recursively over the tree so that nested tags are covered too?

The result after parsing and processing should be something like:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.

Next paragraph with more text.

Also, for further processing, it would be good to get the content of italic tags separately in Python.

That means (simplified; in reality, I want to call pylatex functions to compose a document):

result = ""
for child in soup.children:
    for subchild in child.children:
        # Some processing
        result += subchild.string

Most of this is not that complicated, but how can one deal correctly with line breaks and spaces for the nested text?

The browser seems to render this correctly.

If this is not possible with BeautifulSoup, another Python library that does it is also fine.

I was surprised that BeautifulSoup does not handle this by default, and I could not find a function that does what I want.

2 Answers


  1. You can use .get_text() with strip=True and a suitable separator= argument:

    import re
    
    from bs4 import BeautifulSoup
    
    html_text = """
          <p>
            Test text with something in it
            Test text with something in it
            <i>and italic text</i> inside that text.
            Test text with something in it.
          </p>
          <p>
            Next paragraph with more text.
          </p>
    """
    
    soup = BeautifulSoup(html_text, "html.parser")
    
    
    def my_get_text(tag):
        t = tag.get_text(strip=True, separator=" ")
        return re.sub(r"\s{2,}", " ", t)
    
    
    # replace all <i></i> with \textit{ ... }
    # (raw string, so that "\t" is not read as a tab character)
    for i in soup.select("i"):
        i.replace_with(r"\textit{{{}}}".format(i.text))
    
    for p in soup.select("p"):
        print(my_get_text(p))
    

    Prints:

    Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
    Next paragraph with more text.
    

    EDIT: Using recursion (the match statement needs Python 3.10 or newer):

    import re
    
    from bs4 import BeautifulSoup, NavigableString, Tag
    
    html_text = """
          <p>
            Test text with something in it
            Test text with something in it
            <i>and italic text</i> inside that text.
            Test text with something in it.
          </p>
          <p>
            Next paragraph with more text.
          </p>
    """
    
    
    soup = BeautifulSoup(html_text, "html.parser")
    
    
    def my_get_text(tag):
        t = tag.get_text(strip=True, separator=" ")
        return re.sub(r"\s{2,}", " ", t)
    
    
    def get_text(tag):
        s = []
        for c in tag.contents:
            match c:
                case NavigableString():
                    if c := my_get_text(c):
                        s.append(c)
                case Tag() if c.name == "p":
                    yield from get_text(c)
                case Tag() if c.name == "i":
                    s.append(r"\textit{{{}}}".format(c.text))
        if s:
            yield s
    
    
    for t in get_text(soup):
        print(" ".join(t))
    

    Prints:

    Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
    Next paragraph with more text.
    
  2. You can use lxml to do it. Compared with BeautifulSoup, it gives you more freedom in some respects:

    from lxml import etree
    import textwrap
    
    html_str = """
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
    """
    root = etree.HTML(html_str)
    
    # remove indent from paragraph text, and strip
    for elem in root.iterdescendants():
        if text:=elem.text:
            elem.text = textwrap.dedent(text).strip()
        if tail:=elem.tail:
            elem.tail = textwrap.dedent(tail).strip()
    
    # customize special element processing logic
    handle_tag_dict = {
        "i": lambda x: r"\textit{%s}" % x
    }
    # Tags that do not require additional line breaks
    not_lb_tags = ["i"]
    
    result = ""
    for elem in root.iterdescendants():
        tag = elem.tag
        if result and tag not in not_lb_tags:
            result += "\n"
    
        if text:=elem.text:
            if tag in not_lb_tags:
                result += " "
            # 1. replace("\n", " ")
            # 2. collapse runs of spaces into a single space and strip both ends
            text = " ".join(filter(None, text.replace("\n", " ").split(" ")))
            if tag in handle_tag_dict:
                text = handle_tag_dict[tag](text)
            result += text
    
        if tail:=elem.tail:
            result += " "
            tail = " ".join(filter(None, tail.replace("\n", " ").split(" ")))
            result += tail
    
    print(result)
    

    Prints:

    Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
    Next paragraph with more text.
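
The question also asks for the content of italic tags separately, for further processing. As a minimal sketch building on the BeautifulSoup approach above, the walk below yields (is_italic, text) pairs instead of a finished string. The helper names collapse, segments and paragraph_to_latex are made up for this illustration and are not part of BeautifulSoup. It assumes that a space between neighbouring text pieces is always acceptable, so `<i>word</i>.` would come out as `\textit{word} .`, the same limitation the answers above have.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

HTML = """
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""


def collapse(text):
    # str.split() without arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace.
    return " ".join(text.split())


def segments(node, italic=False):
    """Recursively yield (is_italic, text) for each non-empty text node."""
    for child in node.children:
        if isinstance(child, NavigableString):
            text = collapse(child)
            if text:
                yield italic, text
        elif isinstance(child, Tag):
            # Text nested anywhere inside an <i> stays marked as italic.
            yield from segments(child, italic or child.name == "i")


def paragraph_to_latex(p):
    # Hypothetical formatting step; replace it with your own processing.
    parts = []
    for is_italic, text in segments(p):
        parts.append(r"\textit{%s}" % text if is_italic else text)
    return " ".join(parts)


soup = BeautifulSoup(HTML, "html.parser")
paragraphs = [paragraph_to_latex(p) for p in soup.find_all("p")]
print("\n\n".join(paragraphs))
```

For the example HTML this prints the same two paragraphs as the answers above, separated by a blank line. To build a document with pylatex instead, the loop in paragraph_to_latex is the place to pass each (is_italic, text) pair to whatever call composes the output.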
    