How do I make BeautifulSoup ignore any indents in original HTML when getting text

clel
September 27, 2023
2 Answers

I think I basically want the reverse of what the prettify() function does.

When one has HTML code (excerpt) like:

      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>

How can one get the text inside without the line breaks and indentation, while looping recursively over the tree so that nested tags are covered as well?

The result after parsing and processing should be something like:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.

Next paragraph with more text.

Also, for further processing, it would be good to get the content of italic tags separately in Python.

That means (simplified; in reality, I want to call pylatex functions to compose a document):

result = ""
for child in soup.children:
    for subchild in child.children:
        # Some processing
        result += subchild.string

Most of this is not that complicated, but how can one deal correctly with line breaks and spaces for the nested text?

The browser seems to render this correctly.

If not possible with BeautifulSoup, another Python library doing this is also fine.

I was quite shocked that this isn’t dealt with by default in BeautifulSoup and I also didn’t find any function doing what I want.

Answers

You can use .get_text() with strip=True and a suitable separator= argument:

import re

from bs4 import BeautifulSoup

html_text = """
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""

soup = BeautifulSoup(html_text, "html.parser")


def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    # collapse any run of two or more whitespace characters into one space
    return re.sub(r"\s{2,}", " ", t)


# replace all <i></i> with \textit{ ... }
# (raw string, so that "\t" is not interpreted as a tab character)
for i in soup.select("i"):
    i.replace_with(r"\textit{{{}}}".format(i.text))

for p in soup.select("p"):
    print(my_get_text(p))

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
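
A pattern like \s{2,} only collapses runs of two or more whitespace characters, so a single newline or tab between two words would be left in place. If that matters for your input, one possible alternative is to split on any whitespace and rejoin. This is only a sketch, and collapse_ws is a made-up helper name, not part of BeautifulSoup:

```python
def collapse_ws(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " normalizes the text.
    return " ".join(text.split())


sample = "Test text\n        with a\ttab and  double spaces. "
print(collapse_ws(sample))
```

It could be used in place of the re.sub() call, applied to whatever .get_text() returns.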

EDIT: Using recursion:

import re

from bs4 import BeautifulSoup, NavigableString, Tag

html_text = """
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""


soup = BeautifulSoup(html_text, "html.parser")


def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    # collapse any run of two or more whitespace characters into one space
    return re.sub(r"\s{2,}", " ", t)


def get_text(tag):
    s = []
    for c in tag.contents:
        match c:
            case NavigableString():
                if c := my_get_text(c):
                    s.append(c)
            case Tag() if c.name == "p":
                yield from get_text(c)
            case Tag() if c.name == "i":
                # raw string, so that "\t" is not interpreted as a tab
                s.append(r"\textit{{{}}}".format(c.text))
    if s:
        yield s


for t in get_text(soup):
    print(" ".join(t))

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
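
If you would rather get the italic content separately (for example to pass it to pylatex yourself) instead of formatting it inline, something along these lines might work. It is a minimal sketch: iter_segments and the "text"/"italic" labels are names I made up, it only looks at direct children, and it does not record whether there was whitespace between two neighbouring segments:

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def iter_segments(tag):
    """Yield (kind, text) pairs for the direct children of `tag`.

    kind is "text" for plain strings and "italic" for <i> elements.
    Whitespace runs (newlines, indents) are collapsed to a single space.
    """
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = " ".join(str(child).split())
            if text:  # skip strings that were whitespace only
                yield ("text", text)
        elif isinstance(child, Tag) and child.name == "i":
            yield ("italic", " ".join(child.get_text().split()))


html_text = """
      <p>
        Test text with <i>italic
        text</i> inside.
      </p>
"""
soup = BeautifulSoup(html_text, "html.parser")
segments = list(iter_segments(soup.p))
print(segments)
```

Each pair can then be dispatched to whatever function should handle plain or italic text.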

You can also do this with lxml, which gives you more freedom than BeautifulSoup in some respects:

from lxml import etree
import textwrap

html_str = """
  <p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
  </p>
  <p>
    Next paragraph with more text.
  </p>
"""
root = etree.HTML(html_str)

# remove indent from paragraph text, and strip
for elem in root.iterdescendants():
    if text:=elem.text:
        elem.text = textwrap.dedent(text).strip()
    if tail:=elem.tail:
        elem.tail = textwrap.dedent(tail).strip()

# customize special element processing logic
handle_tag_dict = {
    # raw string, so that "\t" is not interpreted as a tab character
    "i": lambda x: r"\textit{%s}" % x
}
# Tags that do not require additional line breaks
not_lb_tags = ["i"]

result = ""
for elem in root.iterdescendants():
    tag = elem.tag
    if result and tag not in not_lb_tags:
        result += "\n"

    if text:=elem.text:
        if tag in not_lb_tags:
            result += " "
        # 1. replace "\n" with " "
        # 2. collapse runs of spaces into a single space and trim both ends
        text = " ".join(filter(None, text.replace("\n", " ").split(" ")))
        if tag in handle_tag_dict:
            text = handle_tag_dict[tag](text)
        result += text

    if tail:=elem.tail:
        result += " "
        tail = " ".join(filter(None, tail.replace("\n", " ").split(" ")))
        result += tail

print(result)

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
