I think I basically want the reverse of what the prettify() function does.
When one has HTML code (excerpt) like:
<p>
Test text with something in it
Test text with something in it
<i>and italic text</i> inside that text.
Test text with something in it.
</p>
<p>
Next paragraph with more text.
</p>
How can one get the text inside without the line breaks and indentation, while looping recursively over the tree so that nested tags are covered as well?
The result after parsing and processing should be something like:
Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
Also, for further processing, it would be good to get the content of italic tags separately in Python.
That means (simplified; in reality, I want to call pylatex
functions to compose a document):
result = ""
for child in soup.children:
    for subchild in child.children:
        # Some processing
        result += subchild.string
Most of this is not that complicated, but how can one deal correctly with line breaks and spaces for the nested text?
The browser seems to render this correctly.
If not possible with BeautifulSoup, another Python library doing this is also fine.
I was quite shocked that this isn’t dealt with by default in BeautifulSoup and I also didn’t find any function doing what I want.
2 Answers
You can use .get_text() (with strip=True and the correct separator= parameter):
Prints:
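A minimal sketch of the .get_text() call described above, assuming the HTML from the question. The names HTML_DOC and paragraph_texts are made up for illustration, and the extra split/join step is an addition of this sketch: strip=True only trims the ends of each text node, so newlines and indentation inside a node would otherwise survive.

```python
from bs4 import BeautifulSoup

# Sample input taken from the question (indentation added for illustration).
HTML_DOC = """
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""


def paragraph_texts(html):
    """Return the text of every <p>, with whitespace runs collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.find_all("p"):
        # strip=True trims each text node; separator=" " joins the nodes.
        raw = p.get_text(separator=" ", strip=True)
        # Collapse the newlines and indentation left inside a text node.
        texts.append(" ".join(raw.split()))
    return texts


for line in paragraph_texts(HTML_DOC):
    print(line)
# Test text with something in it Test text with something in it and italic text inside that text. Test text with something in it.
# Next paragraph with more text.
```

One caveat with this approach: because a separator is placed between every pair of text nodes, markup such as `<i>word</i>.` would come out as "word ." with a space before the period.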
EDIT: Using recursion:
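A recursive walk could look roughly like the following. This is only a sketch, not the answer's original code: the helper names to_latex and paragraphs_to_latex are hypothetical, and the \textit{} wrapping is plain string formatting standing in for whatever pylatex calls the real document would use.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample input taken from the question (indentation added for illustration).
HTML_DOC = """
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""


def to_latex(node):
    """Recursively render a node as text, wrapping <i> content in \\textit{}."""
    if isinstance(node, NavigableString):
        # Collapse each whitespace run (newlines, indentation) to one space.
        return re.sub(r"\s+", " ", str(node))
    # Tag: render all children first, then decide how to wrap the result.
    inner = "".join(to_latex(child) for child in node.children)
    if node.name == "i":
        return r"\textit{" + inner.strip() + "}"
    return inner


def paragraphs_to_latex(html):
    """Return one rendered string per <p> element."""
    soup = BeautifulSoup(html, "html.parser")
    # Squeeze doubled spaces that can appear where two text nodes meet.
    return [
        re.sub(r" {2,}", " ", to_latex(p)).strip()
        for p in soup.find_all("p")
    ]


for line in paragraphs_to_latex(HTML_DOC):
    print(line)
```

Because every tag passes through to_latex, the `node.name == "i"` branch is the place where the italic content is available on its own for further processing.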
Prints:
You can use lxml to do it. Compared with BeautifulSoup, it is more flexible in some respects:
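One way this might look with lxml.html, as a sketch under the same assumptions as the question's sample HTML. The function name paragraphs_with_italics is hypothetical; it returns the flattened paragraph text together with the content of any <i> elements, since the question asks for the italic parts separately.

```python
import lxml.html

# Sample input taken from the question (indentation added for illustration).
HTML_DOC = """
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""


def paragraphs_with_italics(html):
    """Return (flattened_text, [italic_texts]) for every <p> element."""
    # Wrap the fragment in a <div> so that several top-level <p>s parse fine.
    root = lxml.html.fragment_fromstring(html, create_parent="div")
    result = []
    for p in root.iter("p"):
        # text_content() concatenates all nested text; split/join collapses
        # the newlines and indentation into single spaces.
        text = " ".join(p.text_content().split())
        italics = [" ".join(i.text_content().split()) for i in p.iter("i")]
        result.append((text, italics))
    return result


for text, italics in paragraphs_with_italics(HTML_DOC):
    print(text, italics)
```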
Prints: