How to Replace XML Special Characters From Text Within HTML Tags Using Python

cjt
November 1, 2024
2 Answers

I am quite new to Python. I’ve been working on a web-scraping project that extracts data from various web pages, constructs a new HTML page from that data, and sends the page to a document management system.

The document management system uses an XML-based parser to validate the HTML. It rejects the page if unescaped XML special characters appear in the text within HTML tags. For example:

<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>

will get rejected because of the & and the >.

I considered calling str.replace() on the HTML document before sending it, but that is not targeted enough: I don’t want to remove valid occurrences of characters like & and >, such as when they form part of a tag or an attribute.

Could someone please suggest a way to replace the XML special characters with, for example, their English word equivalents (e.g. & -> and)?

Any help you can provide would be much appreciated.

Answers

- tdelaney
- November 1, 2024 at 7:07 pm
BeautifulSoup parses unruly HTML and writes it back out as well-formed HTML. Along the way it escapes bare characters such as & and > in text as entity references, so you can use it to fix markup like this.
```
from bs4 import BeautifulSoup

doc = """<body>
<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>
</body>"""

soup = BeautifulSoup(doc, features="lxml")
print(soup.prettify())
```
Outputs
```
<html>
 <body>
  <p>
   The price of apples &amp; oranges in New York is &gt; the price of apples and oranges in Chicago
  </p>
 </body>
</html>
```
Note that HTML itself is not necessarily XML compliant and there may be other reasons why an HTML document would not pass an XML validator.
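To follow up on that note, one rough way to see whether a fragment would get past an XML parser is to feed it to the standard library's `xml.etree.ElementTree`. This is only a sketch: the helper name `is_well_formed_xml` is made up for illustration, and the document management system's validator may apply stricter or different rules than the stdlib parser does.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the stdlib XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


raw = "<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>"
escaped = "<p>The price of apples &amp; oranges in New York is &gt; the price of apples and oranges in Chicago</p>"

print(is_well_formed_xml(raw))      # False: a bare & is not allowed in XML text
print(is_well_formed_xml(escaped))  # True
# Valid HTML can still fail: <br> without a closing tag is not well-formed XML
print(is_well_formed_xml("<p>line one<br>line two</p>"))  # False
```

Running a check like this before sending the page makes it easier to tell whether a rejection comes from unescaped characters or from some other HTML-versus-XML difference.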

I suggest using an HTML parser, such as the one in lxml:

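```
from lxml import etree

# Sample markup with a bare & and > in the text
xml_ = """<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>"""

# lxml's HTML parser tolerates the unescaped characters
parser = etree.HTMLParser()
tree = etree.fromstring(xml_, parser)

# Serializing the tree writes them back out as &amp; and &gt;
etree.dump(tree)
```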

Output:

```
<html>
  <body>
    <p>The price of apples &amp; oranges in New York is &gt; the price of apples and oranges in Chicago</p>
  </body>
</html>
```
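If you would rather have English words than entities, as the question asks, the same idea of touching only the text between tags can be sketched with the standard library's `html.parser`. This is a swapped-in technique rather than anything lxml-specific, and the names `WORDS`, `TextRewriter`, and `replace_specials` are hypothetical. It assumes simple markup: comments and doctype declarations are dropped, and the contents of `<script>` or `<style>` elements would be rewritten too.

```python
from html.parser import HTMLParser

# Hypothetical mapping chosen for illustration; adjust the wording as needed.
WORDS = {"&": "and", "<": "less than", ">": "greater than"}


class TextRewriter(HTMLParser):
    """Copy tags through unchanged and rewrite only the text between them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the input
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as well
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Entities are already decoded here, so &amp; arrives as a plain &
        for char, word in WORDS.items():
            data = data.replace(char, word)
        self.parts.append(data)


def replace_specials(markup: str) -> str:
    rewriter = TextRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.parts)


doc = "<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>"
print(replace_specials(doc))
# <p>The price of apples and oranges in New York is greater than the price of apples and oranges in Chicago</p>
```

Because attributes are copied verbatim from the start tag, a link such as `<a href="?a=1&amp;b=2">` keeps its escaped ampersand while the visible text is rewritten.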
