I am quite new to Python. I’ve been working on a web-scraping project that extracts data from various web pages, constructs a new HTML page using the data, and sends the page to a document management system
The document management system has some XML-based parser for validating the HTML. It will reject it if XML special characters appear in text within HTML tags. For example:
<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>
will get rejected because of the &
and the >
.
I considered using String.replace()
on the HTML doc before sending it, but it is not broad enough, and I don’t want to remove valid occurrences of characters like &
and >
, such as when they form part of a tag or an attribute
Could someone please suggest a solution to replacing the XML special characters with, for example, their english word equivalents (eg: & -> and)?
Any help you can provide would be much appreciated
2
Answers
BeautifulSoup tames unruly HTML and presents it as unbroken HTML. You can use it to fix references like this.
Outputs
Note that HTML itself is not necessarily XML compliant and there may be other reasons why an HTML document would not pass an XML validator.
I suggest to use a html parser, like from lxml:
output: