I am quite new to Python. I've been working on a web-scraping project that extracts data from various web pages, constructs a new HTML page from that data, and sends the page to a document management system.

The document management system uses an XML-based parser to validate the HTML. It rejects the page if XML special characters appear unescaped in the text between HTML tags. For example:

<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>

will get rejected because of the & and the >.
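The rejection can probably be reproduced locally with Python's built-in XML parser. This is only a minimal sketch, assuming the document management system behaves roughly like a strict XML parser; `is_well_formed` is a hypothetical helper name. Note that XML itself only forbids a bare & (and <) in text, so a system that also rejects a bare > is stricter than the standard parser used here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if Python's strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# A bare & in text is not well-formed XML.
print(is_well_formed("<p>apples & oranges</p>"))      # False
# The escaped form is accepted.
print(is_well_formed("<p>apples &amp; oranges</p>"))  # True
# Plain XML permits a bare > in text; the validator described here may not.
print(is_well_formed("<p>New York > Chicago</p>"))    # True
```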

I considered calling str.replace() on the HTML document before sending it, but that is too blunt: I don't want to touch valid occurrences of characters like & and >, such as when they form part of a tag or an attribute.

Could someone please suggest a way to replace the XML special characters with, for example, their English word equivalents (e.g. & -> and)?

Any help you can provide would be much appreciated.

2 Answers


  1. BeautifulSoup parses messy HTML and serializes it back as well-formed markup, escaping special characters in text along the way. You can use it to fix unescaped characters like these.

    from bs4 import BeautifulSoup
    
    doc = """<body>
    <p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>
    </body>"""
    
    soup = BeautifulSoup(doc, features="lxml")
    print(soup.prettify())
    

    Outputs

    <html>
     <body>
      <p>
       The price of apples &amp; oranges in New York is &gt; the price of apples and oranges in Chicago
      </p>
     </body>
    </html>
    

    Note that HTML itself is not necessarily XML compliant and there may be other reasons why an HTML document would not pass an XML validator.

  2. I suggest using an HTML parser, such as the one in lxml:

    from lxml import etree
    
    html_doc = """<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>"""
    # The HTML parser tolerates the bare & and > in the text
    parser = etree.HTMLParser()
    tree = etree.fromstring(html_doc, parser)
    
    # Serializing the tree writes them back as &amp; and &gt;
    etree.dump(tree)
    

    output:

    <html>
      <body>
        <p>The price of apples &amp; oranges in New York is &gt; the price of apples and oranges in Chicago</p>
      </body>
    </html>
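
If neither BeautifulSoup nor lxml can be installed, the same idea (parse first, then escape only the text) can be sketched with the standard library. This is a rough illustration rather than a complete solution: `TextEscaper` and `escape_text_nodes` are hypothetical names, comments and doctype declarations are dropped, and unclosed void tags such as `<br>` are left unclosed, so the result may still fail XML validation for other reasons.

```python
from html import escape
from html.parser import HTMLParser


class TextEscaper(HTMLParser):
    """Rebuild the markup, escaping &, < and > in text and attribute values."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes references such as
        # &amp; before handle_data sees them, so nothing is escaped twice.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        rendered = ""
        for name, value in attrs:
            # Attribute values arrive already decoded; re-escape them.
            rendered += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text between tags: escape &, < and >, but leave quotes alone.
        self.parts.append(escape(data, quote=False))


def escape_text_nodes(markup: str) -> str:
    """Return the markup with special characters in text nodes escaped."""
    parser = TextEscaper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


doc = "<p>The price of apples & oranges in New York is > the price of apples and oranges in Chicago</p>"
print(escape_text_nodes(doc))
```

Because the tags are rebuilt from parsed events instead of being searched with str.replace(), the < and > that delimit tags are never touched. To get word equivalents instead of entities, `handle_data` could substitute words (for example " and " for "&") before appending.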
    