Can Python parse XML within HTML?

ShChawla
May 11, 2023
154 views
3 votes
4 Answers

I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

<!DOCTYPE html>
<html>
<head>
    ***
</head>
<body>
    <div class="panel panel-primary call__report-modal-panel">
        <div class="panel-heading text-center custom-panel-heading">
            <h2>Report</h2>
        </div>
        <div class="panel-body">
            <div class="panel panel-default">
                <div class="panel-heading">
                    <div class="panel-title">Info</div>
                </div>
                <div class="panel-body">
                    <table class="table table-bordered table-page-break-auto table-layout-fixed">
                        <tr>
                            <td class="col-sm-4">ID</td>
                            <td class="col-sm-8">1</td>
                        </tr>

            </table>
        </div>
    </div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>

</ROOTTAG>
-->

The requirement is to parse the XML that sits inside the comment in the HTML above.
So far I have read the HTML file into a string and done the following:

with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"

This returns the XML piece of data in the string d3, wrapped in three single quotes.

Then I try to read it via ElementTree:

ET.fromstring(d3)

but it fails with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need some help to basically:

Read the HTML
Take out the XML snippet that is commented out at the bottom of the HTML
Take that string and pass it to ET.fromstring(); as far as I understand this function takes a string with triple quotes, but the string is not being formatted properly and hence the error is thrown

Answers

First, split the HTML from the XML by reading the file line by line and using str.startswith to detect the comment block:

with open('xmlfile.xml') as fh:
    html, xml = [], []

    for line in fh:
        # stop at the line that opens the comment block
        if line.startswith('<!--'):
            # keep this line without the leading '<!--': it carries the XML declaration
            xml.append(line[4:])
            break

        html.append(line)

    # keep iterating over the rest of the file
    for line in fh:
        # stop at the line that closes the comment block
        if line.startswith('-->'):
            break
        xml.append(line)

# the sample already contains the closing </ROOTTAG>, so no closing tag is appended
xml = ''.join(xml)
html = ''.join(html)

Now you should be able to parse the two parts independently, using your parser of choice.

- ThomasLehmann
- May 11, 2023 at 6:08 am
- 0 votes
You were already on the right track. I put your HTML in a file, and the following works fine:
```
import xml.etree.ElementTree as ET

with open('extract_xml.html') as handle:
    content = handle.read()
    # slice out everything between the comment delimiters
    xml = content[content.index('<!--') + 4:content.index('-->')]
    document = ET.fromstring(xml)

    for element in document.findall("./mytag"):
        for child in element:
            print(child, child.text)
```
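For comparison, here is a minimal sketch of why the attempt in the question probably failed. It assumes the cause is the combination of `str()` on bytes and the added triple quotes; the `raw` bytes are a hypothetical, shortened stand-in for the real file, and `cp1252` is chosen only because the XML declaration in the question names Windows-1252.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the bytes read with open(..., 'rb').
raw = (
    b'<html></html>\n'
    b'<!--<?xml version="1.0"?>\n'
    b'<ROOTTAG><mytag><val><![CDATA[5]]></val></mytag></ROOTTAG>\n'
    b'-->\n'
)

# str() on bytes returns the repr: it starts with b' and contains
# literal backslash-n sequences instead of real newlines.
as_repr = str(raw)

# Decoding gives the actual text (encoding is an assumption).
text = raw.decode("cp1252")
start = text.index("<!--") + len("<!--")
end = text.index("-->", start)
xml_text = text[start:end]

# Triple quotes are only Python source syntax; adding them to the
# data puts stray characters in front of the XML declaration.
try:
    ET.fromstring("'''" + xml_text + "'''")
    wrapped_ok = True
except ET.ParseError:
    wrapped_ok = False

# The plain, decoded slice parses without any quoting.
root = ET.fromstring(xml_text)
print(root.tag, wrapped_ok)
```

In other words, `ET.fromstring()` takes an ordinary string; no quoting of any kind has to be added to it.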

With the built-in html.parser module you get the XML comment as a string, which you can then parse with xml.etree.ElementTree:

from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class MyHTMLParser(HTMLParser):
        
    def handle_comment(self, data):
        xml_str = data
        tree = ET.fromstring(xml_str)
        for elem in tree.iter():
            print(elem.tag, elem.text)

parser = MyHTMLParser()

with open("your.html", "r") as f:
    lines = f.readlines()
    
for line in lines:
    parser.feed(line)

Output:

ROOTTAG 
  
mytag 
    
headername BASE
fieldname NAME
val Testcase
mytag 
    
headername BASE
fieldname AGE
val 5
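If the page may also contain ordinary comments, parsing inside `handle_comment` would raise on the first one that is not XML. A possible variant, sketched here with the hypothetical names `CommentCollector` and `xml_roots_from_html`, collects all comments first and keeps only those that parse:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment it sees."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def xml_roots_from_html(html_text):
    """Return the root element of every comment that is well-formed XML."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()

    roots = []
    for comment in collector.comments:
        try:
            roots.append(ET.fromstring(comment.strip()))
        except ET.ParseError:
            # Not XML (for example a plain note), so skip it.
            continue
    return roots


# Hypothetical sample with one plain comment and one XML comment.
sample = (
    '<html><body><!-- plain note --><p>x</p></body></html>\n'
    '<!--<?xml version="1.0"?>\n'
    '<ROOTTAG><mytag><fieldname>AGE</fieldname>'
    '<val><![CDATA[5]]></val></mytag></ROOTTAG>\n'
    '-->'
)
roots = xml_roots_from_html(sample)
print([root.tag for root in roots])
```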

If you read the file one line at a time you’ll find this easier to manage.

import xml.etree.ElementTree as ET

START_COMMENT = '<!--'
END_COMMENT = '-->'

def getxml(filename):
    with open(filename) as data:
        lines = []
        inxml = False
        for line in data.readlines():
            if inxml:
                if line.startswith(END_COMMENT):
                    inxml = False
                else:
                    lines.append(line)
            elif line.startswith(START_COMMENT):
                inxml = True
        return ''.join(lines)

ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
print(xml)

Output:

<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
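Once the XML string is extracted, the values can be pulled out of the tree. The following is a small sketch, assuming each `mytag` holds one `fieldname` and one `val` as in the sample; the `fields` dictionary is just an illustrative way to hold the result.

```python
import xml.etree.ElementTree as ET

# The XML as extracted from the comment in the question's sample.
xml_text = """<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>"""

root = ET.fromstring(xml_text)

# Map each fieldname to its value; CDATA content comes back as plain text.
fields = {
    tag.findtext("fieldname"): tag.findtext("val")
    for tag in root.findall("mytag")
}
print(fields)  # {'NAME': 'Testcase', 'AGE': '5'}
```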
