I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

<!DOCTYPE html>
<html>
<head>
    ***
</head>
<body>
    <div class="panel panel-primary call__report-modal-panel">
        <div class="panel-heading text-center custom-panel-heading">
            <h2>Report</h2>
        </div>
        <div class="panel-body">
            <div class="panel panel-default">
                <div class="panel-heading">
                    <div class="panel-title">Info</div>
                </div>
                <div class="panel-body">
                    <table class="table table-bordered table-page-break-auto table-layout-fixed">
                        <tr>
                            <td class="col-sm-4">ID</td>
                            <td class="col-sm-8">1</td>
                        </tr>

            </table>
        </div>
    </div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>

</ROOTTAG>
-->

The requirement is to parse the XML that sits inside the comment in the HTML above.
So far I have read the HTML file into a string and done the following:

with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"

This returns the XML portion of the data in the string d3, wrapped in three single quotes.

Then I try to read it via ElementTree:

ET.fromstring(d3)

But it fails with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need some help to:

  • Read the HTML
  • Take out the XML snippet that is commented out at the bottom of the HTML
  • Take that string and pass it to the ET.fromstring() function; but since this function takes a string with triple quotes, the string is not being formatted properly, hence the error

4 Answers


  1. First, split up your HTML and XML by reading the file line by line and using an if with str.startswith() to find the comment block:

    with open('xmlfile.xml') as fh:
        html, xml = [], []
    
        for line in fh:
            # check for that comment line
            if line.startswith('<!--'):
                break
    
            html.append(line)
    
        # append current line
        xml.append(line)
    
        # keep iterating
        for line in fh:
            # check for ending block comment
            if line.startswith('-->'):
                break
            xml.append(line)
    
    # The XML in the sample already has its closing root tag, so nothing
    # needs to be appended. Join the lines, using the 4: slice to strip
    # off the leading '<!--'
    xml = ''.join(xml)[4:]
    html = ''.join(html)
    

    Now you should be able to parse them independently using your parser of choice.
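
As one possible way to do the XML half of that, here is a minimal sketch using xml.etree.ElementTree. The `xml` string below is a stand-in for the one built above (it is assumed to start right at the XML declaration, with the comment markers already stripped), and the `fields` name is only for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the extracted string; assumed to begin at the XML declaration.
xml = """<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
"""

root = ET.fromstring(xml)

# Map each fieldname to its value; CDATA content is returned as ordinary text.
fields = {tag.findtext("fieldname"): tag.findtext("val")
          for tag in root.findall("mytag")}
print(fields)  # {'NAME': 'Testcase', 'AGE': '5'}
```

Note that every value comes back as a string, so a numeric field such as AGE would still need an explicit int() conversion.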

  2. You were already on the right path. I put your HTML in a file and it works fine, as follows:

    import xml.etree.ElementTree as ET
    
    with open('extract_xml.html') as handle:
        content = handle.read()
        xml = content[content.index('<!--')+4: content.index('-->')]
        document = ET.fromstring(xml)
    
        for element in document.findall("./mytag"):
            for child in element:
                print(child, child.text)
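
For context on why the attempt in the question fails while this one works, the sketch below reproduces the two likely culprits on a small stand-in value (the `raw` bytes are an assumption, not the real file). Calling str() on bytes yields the repr, including the b'...' wrapper and escaped newlines, and the added ''' characters become part of the data rather than acting as Python quoting.

```python
import xml.etree.ElementTree as ET

# Stand-in for file.read() on a file opened in 'rb' mode.
raw = b"<!--<ROOTTAG><val>5</val></ROOTTAG>-->"

# str() on bytes gives the repr, not the decoded text.
as_repr = str(raw)
print(as_repr[:6])  # b'<!--

# Wrapping the slice in ''' puts literal quote characters before the first
# '<', so the parser rejects the document.
snippet = as_repr[as_repr.index("<!--") + 4:as_repr.index("-->")]
error = None
try:
    ET.fromstring("'''" + snippet + "'''")
except ET.ParseError as exc:
    error = exc
print("ParseError:", error)

# Decoding the bytes (or opening the file in text mode) and slicing is enough.
text = raw.decode("cp1252")
root = ET.fromstring(text[text.index("<!--") + 4:text.index("-->")])
print(root.findtext("val"))  # 5
```

Triple quotes are only a way to write a string literal in source code; ET.fromstring() accepts any str or bytes value.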
    
  3. With the built-in html.parser module you get the XML comment as a string, which you can parse with xml.etree.ElementTree:

    from html.parser import HTMLParser
    import xml.etree.ElementTree as ET
    
    class MyHTMLParser(HTMLParser):
            
        def handle_comment(self, data):
            xml_str = data
            tree = ET.fromstring(xml_str)
            for elem in tree.iter():
                print(elem.tag, elem.text)
    
    parser = MyHTMLParser()
    
    with open("your.html", "r") as f:
        lines = f.readlines()
        
    for line in lines:
        parser.feed(line)
    

    Output:

    ROOTTAG 
      
    mytag 
        
    headername BASE
    fieldname NAME
    val Testcase
    mytag 
        
    headername BASE
    fieldname AGE
    val 5
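
If the parsed tree is needed outside the handler, a variation is to collect the comment bodies and parse them afterwards. This is only a sketch: `CommentCollector` and the `page` string are hypothetical names introduced here, and it assumes the first comment in the page is the XML one.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class CommentCollector(HTMLParser):
    """Hypothetical helper: stores every comment body instead of printing."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between '<!--' and '-->'
        self.comments.append(data)

# Small stand-in for the real file contents.
page = (
    "<html><body><p>Report</p></body></html>\n"
    "<!--<ROOTTAG><mytag><fieldname>NAME</fieldname>"
    "<val><![CDATA[Testcase]]></val></mytag></ROOTTAG>-->\n"
)

collector = CommentCollector()
collector.feed(page)   # the whole document can be fed in one call
collector.close()

root = ET.fromstring(collector.comments[0])
print(root.findtext("mytag/fieldname"), root.findtext("mytag/val"))  # NAME Testcase
```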
    
  4. If you read the file one line at a time you’ll find this easier to manage.

    import xml.etree.ElementTree as ET
    
    START_COMMENT = '<!--'
    END_COMMENT = '-->'
    
    def getxml(filename):
        with open(filename) as data:
            lines = []
            inxml = False
            for line in data.readlines():
                if inxml:
                    if line.startswith(END_COMMENT):
                        inxml = False
                    else:
                        lines.append(line)
                elif line.startswith(START_COMMENT):
                    inxml = True
            return ''.join(lines)
    
    ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
    print(xml)
    

    Output:

    <ROOTTAG>
      <mytag>
        <headername>BASE</headername>
        <fieldname>NAME</fieldname>
        <val><![CDATA[Testcase]]></val>
      </mytag>
      <mytag>
        <headername>BASE</headername>
        <fieldname>AGE</fieldname>
        <val><![CDATA[5]]></val>
      </mytag>
    </ROOTTAG>
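
One caveat with matching on the comment markers alone is that a page may contain other comments, or markers that do not sit at the start of a line. The following is a sketch of a regex-based variant, under the assumption that the wanted comment is the one whose body begins with `<?xml`; `find_xml_comment` and `sample` are illustrative names, not part of any answer above.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match of everything between '<!--' and '-->', across lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def find_xml_comment(content):
    """Return the body of the first comment starting with '<?xml', else None."""
    for body in COMMENT_RE.findall(content):
        stripped = body.lstrip()
        if stripped.startswith("<?xml"):
            return stripped
    return None

sample = (
    "<html><!-- an ordinary comment --><body></body></html>\n"
    '<!--<?xml version="1.0"?>\n'
    "<ROOTTAG><mytag><fieldname>AGE</fieldname>"
    "<val><![CDATA[5]]></val></mytag></ROOTTAG>\n"
    "-->\n"
)

xml_text = find_xml_comment(sample)
root = ET.fromstring(xml_text)
print(root.tag, root.findtext("mytag/val"))  # ROOTTAG 5
```

Unlike the line-based version, this keeps the XML declaration in the extracted text, which ET.fromstring() accepts as long as it is the very first thing in the string.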
    