I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:
<!DOCTYPE html>
<html>
<head>
***
</head>
<body>
<div class="panel panel-primary call__report-modal-panel">
<div class="panel-heading text-center custom-panel-heading">
<h2>Report</h2>
</div>
<div class="panel-body">
<div class="panel panel-default">
<div class="panel-heading">
<div class="panel-title">Info</div>
</div>
<div class="panel-body">
<table class="table table-bordered table-page-break-auto table-layout-fixed">
<tr>
<td class="col-sm-4">ID</td>
<td class="col-sm-8">1</td>
</tr>
</table>
</div>
</div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
The requirement is to parse the XML that sits inside the comment in the HTML above.
So far I have tried reading the HTML file into a string and doing the following:
with open('my_html.html', 'rb') as file:
    d = str(file.read())
d2 = d[d.index('<!--') + 4:d.index('-->')]
d3 = "'''"+d2+"'''"
This returns the XML portion of the data in the string d3, wrapped in three single quotes.
Then I try to parse it with ElementTree:
ET.fromstring(d3)
but it fails with the following error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2
I need some help to basically:
- Read the HTML
- Take out the XML snippet that is commented out at the bottom of the HTML
- Pass that string to the ET.fromstring() function. Since this function takes a string with triple quotes, the string is not being formatted properly, hence the error
4 Answers
First, split the HTML from the XML by reading the file line by line and using an if string.startswith(...) check to pick out the comment block. You should then be able to parse the two parts independently with your parser of choice.
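A minimal sketch of that line-by-line split, assuming the file contains exactly one comment and that its opening marker starts a line. The function name split_html_and_xml and the shortened sample lines are made up for illustration:

```python
import xml.etree.ElementTree as ET

def split_html_and_xml(lines):
    """Return (html_text, xml_text), splitting on the <!-- ... --> markers."""
    html_lines, xml_lines = [], []
    in_comment = False
    for line in lines:
        if not in_comment and line.lstrip().startswith("<!--"):
            in_comment = True
            line = line.lstrip()[len("<!--"):]  # drop the opening marker
        if in_comment:
            if "-->" in line:
                # Keep only what precedes the closing marker.
                xml_lines.append(line.split("-->", 1)[0])
                in_comment = False
            else:
                xml_lines.append(line)
        else:
            html_lines.append(line)
    return "".join(html_lines), "".join(xml_lines)

# Shortened stand-in for the file from the question (hypothetical sample).
sample_lines = [
    "<!DOCTYPE html>\n",
    "<html><body><h2>Report</h2></body></html>\n",
    '<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>\n',
    "<ROOTTAG>\n",
    "<mytag><fieldname>NAME</fieldname><val><![CDATA[Testcase]]></val></mytag>\n",
    "<mytag><fieldname>AGE</fieldname><val><![CDATA[5]]></val></mytag>\n",
    "</ROOTTAG>\n",
    "-->\n",
]

html_text, xml_text = split_html_and_xml(sample_lines)
root = ET.fromstring(xml_text)
values = [m.findtext("val") for m in root.findall("mytag")]
```

A file object iterates line by line, so for a real file it should be enough to pass the open file instead of sample_lines, for example open('my_html.html', encoding='cp1252'). The encoding here is a guess based on the XML declaration.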
You were already on the right path. I put your HTML in a file and your approach works fine.
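One way the slicing approach from the question might look once it runs, sketched under the assumption that the document holds a single comment. Two things differ from the question's code. The text is handled as a real str, because calling str() on bytes gives the b'...' representation with literal \n escapes. The extracted text is also passed to ET.fromstring() unchanged, because triple quotes are only source-code syntax for writing a string literal and are not part of the data. The helper name extract_comment_xml and the shortened document are illustrative:

```python
import xml.etree.ElementTree as ET

def extract_comment_xml(text):
    """Return the text between the first '<!--' and the next '-->'."""
    start = text.index("<!--") + len("<!--")
    end = text.index("-->", start)
    return text[start:end].strip()

# Shortened stand-in for the file contents (hypothetical sample).
# For a real file, read it in text mode instead, e.g.
#   with open('my_html.html', encoding='cp1252') as file:
#       document = file.read()
# where the encoding is an assumption based on the XML declaration.
document = """<!DOCTYPE html>
<html><body><h2>Report</h2></body></html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
"""

root = ET.fromstring(extract_comment_xml(document))

# Collect fieldname -> value pairs from every <mytag> element.
fields = {m.findtext("fieldname"): m.findtext("val") for m in root.iter("mytag")}
```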
With the built-in html.parser module (Doc) you get the XML comment as a string, which you can then parse with xml.etree.ElementTree.
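A minimal sketch of the html.parser route, assuming the XML is the last (here the only) comment in the document. HTMLParser calls handle_comment() with the text between the comment markers, so a small subclass can collect it. The class name CommentCollector and the shortened document are illustrative:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class CommentCollector(HTMLParser):
    """Collect the body of every HTML comment encountered."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between '<!--' and '-->'
        self.comments.append(data)

# Shortened stand-in for the file contents (hypothetical sample).
document = """<!DOCTYPE html>
<html><body><h2>Report</h2></body></html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
"""

collector = CommentCollector()
collector.feed(document)
collector.close()

# Assumption: the XML lives in the last comment of the document.
root = ET.fromstring(collector.comments[-1].strip())
fields = {m.findtext("fieldname"): m.findtext("val") for m in root.iter("mytag")}
```

Compared with slicing on '<!--' by hand, this leaves locating the comment to the HTML parser, which may be more robust if the page later gains other markup around it.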
If you read the file one line at a time you’ll find this easier to manage.