I’ve hit an interesting problem that has eluded me thus far. I’m trying to extract specific information from a local HTML document. It’s essentially a series of tables, and I only need specific values. I’ve imported the document using:
$sourcePath = "C:\Temp\Record.htm"
$oIE = New-Object -ComObject InternetExplorer.Application
$oIE.Navigate($sourcePath)
$sourceHTML = $oIE.Document
Using the IE COM object was necessary because "HTMLFile" created an object, but none of the inner/outer text was available. I’ve broken the file down into table cells for parsing, using:
$sourceHTML.body.getElementsByTagName('td')
But herein lies my problem. I need to get the 7-digit number from this entry, but I am falling short:
<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>
Edit: Longer section of HTML as requested. There are about six of these tables in the document:
<p style="text-align: center;"><font style="color: rgb(255, 0, 0); font-family: Arial Narrow;
font-size: 20pt; font-weight: bold;">TITLE OF TABLE
</font></p><h2>Registration Details</h2><br><table width="100%" bordercolor=
"#000000" border="1" cellspacing="0">
<tbody><tr>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b><font face="Arial" size="1">Personal Details</font></b></td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"> </td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b><font face="Arial" size="1">Contact (Work) Address Details</font></b></td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"> </td>
</tr>
</tbody></table>
<table width="100%" border="0" cellspacing="0">
<tbody><tr>
<td width="25%"><font face="Arial" size="1"><b>Employment</b></font></td>
<td width="25%"><font face="Arial" size="1"><b>CompanyNameHere</b></font></td>
<td width="25%"><font face="Arial" size="1"><b>Workplace</b></font></td>
<td width="25%"><font face="Arial" size="1"><b>Test Street</b></font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Employment Type</b></font></td>
<td width="25%"><font face="Arial" size="1">Regular</font></td>
<td width="25%"><font face="Arial" size="1"><b>Address Line 1</b></font></td>
<td width="25%"><font face="Arial" size="1">10 Earth Place</font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Employment Category</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
<td width="25%"><font face="Arial" size="1"><b>Address Line 2</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Employment Option</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
<td width="25%"><font face="Arial" size="1"><b>Address Line 3</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>
<td width="25%"><font face="Arial" size="1"><b>Suburb/Town/City</b></font></td>
<td width="25%"><font face="Arial" size="1">City Lakes</font></td>
</tr>
</tbody></table><br>
I tried to use a regex to pull a 7-digit number starting with 8 (they all will), but that also matched every 8 followed by digits elsewhere in the document, such as inside GUIDs. Is there a better way to do this? I will need to pull multiple values from different tables in the document, and I don’t think a regex is a suitable method for everything. Ideally, I could match the column header (Serial Number) and then extract the value from the next cell, but I’m not 100% sure how to do that.
Thank you
2 Answers
Assuming the 7-digit number you’re looking for is in the table row where the td with text Serial Number appears, this might help you get it without the need for regex, using the HtmlFile COM object:
To parse an HTML table using PowerShell, you can utilize the HTML Agility Pack library. The HTML Agility Pack allows you to load and navigate HTML documents, making it easy to extract data from HTML tables. Here’s an example PowerShell script that demonstrates how to parse an HTML table:
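The script itself is not present in this copy of the answer. What follows is a minimal sketch reconstructed from the description, not the answer author's code. It assumes the first row of the table holds the headers, that `//table` is an acceptable XPath (it selects the first table in the document), and it introduces a hypothetical helper, `ConvertTo-RowTable`, to pair cell values with header names. Both paths are placeholders, and the parsing step is skipped if they do not exist.

```powershell
# Sketch only: both paths are placeholders; prefer full paths, since the
# library may resolve relative paths against the process directory.
$dllPath  = "YOUR_PATH_TO\HtmlAgilityPack.dll"
$htmlPath = "YOUR_HTML_FILE_PATH"

# Hypothetical helper: pair each cell text with its header name.
# Cells without a usable header fall back to "Column<N>".
function ConvertTo-RowTable {
    param([string[]]$Headers, [string[]]$Cells)
    $row = [ordered]@{}
    for ($i = 0; $i -lt $Cells.Count; $i++) {
        $name = if ($i -lt $Headers.Count -and $Headers[$i]) { $Headers[$i] } else { "Column$($i + 1)" }
        $row[$name] = $Cells[$i]
    }
    return $row
}

$tableData = @()
if ((Test-Path $dllPath) -and (Test-Path $htmlPath)) {
    Add-Type -Path $dllPath
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.Load($htmlPath)

    # XPath for the first <table>; adjust it to target a specific table.
    $table = $doc.DocumentNode.SelectSingleNode("//table")
    $rows  = $table.SelectNodes(".//tr")

    # Assumption: the first row is the header row.
    $headers = @($rows[0].SelectNodes("th|td") | ForEach-Object { $_.InnerText.Trim() })
    for ($r = 1; $r -lt $rows.Count; $r++) {
        $cells = @($rows[$r].SelectNodes("th|td") | ForEach-Object { $_.InnerText.Trim() })
        $tableData += ConvertTo-RowTable -Headers $headers -Cells $cells
    }
}

# One ordered hashtable per data row.
$tableData
```

For the tables in the question, which have no header row and instead alternate label and value cells, the "first row is the header" assumption does not hold, so the row-to-hashtable mapping would need to be adapted.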
In this script, you will need to update "YOUR_PATH_TO\HtmlAgilityPack.dll" with the actual path to the HtmlAgilityPack.dll file. Similarly, replace "YOUR_HTML_FILE_PATH" with the path to your HTML file.
The script uses the HtmlAgilityPack library to load the HTML document and navigate its elements. It retrieves the table element using an XPath expression and then iterates through each row and cell to extract the data.
The parsed table data is stored in an array of hashtables, where each hashtable represents a row and its cell values are associated with their respective headers.
Finally, the script outputs the parsed table data. You can modify the output to suit your requirements, such as exporting it to a CSV file or performing further data processing.
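For the specific lookup the question asks about (find the cell whose text is Serial Number and read the cell after it), a dependency-free sketch is shown here. It is not taken from either answer. It assumes simple markup like the sample in the question, with no nested tables and no `>` inside attribute values; `Get-CellAfterLabel` and `$sample` are hypothetical names. A real HTML parser is more robust for anything less regular.

```powershell
# Hypothetical helper: return the text of the cell that follows each <td>
# whose text equals $Label.
function Get-CellAfterLabel {
    param(
        [Parameter(Mandatory)][string]$Html,
        [Parameter(Mandatory)][string]$Label
    )
    # Capture the inner markup of every <td>...</td>, in document order.
    $cellMatches = [regex]::Matches($Html, '(?is)<td\b[^>]*>(.*?)</td>')

    # Strip the remaining tags, decode entities, and trim each cell's text.
    $cells = @(foreach ($m in $cellMatches) {
        $text = $m.Groups[1].Value -replace '<[^>]+>', ''
        [System.Net.WebUtility]::HtmlDecode($text).Trim()
    })

    # Emit the cell after every cell that matches the label.
    for ($i = 0; $i -lt $cells.Count - 1; $i++) {
        if ($cells[$i] -eq $Label) { $cells[$i + 1] }
    }
}

# Row copied from the sample HTML in the question.
$sample = @'
<tr>
<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>
<td width="25%"><font face="Arial" size="1"><b>Suburb/Town/City</b></font></td>
<td width="25%"><font face="Arial" size="1">City Lakes</font></td>
</tr>
'@

Get-CellAfterLabel -Html $sample -Label 'Serial Number'   # 8111111
```

Because the match is anchored on the label cell rather than on the shape of the number, digits inside GUIDs are never considered. If the label occurs in several tables, the function emits one value per occurrence, and the same call works for other labels such as Suburb/Town/City.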