
I’ve hit an interesting problem that has eluded me thus far. I’m trying to extract specific information from a local html document. It’s essentially a series of tables, and I only need specific values. I’ve imported the document using

$sourcePath = "C:\Temp\Record.htm"
$oIE = New-Object -ComObject InternetExplorer.Application
$oIE.Navigate($sourcePath)
$sourceHTML = $oIE.Document

Using the IE COM object was necessary because "HTMLFile" created an object but none of the inner/outer text was available. I’ve broken the file down into cells for parsing, using

$sourceHTML.body.getElementsByTagName('td')

But herein lies my problem. I need to get the 7-digit number from this entry, but I am falling short:

<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>

Edit: Longer section of HTML as requested. There are about six of these tables in the document:

<p style="text-align: center;"><font style="color: rgb(255, 0, 0); font-family: Arial Narrow; 
font-size: 20pt; font-weight: bold;">TITLE OF TABLE
</font></p><h2>Registration Details</h2><br><table width="100%" bordercolor=
"#000000" border="1" cellspacing="0">
<tbody><tr>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b><font face="Arial" size="1">Personal Details</font></b></td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff">&nbsp;</td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b><font face="Arial" size="1">Contact (Work) Address Details</font></b></td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff">&nbsp;</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellspacing="0">
<tbody><tr>
<td width="25%"><font face="Arial" size="1"><b>Employment</b></font></td>
<td width="25%"><font face="Arial" size="1"><b>CompanyNameHere</b></font></td>
<td width="25%"><font face="Arial" size="1"><b>Workplace</b></font></td>
<td width="25%"><font face="Arial" size="1"><b>Test Street</b></font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Employment Type</b></font></td>
<td width="25%"><font face="Arial" size="1">Regular</font></td>
<td width="25%"><font face="Arial" size="1"><b>Address Line 1</b></font></td>
<td width="25%"><font face="Arial" size="1">10 Earth Place</font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Employment Category</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
<td width="25%"><font face="Arial" size="1"><b>Address Line 2</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Employment Option</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
<td width="25%"><font face="Arial" size="1"><b>Address Line 3</b></font></td>
<td width="25%"><font face="Arial" size="1"></font></td>
</tr>
<tr>
<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>
<td width="25%"><font face="Arial" size="1"><b>Suburb/Town/City</b></font></td>
<td width="25%"><font face="Arial" size="1">City Lakes</font></td>
</tr>
</tbody></table><br>

I tried using a regex to pull a 7-digit number starting with 8 (they all will), but that also matched every other run of digits beginning with 8, such as those inside GUIDs. Is there a better way to do this? I will need to pull multiple values from different tables in the document, and I don’t think a regex is a suitable method for everything. Ideally I would match the label cell (Serial Number) and then extract the value from the next cell in the same row, but I’m not 100% sure how to do that.

Thank you

2 Answers


  1. Assuming the number you’re looking for is in the table row where the td with the text Serial Number appears, this might help you get it without a regex, using the HtmlFile COM object:

    # Read the whole file as a single string
    $content = Get-Content "C:\Temp\Record.htm" -Raw
    $html = New-Object -ComObject HtmlFile
    # Feed the markup to the document as UTF-16 bytes
    $html.write([System.Text.Encoding]::Unicode.GetBytes($content))
    $html.getElementsByTagName('tr') | ForEach-Object {
        $_.getElementsByTagName('td') |
            Where-Object innerText |                                     # skip empty cells
            Where-Object { $_.innerText.Trim() -eq 'Serial Number' } |   # find the label cell
            ForEach-Object { $_.nextSibling.innerText }                  # value is in the next cell
    }
    
  2. To parse an HTML table using PowerShell, you can utilize the HTML Agility Pack library. The HTML Agility Pack allows you to load and navigate HTML documents, making it easy to extract data from HTML tables. Here’s an example PowerShell script that demonstrates how to parse an HTML table:

    # Install the HTML Agility Pack with the NuGet package manager first, e.g.:
    # Install-Package HtmlAgilityPack

    # Load the library
    Add-Type -Path "YOUR_PATH_TO\HtmlAgilityPack.dll"

    # Load the HTML document
    $html = New-Object HtmlAgilityPack.HtmlDocument
    $source = Get-Content -Path "YOUR_HTML_FILE_PATH" -Raw
    $html.LoadHtml($source)

    # Get the first table element
    $table = $html.DocumentNode.SelectSingleNode("//table")

    # Get the table headers. The leading "." keeps the search inside this table;
    # a bare "//th" would search the whole document. SelectNodes returns $null
    # when nothing matches, so guard before enumerating.
    $headerNodes = $table.SelectNodes(".//th")
    $headers = @(if ($null -ne $headerNodes) { $headerNodes | ForEach-Object { $_.InnerText.Trim() } })

    # Initialize an empty array for storing the table data
    $data = @()

    # Get the rows of this table only
    $rows = $table.SelectNodes(".//tr")

    # Iterate through each row
    foreach ($row in $rows) {
        # Get the table cells for each row; skip rows without <td> cells
        $cellNodes = $row.SelectNodes("td")
        if ($null -eq $cellNodes) { continue }
        $cells = @($cellNodes | ForEach-Object { $_.InnerText.Trim() })

        # Create a hashtable to store the row data
        $rowData = @{}

        # Assign each cell value to the corresponding header
        for ($i = 0; $i -lt $headers.Count; $i++) {
            $rowData[$headers[$i]] = $cells[$i]
        }

        # Add the row data to the array
        $data += $rowData
    }

    # Output the parsed table data
    $data
    

    In this script, you will need to replace the YOUR_PATH_TO placeholder in the Add-Type line with the actual folder that contains HtmlAgilityPack.dll. Similarly, replace "YOUR_HTML_FILE_PATH" with the path to your HTML file.

    The script uses the HtmlAgilityPack library to load the HTML document and navigate its elements. It retrieves the first table element in the document using an XPath expression and then iterates through each row and cell to extract the data.

    The parsed table data is stored in an array of hashtables, where each hashtable represents a row and its cell values are associated with their respective headers.

    Finally, the script outputs the parsed table data. You can modify the output to suit your requirements, such as exporting it to a CSV file or performing further data processing.
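
The script above maps cells to `<th>` headers, but the markup in the question has no `<th>` elements: each row alternates a bold label cell with a value cell. For that layout, an XPath that finds the label cell and steps to its next sibling may be a closer fit. The following is a minimal sketch, not a tested solution for the asker's file. `Get-LabelValueXPath` is a hypothetical helper name, and the lookup part assumes HtmlAgilityPack.dll has already been loaded with Add-Type as shown in the answer.

```powershell
# Hypothetical helper: builds an XPath that selects the cell right after a label cell.
function Get-LabelValueXPath {
    param([Parameter(Mandatory)][string]$Label)
    # Assumption: the label contains no single quote (XPath 1.0 has no escape for it).
    if ($Label.Contains("'")) { throw "Label must not contain a single quote." }
    "//td[normalize-space(.)='$Label']/following-sibling::td[1]"
}

$xpath = Get-LabelValueXPath -Label 'Serial Number'

# This part only runs if the HtmlAgilityPack type is already loaded in the session.
if ('HtmlAgilityPack.HtmlDocument' -as [type]) {
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml((Get-Content -Path 'YOUR_HTML_FILE_PATH' -Raw))
    # SelectNodes returns $null when nothing matches, so guard before enumerating.
    $nodes = $doc.DocumentNode.SelectNodes($xpath)
    if ($null -ne $nodes) {
        $nodes | ForEach-Object { $_.InnerText.Trim() }
    }
}
```

Because the question mentions about six tables, the expression can match once per table, so the loop emits every match rather than assuming a single one.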

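
Since the question asks for several values, not just the serial number, the label-then-next-cell idea from the first answer can be generalised into a label to value map. The sketch below does this with plain string handling so that it runs without COM or an external DLL. It assumes the very regular markup shown in the question (no nested tables, cells strictly alternating label and value), and `Get-LabelValueMap` is a hypothetical helper name. For less regular HTML, a real parser is the safer choice.

```powershell
# Sample markup copied from the question (one row of the details table).
$sample = @'
<tr>
<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>
<td width="25%"><font face="Arial" size="1"><b>Suburb/Town/City</b></font></td>
<td width="25%"><font face="Arial" size="1">City Lakes</font></td>
</tr>
'@

# Hypothetical helper: returns an ordered label -> value map for label/value cell pairs.
function Get-LabelValueMap {
    param([Parameter(Mandatory)][string]$Html)
    $map = [ordered]@{}
    # (?is) = ignore case, and let '.' span newlines; '.*?' stops at the first closing tag.
    foreach ($row in [regex]::Matches($Html, '(?is)<tr\b.*?</tr>')) {
        $cells = @(
            foreach ($cell in [regex]::Matches($row.Value, '(?is)<td\b[^>]*>(.*?)</td>')) {
                # Strip nested tags such as <font> and <b>, then decode entities like &nbsp;.
                $text = $cell.Groups[1].Value -replace '<[^>]+>', ''
                [System.Net.WebUtility]::HtmlDecode($text).Trim()
            }
        )
        # Cells alternate label, value, label, value within each row.
        for ($i = 0; $i + 1 -lt $cells.Count; $i += 2) {
            if ($cells[$i]) { $map[$cells[$i]] = $cells[$i + 1] }
        }
    }
    $map
}

$record = Get-LabelValueMap -Html $sample
$record['Serial Number']

# CSV cmdlets work on object properties, so cast the map to an object first.
[pscustomobject]$record | ConvertTo-Csv -NoTypeInformation
```

A later label overwrites an earlier one with the same text, so with several similar tables in one file it is probably best to call the helper once per table rather than on the whole document.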