Html - Get trimmed down text from Powershell

David
April 16, 2024
223 views
0 votes
2 Answers

I am scraping version information from a website. I can retrieve the information, but not without the surrounding formatting. I am currently targeting the div with id j_idt19. Is there a way to get the text from the td within the table with id page_footer? I have been unable to reach the specific td that contains the text.

I would like to place the result into a CSV file, and write just the text to a text file in the form Num.NumNum.NumNumNum.

```
# Retrieve the page that shows the version number
$response = Invoke-WebRequest -Uri "https://www.somesite.com/index.xhtml"

# Select the text content of the div with id j_idt19
$results1 = $response.ParsedHtml.getElementsByTagName("div") | Where-Object { $_.id -eq "j_idt19" } | Select-Object -Property TextContent
$results2 = $response.ParsedHtml.getElementsByTagName("div") | Where-Object { $_.id -eq "j_idt19" } | Select-Object -Property TextContent | Out-String

# Show the object, then write it to a CSV file and a text file
Write-Output $results1
$results1 | Export-Csv -Path "C:\Users\ASTRTW3\Desktop\David_Scripts\URL_TEST5.csv"
$results2 | Out-File -FilePath "C:\Users\ASTRTW3\Desktop\David_Scripts\URL_TEST5.txt"
```

HTML code being scraped

```
<div id="j_idt19" class="ui-layout-unit ui-widget ui-widget-content ui-corner-all ui-layout-south ui-layout-pane ui-layout-pane-south" style="position: absolute; margin: 0px; inset: auto 5px 0px; width: auto; z-index: 0; height: 26px; display: block; visibility: visible;"><div class="ui-layout-unit-content ui-widget-content" style="position: relative; height: 22px; visibility: visible;">

  <table id="page_footer" style="width: 100%; border-top: 1px solid #cbc3be !important;">
    <tbody><tr>
      <td style="width: 30%;">
        
      </td>

      <td style="width: 40%; text-align: center;"><span style="font-weight: bold;">1.14.012</span>
      </td>

      <td style="width: 15%; text-align: right;">&nbsp;</td>

      <td style="text-align: right; width: 20px; margin-top: 2px;"><div id="j_idt23" style="width:18px;height:18px;position:fixed;right:130px;bottom:2px"><div id="j_idt23_start" style="display:none"><img id="progressBar" src="/CSDB/resources/images/loader_footer.gif"></div><div id="j_idt23_complete" style="display:none"></div></div>
      </td>
    </tr>
  </tbody></table></div></div>
```

CSV result

#TYPE Selected.System.__ComObject
"textContent"
"

  
    
      
        
      

      1.14.012
      

      ?

      
      
    
  "

Text result

textContent                                                                           
-----------                                                                           
...

Expected result

CSV

#TYPE Selected.System.__ComObject
"textContent"
1.14.012

Text

1.14.012

Answers

- SantiagoSquarzon
- April 12, 2024 at 2:20 am
- 0 votes
I’ll assume what you’re after is always a version contained in a <span> within a <td>, in which case the code you could use would be:
```
$response.ParsedHtml.getElementById('j_idt19') | ForEach-Object {
    $ver = $null
    foreach ($td in $_.getElementsByTagName('td')) {
        $td.getElementsByTagName('span') |
            Where-Object { [version]::TryParse($_.textContent, [ref] $ver) } |
            Select-Object textContent
    }
} | Export-Csv path\to\csv
```
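
The snippet above covers the CSV. For the plain text file the question also asks for, here is a minimal sketch. It assumes the untrimmed `textContent` is already in hand (simulated by the hypothetical `$rawText` sample) and that the version is the only dotted number in it. The file names under the temp directory are placeholders, not paths from the question.

```powershell
# Hypothetical sample standing in for the untrimmed textContent of the footer div.
$rawText = "`n`n  `n      1.14.012`n      `n`n      `n"

# Take the first run of digits separated by dots
# (assumption: nothing else in the text looks like a version).
$match   = [regex]::Match($rawText, '\d+(\.\d+)+')
$version = if ($match.Success) { $match.Value } else { $null }

# Placeholder output locations; replace with real paths.
$outDir  = [System.IO.Path]::GetTempPath()
$txtPath = Join-Path $outDir 'version.txt'
$csvPath = Join-Path $outDir 'version.csv'

if ($version) {
    # Text file: just the version string.
    Set-Content -Path $txtPath -Value $version

    # CSV: one column named textContent; -NoTypeInformation drops the #TYPE line.
    [pscustomobject]@{ textContent = $version } |
        Export-Csv -Path $csvPath -NoTypeInformation
}
```

Leave out `-NoTypeInformation` if the `#TYPE` header line is wanted in the CSV, as in the expected result shown in the question.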

- JoslenCaven
- April 16, 2024 at 8:07 pm
- 0 votes
To reach the specific td inside the table with id page_footer, fetch the page with Invoke-WebRequest and then step through the DOM exposed by ParsedHtml. Get the table first, either with getElementById('page_footer') or by filtering the result of getElementsByTagName by id, then pick the target td by index or by further filtering. PowerShell has no built-in cmdlet for CSS selectors, so the elements have to be walked manually. Once you have the element, trim its text content and pass that string to Export-Csv for the CSV or to Out-File for the text file.
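
The steps above rely on `ParsedHtml`, which is available in Windows PowerShell but not in PowerShell 6 and later. As a hedged alternative, and not the DOM walk itself, the sketch below pulls the version out of the raw HTML string with regular expressions. `Get-FooterVersion` is a hypothetical helper, and the sample markup is a trimmed copy of the footer from the question. Against the live page one would pass `$response.Content` instead.

```powershell
function Get-FooterVersion {
    # Hypothetical helper: returns the text of the first <span> inside
    # the table with id "page_footer", or $null when nothing matches.
    param([Parameter(Mandatory)][string]$Html)

    # (?s) lets '.' span newlines; capture everything inside the page_footer table.
    $table = [regex]::Match($Html, '(?s)<table[^>]*\bid="page_footer"[^>]*>(.*?)</table>')
    if (-not $table.Success) { return $null }

    # Within that table, capture the text of the first <span> and trim whitespace.
    $span = [regex]::Match($table.Groups[1].Value, '(?s)<span[^>]*>(.*?)</span>')
    if (-not $span.Success) { return $null }

    return $span.Groups[1].Value.Trim()
}

# Trimmed copy of the footer markup from the question.
$sampleHtml = @'
<table id="page_footer" style="width: 100%;">
  <tbody><tr>
    <td style="width: 30%;"></td>
    <td style="width: 40%; text-align: center;"><span style="font-weight: bold;">1.14.012</span>
    </td>
  </tr></tbody>
</table>
'@

$version = Get-FooterVersion -Html $sampleHtml
$version
```

Matching HTML with regular expressions is brittle. It is workable here only because the footer has a fixed shape with a single span; if the markup changes, the patterns need to change with it.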
