
I am scraping version information from a website. I can retrieve the information, but not without the surrounding formatting. I am currently targeting the div tag with id j_idt19. Is there a way to get the value from the td within the table with id page_footer? I have not been able to reach the specific td that contains the text.

I would like to place the result into a CSV file, and also write just the text to a text file, in the form Num.NumNum.NumNumNum.

# Retrieve the page that contains the version number
$response = Invoke-WebRequest -Uri "https://www.somesite.com/index.xhtml"

# Select the text content of the div with id j_idt19
$results1 = $response.ParsedHtml.getElementsByTagName("Div") | Where-Object {$_.id -eq "j_idt19"} | Select-Object -Property TextContent
$results2 = $response.ParsedHtml.getElementsByTagName("Div") | Where-Object {$_.id -eq "j_idt19"} | Select-Object -Property TextContent | Out-String

Write-Output $results1
$results1 | Export-Csv -Path "C:\Users\ASTRTW3\Desktop\David_Scripts\URL_TEST5.csv"
$results2 | Out-File -FilePath "C:\Users\ASTRTW3\Desktop\David_Scripts\URL_TEST5.txt"

HTML code being scraped

<div id="j_idt19" class="ui-layout-unit ui-widget ui-widget-content ui-corner-all ui-layout-south ui-layout-pane ui-layout-pane-south" style="position: absolute; margin: 0px; inset: auto 5px 0px; width: auto; z-index: 0; height: 26px; display: block; visibility: visible;"><div class="ui-layout-unit-content ui-widget-content" style="position: relative; height: 22px; visibility: visible;">

  <table id="page_footer" style="width: 100%; border-top: 1px solid #cbc3be !important;">
    <tbody><tr>
      <td style="width: 30%;">
        
      </td>

      <td style="width: 40%; text-align: center;"><span style="font-weight: bold;">1.14.012</span>
      </td>

      <td style="width: 15%; text-align: right;">&nbsp;</td>

      <td style="text-align: right; width: 20px; margin-top: 2px;"><div id="j_idt23" style="width:18px;height:18px;position:fixed;right:130px;bottom:2px"><div id="j_idt23_start" style="display:none"><img id="progressBar" src="/CSDB/resources/images/loader_footer.gif"></div><div id="j_idt23_complete" style="display:none"></div></div>
      </td>
    </tr>
  </tbody></table></div></div>

CSV result

#TYPE Selected.System.__ComObject
"textContent"
"

  
    
      
        
      

      1.14.012
      

      ?

      
      
    
  "

Text result

textContent                                                                           
-----------                                                                           
...                                                                                   

Expected result
CSV

#TYPE Selected.System.__ComObject
"textContent"
1.14.012

Text

1.14.012

2 Answers


  1. I’ll assume the value you’re after is always a version number contained in a <span> within a <td>. In that case you could use:

    $response.ParsedHtml.getElementById('j_idt19') | ForEach-Object {
        $ver = $null
        foreach ($td in $_.getElementsByTagName('td')) {
            $td.getElementsByTagName('span') |
                Where-Object { [version]::TryParse($_.textContent, [ref] $ver) } |
                Select-Object textContent
        }
    } | Export-Csv path\to\csv
    
  2. To scrape the specific td within the table with id page_footer, you can try using Invoke-WebRequest to fetch the page, and then drill down to the desired table and td using a combination of ParsedHtml.getElementsByTagName and filtering by id. Once you’re at the table level, navigate to your target td by indexing or additional filtering. PowerShell doesn’t directly support CSS selectors, so you’ll have to step through the DOM elements. For outputting just the text to a CSV or text file, utilize PowerShell’s Export-Csv and Out-File commands with the appropriate text content you’ve extracted.
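
    The steps described in this answer could be sketched roughly as follows. This is only an illustration under some assumptions: it expects Windows PowerShell 5.1, where Invoke-WebRequest exposes ParsedHtml; it uses innerText rather than textContent; the regular expression for a dotted version and the output file names version.csv and version.txt are hypothetical choices, not part of the original question.

    ```powershell
    # Assumption: Windows PowerShell 5.1, where ParsedHtml is available.
    $response = Invoke-WebRequest -Uri 'https://www.somesite.com/index.xhtml'

    # Go straight to the footer table by id instead of the outer div.
    $table = $response.ParsedHtml.getElementById('page_footer')
    if ($null -eq $table) { throw 'No element with id page_footer was found.' }

    # Read each cell's text, trim the surrounding whitespace, and keep the
    # first value that looks like a dotted version such as 1.14.012.
    # The pattern is an assumption about the format, not a site guarantee.
    $version = $table.getElementsByTagName('td') |
        ForEach-Object { "$($_.innerText)".Trim() } |
        Where-Object { $_ -match '^\d+(\.\d+)+$' } |
        Select-Object -First 1

    # Hypothetical output paths; adjust as needed.
    [pscustomobject]@{ textContent = $version } |
        Export-Csv -Path '.\version.csv' -NoTypeInformation
    $version | Set-Content -Path '.\version.txt'
    ```

    Trimming each cell before matching is what removes the blank lines seen in the CSV result above. -NoTypeInformation drops the #TYPE header line, so remove that switch if the line is wanted.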
