Html - Get inner text of multiple <p> with its corresponding <h2>?

Night_Guardian_0910
November 23, 2024
34 views
0 votes
2 Answers

I have some VBA coding experience, but I am still a newbie at web scraping.

I face the following problem:

I am trying to scrape the article text by getting all the <p> tags that come after <h2> tags.

The example HTML code is:

<div class="text-article_inside"> 
   <p>Paragraph0</p>

   <h2>Heading Text1</h2>

   <p>Paragraph1</p>
   <p>Paragraph2</p>

   <h2>Heading Text2</h2>

   <p>Paragraph3</p>
   <p>Paragraph4</p>
   <p>Paragraph5</p>
</div>

I know how to do this when the page has only <p> tags followed by only <h2> tags: I can use getElementsByTagName to loop through all the <p> tags first and then through all the <h2> tags. The problem is that the two tags alternate throughout the article, so that approach does not work, because it breaks the order of the text.

I’ve found a lot of similar questions in Python and in Java, but none of them use VBA.

Does anyone know of a way to scrape the needed data in VBA?

My attempt:

Set HTMLCs = HTMLDoc.getElementsByTagName("div")

For Each HTMLC In HTMLCs

        If HTMLC.getAttribute("className") Like "text-article_inside" Then
        
            Set pTags = HTMLC.getElementsByTagName("p")
            
            N = pTags.Length + HTMLC.getElementsByTagName("h2").Length
            
            ReDim NewsContent(N)

            For Each pTag In pTags

                NewsContent(i) = pTag.innerText
                i = i + 1

            Next pTag

        End If
    
Next HTMLC

How should I handle the <h2> tag?

Answers

Using your sample HTML, something like this may help, as suggested by @C3roe:

Sub Test()
    Dim strHTML As String, objHTML As Object, divArticlesInside As Object
    Dim i As Integer, h2Tags As Object, j As Integer, Node As Object
    
    strHTML = "    <div class=""text-article_inside"">              " & _
              "         <p>Paragraph0</p>                           " & _
              "         <h2>Heading Text1</h2>                      " & _
              "         <p>Paragraph1</p>                           " & _
              "         <p>Paragraph2</p>                           " & _
              "         <h2>Heading Text2</h2>                      " & _
              "         <p>Paragraph3</p>                           " & _
              "         <p>Paragraph4</p>                           " & _
              "         <p>Paragraph5</p>                           "
              
    Set objHTML = CreateObject("HTMLFILE")
    
    objHTML.body.innerHTML = strHTML
    
    Set divArticlesInside = objHTML.getElementsByTagName("div")
    
    For i = 0 To divArticlesInside.Length - 1
        If divArticlesInside(i).ClassName = "text-article_inside" Then
            Set h2Tags = divArticlesInside(i).getElementsByTagName("h2")
            
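            For j = 0 To h2Tags.Length - 1
                ' Check the element that directly follows the j-th heading
                If h2Tags(j).NextSibling.TagName = "P" Then
                    Set Node = h2Tags(j).NextSibling
                    ' Show each <p> after this <h2>, stopping at the next non-<p> sibling
                    Do
                        MsgBox Node.innerText
                        Set Node = Node.NextSibling
                        If Node Is Nothing Then Exit Do
                    Loop While Node.TagName = "P"
                End If
            Next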
        End If
    Next
    
    Set objHTML = Nothing
End Sub
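
If you also need the paragraph that comes before the first heading (Paragraph0 in your sample), or want each paragraph paired with its heading, a variation is to walk the child elements of the div in document order instead of starting from each <h2>. The following is only a minimal sketch under some assumptions: the <p> and <h2> tags are direct children of the div, as in the sample; the Sub name ListArticleInOrder, the variable names, and the "(no heading)" placeholder are made up for illustration; and the results go to the Immediate window via Debug.Print rather than into your NewsContent array.

```vba
Sub ListArticleInOrder()
    ' Hypothetical helper for illustration; not part of the original answer
    Dim doc As Object, divs As Object, kids As Object, el As Object
    Dim i As Integer, k As Integer
    Dim currentHeading As String

    Set doc = CreateObject("HTMLFILE")
    doc.body.innerHTML = _
        "<div class=""text-article_inside"">" & _
        "<p>Paragraph0</p>" & _
        "<h2>Heading Text1</h2>" & _
        "<p>Paragraph1</p><p>Paragraph2</p>" & _
        "<h2>Heading Text2</h2>" & _
        "<p>Paragraph3</p><p>Paragraph4</p><p>Paragraph5</p>" & _
        "</div>"

    Set divs = doc.getElementsByTagName("div")

    For i = 0 To divs.Length - 1
        If divs(i).className = "text-article_inside" Then
            ' children holds only element nodes, in document order
            Set kids = divs(i).children
            ' Label for paragraphs that appear before the first <h2> (assumption)
            currentHeading = "(no heading)"

            For k = 0 To kids.Length - 1
                Set el = kids(k)
                Select Case UCase$(el.tagName)
                    Case "H2"
                        ' Remember the most recent heading
                        currentHeading = el.innerText
                    Case "P"
                        ' Pair each paragraph with the heading it falls under
                        Debug.Print currentHeading & " | " & el.innerText
                End Select
            Next k
        End If
    Next i

    Set doc = Nothing
End Sub
```

Because the loop visits the elements in the order they appear on the page, the text order is preserved, and you could assign to NewsContent(i) inside the Case branches instead of printing.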
