I have some VBA coding experience, but I am still a newbie at web scraping.

I face the following problem:

I am trying to scrape the article text by getting all the <p> tags that come after <h2> tags.

The example HTML code is:

<div class="text-article_inside"> 
   <p>Paragraph0</p>

   <h2>Heading Text1</h2>

   <p>Paragraph1</p>
   <p>Paragraph2</p>

   <h2>Heading Text2</h2>

   <p>Paragraph3</p>
   <p>Paragraph4</p>
   <p>Paragraph5</p>
</div>

I know how to do this when all the <p> tags come first and all the <h2> tags follow: I can use getElementsByTagName to loop through all the <p> tags and then through all the <h2> tags. The problem is that the two tags alternate throughout the article, so that approach does not work, as it breaks the order of the text.

I’ve found a lot of similar questions in Python and in Java, but none of them use VBA.

Does anyone know of a way to scrape the needed data in VBA?

My attempt:

Set HTMLCs = HTMLDoc.getElementsByTagName("div")

For Each HTMLC In HTMLCs

        If HTMLC.getAttribute("className") Like "text-article__inside" Then
        
            Set pTags = HTMLC.getElementsByTagName("p")
            
            N = pTags.Length + HTMLC.getElementsByTagName("h2").Length
            
            ReDim NewsContent(N)

            For Each pTag In pTags

                NewsContent(i) = pTag.innerText
                i = i + 1

            Next pTag

        End If
    
Next HTMLC

How should I handle the <h2> tag?

Answers


  1. Using your sample HTML, something like this may help, as suggested by @C3roe:

    Sub Test()
        Dim strHTML As String, objHTML As Object, divArticlesInside As Object
        Dim i As Integer, h2Tags As Object, j As Integer, Node As Object
        
        strHTML = "    <div class=""text-article_inside"">              " & _
                  "         <p>Paragraph0</p>                           " & _
                  "         <h2>Heading Text1</h2>                      " & _
                  "         <p>Paragraph1</p>                           " & _
                  "         <p>Paragraph2</p>                           " & _
                  "         <h2>Heading Text2</h2>                      " & _
                  "         <p>Paragraph3</p>                           " & _
                  "         <p>Paragraph4</p>                           " & _
                  "         <p>Paragraph5</p>                           " & _
                  "    </div>"
                  
        Set objHTML = CreateObject("HTMLFILE")
        
        objHTML.body.innerHTML = strHTML
        
        Set divArticlesInside = objHTML.getElementsByTagName("div")
        
        For i = 0 To divArticlesInside.Length - 1
            If divArticlesInside(i).ClassName = "text-article_inside" Then
                Set h2Tags = divArticlesInside(i).getElementsByTagName("h2")
                
                For j = 0 To h2Tags.Length - 1
                    ' Start from the node right after the current <h2> (index j, not i)
                    Set Node = h2Tags(j).NextSibling
                    ' Walk forward while the siblings are <p> elements
                    Do While Not Node Is Nothing
                        If Node.nodeName <> "P" Then Exit Do
                        MsgBox Node.innerText
                        Set Node = Node.NextSibling
                    Loop
                Next
            End If
        Next
        
        Set objHTML = Nothing
    End Sub
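
    If you also want the heading text kept in place between the paragraphs, a variation is to walk the div's child elements in document order instead of starting from each <h2>. This is only a minimal sketch, assuming the <h2> and <p> tags are direct children of the div, as in your sample HTML. The names `CollectArticleTextInOrder`, `parts` and `seenHeading` are made up for illustration, and paragraphs before the first heading are skipped on the assumption that you do not want them.

    ```vba
    Sub CollectArticleTextInOrder()
        Dim objHTML As Object, divs As Object, kids As Object, child As Object
        Dim parts() As String, n As Long, i As Long, k As Long
        Dim seenHeading As Boolean

        ' Late-bound MSHTML document loaded with a shortened copy of the sample HTML
        Set objHTML = CreateObject("HTMLFILE")
        objHTML.body.innerHTML = _
            "<div class=""text-article_inside"">" & _
            "<p>Paragraph0</p><h2>Heading Text1</h2>" & _
            "<p>Paragraph1</p><p>Paragraph2</p>" & _
            "<h2>Heading Text2</h2><p>Paragraph3</p></div>"

        Set divs = objHTML.getElementsByTagName("div")
        For i = 0 To divs.Length - 1
            If divs(i).className = "text-article_inside" Then
                ' children returns element nodes only, in document order
                Set kids = divs(i).Children
                seenHeading = False
                For k = 0 To kids.Length - 1
                    Set child = kids(k)
                    Select Case UCase$(child.tagName)
                        Case "H2"
                            seenHeading = True
                            ReDim Preserve parts(n)
                            parts(n) = child.innerText
                            n = n + 1
                        Case "P"
                            ' Assumption: skip paragraphs before the first <h2>
                            If seenHeading Then
                                ReDim Preserve parts(n)
                                parts(n) = child.innerText
                                n = n + 1
                            End If
                    End Select
                Next k
            End If
        Next i

        ' Print the collected text to the Immediate window, one item per line
        If n > 0 Then Debug.Print Join(parts, vbCrLf)
    End Sub
    ```

    Because everything is collected in a single pass, the headings and paragraphs stay in the order they appear on the page. If the tags are nested more deeply on the real site, this would need adjusting.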
    