I have some VBA coding experience, but I am still a newbie at web scraping.
I face the following problem:
I am trying to scrape the article text by getting all the <p> tags that come after <h2> tags.
The example HTML code is:
<div class="text-article_inside">
<p>Paragraph0</p>
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
</div>
I know how to do this when all the <p> tags come first and all the <h2> tags come after them: I can use getElementsByTagName to loop through all the <p> tags and then through all the <h2> tags. The problem is that the two tags alternate throughout the article, so that approach will not work, because it breaks the order of the text.
I’ve found a lot of similar questions in Python and in Java, but none of them use VBA.
Does anyone know of a way to scrape the needed data in VBA?
My attempt:
Set HTMLCs = HTMLDoc.getElementsByTagName("div")
For Each HTMLC In HTMLCs
    If HTMLC.getAttribute("className") Like "text-article__inside" Then
        Set pTags = HTMLC.getElementsByTagName("p")
        N = pTags.Length + HTMLC.getElementsByTagName("h2").Length
        ReDim NewsContent(N)
        For Each pTag In pTags
            NewsContent(i) = pTag.innerText
            i = i + 1
        Next pTag
    End If
Next HTMLC
How should I handle the <h2> tag?
2 Answers
Using your sample HTML, something like this may help, as suggested by @C3roe:
Sub Test()
    Dim strHTML As String, objHTML As Object, divArticlesInside As Object
    Dim i As Integer, h2Tags As Object, j As Integer, Node As Object
End Sub
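
For reference, here is a minimal sketch of one way to keep the text in document order: walk the direct children of the container div once and check each element's tag name, instead of querying <p> and <h2> separately. This is not the code from the answer above. The procedure name CollectParagraphsAfterHeadings and the variable names are made up for illustration, and the sketch assumes the <p> and <h2> elements are direct children of the div, as in the sample HTML. It uses the class name from the sample HTML (text-article_inside); adjust it if the real page differs.

```vba
Option Explicit

' Sketch: collect the text of every <p> that appears after the first <h2>
' inside the article container, preserving document order.
Sub CollectParagraphsAfterHeadings()
    Dim doc As Object, div As Object, container As Object, node As Object
    Dim results As Collection, seenHeading As Boolean
    Dim sampleHtml As String, entry As Variant

    ' Sample markup copied from the question (assumption: the real page
    ' has the same flat structure, with <p> and <h2> as direct children).
    sampleHtml = "<div class=""text-article_inside"">" & _
                 "<p>Paragraph0</p>" & _
                 "<h2>Heading Text1</h2>" & _
                 "<p>Paragraph1</p>" & _
                 "<p>Paragraph2</p>" & _
                 "<h2>Heading Text2</h2>" & _
                 "<p>Paragraph3</p>" & _
                 "<p>Paragraph4</p>" & _
                 "<p>Paragraph5</p>" & _
                 "</div>"

    ' Late-bound MSHTML document, so no library reference is required.
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = sampleHtml

    ' Locate the container by its class name.
    For Each div In doc.getElementsByTagName("div")
        If div.className = "text-article_inside" Then
            Set container = div
            Exit For
        End If
    Next div
    If container Is Nothing Then Exit Sub

    ' Visit the child elements in the order they appear in the page.
    Set results = New Collection
    For Each node In container.Children
        Select Case UCase$(node.tagName)
            Case "H2"
                seenHeading = True
                ' To keep the headings in the output as well, add:
                ' results.Add node.innerText
            Case "P"
                If seenHeading Then results.Add node.innerText
        End Select
    Next node

    ' Print the collected text to the Immediate window.
    For Each entry In results
        Debug.Print entry
    Next entry
End Sub
```

With the sample HTML this should list Paragraph1 through Paragraph5 and skip Paragraph0, because Paragraph0 comes before the first <h2>. A Collection is used instead of a ReDim'd array so the number of items does not have to be known in advance.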