Premise
There is this website I’m trying to scrape.
If you put organization ID in a search bar, and press "Искать", it redirects you to a separate page with base url of https://pb.nalog.ru/search.html and hash "#t=*&mode=search-all&queryAll=ID" where t are current miliseconds (from Date.gettime()
)
The problem
If I use the url generated by macro and manually put it into a browser, it returns the correct page, but whenever I try to do it programmatically, it returns 404 Page not found website dummy; moreover, the url it returns is different from mine:
- The one I got
https://pb.nalog.ru/search.html#t=1730356470622&mode=search-all&queryAll=9714055795
- The one website received
/search.html%23t=1730356470622&mode=search-all&queryAll=9714055795
I assume %23 is a conversion of #, but I’m new to this and can’t say for sure. I will try to answer all of the follow-up questions.
Here’s the code in question:
Option Explicit
Private Type SYSTEMTIME
wYear As Integer
wMonth As Integer
wDayOfWeek As Integer
wDay As Integer
wHour As Integer
wMinute As Integer
wSecond As Integer
wMilliseconds As Integer
End Type
Private Declare PtrSafe Sub GetSystemTime Lib "kernel32" (lpSystemTime As SYSTEMTIME)
Function CurrentTimeMillis() As Double
' Returns the milliseconds from 1970/01/01 00:00:00.0 to system UTC
Dim st As SYSTEMTIME
GetSystemTime st
Dim t_Start, t_Now
t_Start = DateSerial(1970, 1, 1) ' Starting time for Linux
t_Now = DateSerial(st.wYear, st.wMonth, st.wDay) + _
TimeSerial(st.wHour, st.wMinute, st.wSecond)
CurrentTimeMillis = DateDiff("s", t_Start, t_Now) * 1000 + st.wMilliseconds
End Function
Public Sub oopsie_doopsie()
Dim http As New XMLHTTP60
Dim html As New HTMLDocument
Dim curr As Double
curr = CurrentTimeMillis(): Debug.Print curr
With http
.Open "GET", "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & "", False
Debug.Print "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & ""
DoEvents
.send
DoEvents
html.body.innerHTML = .responseText
End With
html.getElementsByClassName ("pb-subject-status pb-subject-status--active")
End Sub
2
Answers
I guess you need to use the XMLHTTP60.setRequestHeader to create the header which the server accepts:
Use something like Postman to research the header which sent by any browser.
I guess the server needs some meta info to render the page relevant to client’s browser, for eg.
As far as I see, the page on the URL is not static and the element
pb-subject-status pb-subject-status--active
will not be visible even if you succeed in your request with your original code.On the other hand, I think there are some elements being filled from some
JSON
sources.My Language is not Russian so, I don’t understand from response I get in the below trial code. This code, will get some
JSON
data and write it in cellA1
.Just check the data and see if your required data is available somewhere in the text. If you find your data there, then you can parse the string with Tim Hall’s JSon Parser
.
Screenshot of the page is added below: