skip to Main Content

Premise

There is this website I’m trying to scrape.

https://pb.nalog.ru/

If you put organization ID in a search bar, and press "Искать", it redirects you to a separate page with base url of https://pb.nalog.ru/search.html and hash "#t=*&mode=search-all&queryAll=ID" where t are current miliseconds (from Date.gettime())

The problem

If I use the url generated by macro and manually put it into a browser, it returns the correct page, but whenever I try to do it programmatically, it returns 404 Page not found website dummy; moreover, the url it returns is different from mine:

  • The one I got

https://pb.nalog.ru/search.html#t=1730356470622&mode=search-all&queryAll=9714055795

  • The one website received

/search.html%23t=1730356470622&mode=search-all&queryAll=9714055795

I assume %23 is a conversion of #, but I’m new to this and can’t say for sure. I will try to answer all of the follow-up questions.

Here’s the code in question:

Option Explicit

Private Type SYSTEMTIME
    wYear As Integer
    wMonth As Integer
    wDayOfWeek As Integer
    wDay As Integer
    wHour As Integer
    wMinute As Integer
    wSecond As Integer
    wMilliseconds As Integer
End Type

Private Declare PtrSafe Sub GetSystemTime Lib "kernel32" (lpSystemTime As SYSTEMTIME)

Function CurrentTimeMillis() As Double
    ' Returns the milliseconds from 1970/01/01 00:00:00.0 to system UTC
    Dim st As SYSTEMTIME
    GetSystemTime st
    Dim t_Start, t_Now
    t_Start = DateSerial(1970, 1, 1) ' Starting time for Linux
    t_Now = DateSerial(st.wYear, st.wMonth, st.wDay) + _
        TimeSerial(st.wHour, st.wMinute, st.wSecond)
    CurrentTimeMillis = DateDiff("s", t_Start, t_Now) * 1000 + st.wMilliseconds
End Function

Public Sub oopsie_doopsie()

    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument
    Dim curr As Double
    
    curr = CurrentTimeMillis(): Debug.Print curr
    With http
        .Open "GET", "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & "", False
        Debug.Print "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & ""
        DoEvents
        .send
        DoEvents
        html.body.innerHTML = .responseText
    End With
    
    html.getElementsByClassName ("pb-subject-status pb-subject-status--active")
    
End Sub

2

Answers


  1. I guess you need to use the XMLHTTP60.setRequestHeader to create the header which the server accepts:

        With http
            .Open "GET", "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & "", False
            .setRequestHeader bstrHeader, bstrValue
        ... 
    

    Use something like Postman to research the header which sent by any browser.

    I guess the server needs some meta info to render the page relevant to client’s browser, for eg.

    Login or Signup to reply.
  2. As far as I see, the page on the URL is not static and the element pb-subject-status pb-subject-status--active will not be visible even if you succeed in your request with your original code.

    On the other hand, I think there are some elements being filled from some JSON sources.

    My Language is not Russian so, I don’t understand from response I get in the below trial code. This code, will get some JSON data and write it in cell A1.

    Just check the data and see if your required data is available somewhere in the text. If you find your data there, then you can parse the string with Tim Hall’s JSon Parser

    Sub Test()
        Dim objHTTP As Object, strURL As String, PayLoad As String, requestID As String, strJSON As String
        
        Set objHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
        
        strURL = "https://pb.nalog.ru/search-proc.json"
        
        PayLoad = "mode=search-all&queryAll=9714055795&queryUl=&okvedUl=&regionUl=&statusUl=&isMspUl=&mspUl1=1&mspUl2=1&mspUl3=1&queryIp=&okvedIp=&regionIp=&statusIp=&isMspIp=&mspIp1=1&mspIp2=1&mspIp3=1&taxIp=&queryUpr=&uprType1=1&uprType0=1&queryRdl=&dateRdl=&queryAddr=&regionAddr=&queryOgr=&ogrFl=1&ogrUl=1&ogrnUlDoc=&ogrnIpDoc=&npTypeDoc=1&nameUlDoc=&nameIpDoc=&formUlDoc=&formIpDoc=&ifnsDoc=&dateFromDoc=&dateToDoc=&page=1&pageSize=10&pbCaptchaToken=&token="
           
        objHTTP.Open "POST", strURL, False
        objHTTP.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        objHTTP.send PayLoad
        
        strJSON = objHTTP.responseText
        requestID = Split(Split(strJSON, """id"":""")(1), """")(0)
            
        PayLoad = "id=" & requestID & "&method=get-response"
        
        strURL = "https://pb.nalog.ru/search-proc.json"
        objHTTP.Open "POST", strURL, False
        objHTTP.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        objHTTP.send PayLoad
        
        strJSON = objHTTP.responseText
        
        Range("A1") = strJSON
        
        Set objHTTP = Nothing
    End Sub
    

    .
    Screenshot of the page is added below:

    enter image description here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search