
I’m trying to automate some features and I need to scrape a web page.

So I’m using BrowserKit to make an external request to the website.

Everything seems fine, but there’s no useful content in the response because the target page relies on modern JavaScript.

Let’s take a look:

PHP file

require "./vendor/autoload.php";


use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create([
    'max_redirects' => 7,
]);

$response = $client->request(
    'GET',
    'https://secure.e-konsulat.gov.pl'
);

$statusCode = $response->getStatusCode();

$contentType = $response->getHeaders()['content-type'][0];

$content = $response->getContent();

dd($content); //dd() is installed globally on my php installation

And this is the value of $content:

^ """
<!DOCTYPE html><html lang="en"><head>


  <meta charset="utf-8">


  <title>System Zdalnej Rejestracji</title>


  <base href="/">


  <meta name="viewport" content="width=device-width, initial-scale=1">


  <link rel="icon" type="image/x-icon" href="favicon.ico">


  <link rel="preconnect" href="https://fonts.gstatic.com">


  <style type="text/css">@font-face{font-family:'Material Icons';font-style:normal;font-weight:400;src:url(https://fonts.gstatic.com/s/materialicons/v139/flUhRq ▶
<style>*,:after,:before{box-sizing:border-box;}@media (prefers-reduced-motion:no-preference){:root{scroll-behavior:smooth;}}body{margin:0;font-family:var(--bs-f ▶
<body>


  <!--[if lt IE 11]>


    <div style="padding: 1em; border-bottom: 1px solid #0052a5;">Serwis Ministerstwa Spraw Zagranicznych Rzeczypospolitej Polskiej</div>


    <div style="font-size: 1.5em; margin: 4em auto; max-width: 1024px; text-align: center;">


      <br />


      <p>Dear User,</p><br />


      <p>Your browser's version is not supported by the application.</p><br /><br />


      Please, actualize Your browser or use another one


    </div>


  <![endif]-->


  <app-root></app-root>


<script src="runtime.31b3be7ffe3f39288917.js" defer></script><script src="polyfills.da272157cf92c2e29a93.js" defer></script><script src="main.375c8bb1538d8323b9 ▶


</body></html>
"""

And as you can see, the response contains these lines:

  <p>Dear User,</p><br />


<p>Your browser's version is not supported by the application.</p><br /><br />


Please, actualize Your browser or use another one

UPDATE:

I also tried the HttpBrowser class, but the result is the same:

$browser = new \Symfony\Component\BrowserKit\HttpBrowser(HttpClient::create());
$browser->request('GET', 'https://secure.e-konsulat.gov.pl');
$response = $browser->getResponse();

dd($response);

Do you know how to solve this issue?

As a side note, this is my first experience with web scraping, so detailed answers are appreciated.

Thanks in advance

2 Answers


  1. Usually this is caused by software (in this case the receiving web server) validating your browser to check whether it can handle the website. One possible solution is to mimic the request of an up-to-date browser. This can be done by sending a User-Agent header with your request.

    Using HttpClient you can try the following:

    $response = $client->request(
        'GET',
        'https://secure.e-konsulat.gov.pl',
        [
           'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/116.0',
           ],
        ]
    );
    

    Using BrowserKit you can try the following:

    // BrowserKit takes request headers as server parameters with an HTTP_ prefix.
    $browser->setServerParameter('HTTP_USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/116.0');
    $browser->request('GET', 'https://secure.e-konsulat.gov.pl');
    

    The more closely you mimic a normal request made by a regular visitor, the more likely you are to receive a decent response. Another example can be found here.
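
    To check whether a different User-Agent actually changes anything, one option is to test whether the returned HTML still contains only the empty <app-root> placeholder seen in the question’s output. The following is a minimal sketch using PHP’s built-in DOMDocument. isUnrenderedShell() is a hypothetical helper name, and treating an empty <app-root> as "not rendered" is an assumption based on that output, not something the site documents:

    ```php
    <?php
    /**
     * Hypothetical helper (not part of Symfony): returns true when the HTML
     * contains a root placeholder element with no child elements and no
     * non-whitespace text, i.e. the page has not been rendered by JavaScript.
     */
    function isUnrenderedShell(string $html, string $rootTag = 'app-root'): bool
    {
        if (trim($html) === '') {
            return false; // nothing to inspect
        }

        $dom = new DOMDocument();
        // Collect libxml warnings (e.g. about the unknown <app-root> tag) silently.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $roots = $dom->getElementsByTagName($rootTag);
        if ($roots->length === 0) {
            return false; // no placeholder element at all
        }

        $root = $roots->item(0);

        return $root->getElementsByTagName('*')->length === 0
            && trim($root->textContent) === '';
    }
    ```

    Calling isUnrenderedShell($response->getContent()) after the request then tells you whether you received real content or only the placeholder.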

  2. If you need to execute JavaScript, use an actual web browser. I recommend the chrome-php/chrome project:

    $ composer require chrome-php/chrome

    One interesting note about the specific website you’re scraping: it is generated entirely with JavaScript. Everything is in the runtime.31b3be7ffe3f39288917.js file, and the JavaScript is slow, taking roughly 2-3 seconds to generate the page, so the page is rendered long after the DOMContentLoaded event has fired. Try:

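    declare(strict_types=1);
    require_once('vendor/autoload.php');
    // 'chromium' is the name (or path) of the Chrome/Chromium binary to launch
    $browserFactory = new \HeadlessChromium\BrowserFactory('chromium');
    $browser = $browserFactory->createBrowser([
        'headless' => true,
        'windowSize'   => [1920, 1080],
    ]);
    try {
        $page = $browser->createPage();
        $navigate = $page->navigate('https://secure.e-konsulat.gov.pl');
        // wait until network activity has settled, not just until DOMContentLoaded
        $navigate->waitForNavigation(\HeadlessChromium\Page::NETWORK_IDLE);
        // read the rendered DOM back as an HTML string
        $html = $page->evaluate('document.documentElement.outerHTML')->getReturnValue();
        var_dump([
            'html' => $html,
        ]);
    } finally {
        // always shut the browser process down
        $browser->close();
    }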
    
    • \HeadlessChromium\Page::NETWORK_IDLE fires after the page is actually generated 🙂
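
    If waiting for network idle ever turns out to be unreliable for a page that renders this late, another option is to poll until the content you need is present. The following is a rough sketch, not part of chrome-php/chrome: pollUntil() is a hypothetical helper, and the timeout and interval defaults are arbitrary assumptions:

    ```php
    <?php
    /**
     * Hypothetical helper: calls $probe repeatedly until it returns true
     * or the timeout expires. Returns true on success, false on timeout.
     */
    function pollUntil(callable $probe, float $timeoutSeconds = 10.0, int $intervalMs = 250): bool
    {
        $deadline = microtime(true) + $timeoutSeconds;

        do {
            if ($probe() === true) {
                return true;
            }
            usleep($intervalMs * 1000); // pause between attempts
        } while (microtime(true) < $deadline);

        return false;
    }

    // Possible usage with the $page object from the snippet above (assumes
    // the rendered page puts at least one child element inside <app-root>):
    //
    // $rendered = pollUntil(fn () => $page
    //     ->evaluate('document.querySelector("app-root")?.childElementCount > 0')
    //     ->getReturnValue() === true);
    ```

    The probe is an ordinary callable, so the same helper works for any condition you can check from PHP.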