I’m trying to automate some features, and for that I need to scrape a web page.
So I’m using Symfony’s BrowserKit and HttpClient to make an external request to the website.
The request itself succeeds, but the response contains nothing useful because the target page is rendered with modern JavaScript.
Let’s take a look:
PHP file
require "./vendor/autoload.php";

use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create([
    'max_redirects' => 7,
]);
$response = $client->request(
    'GET',
    'https://secure.e-konsulat.gov.pl'
);
$statusCode = $response->getStatusCode();
$contentType = $response->getHeaders()['content-type'][0];
$content = $response->getContent();
dd($content); // dd() is installed globally on my PHP installation
And this is the resulting $content:
^ """
<!DOCTYPE html><html lang="en"><head>
<meta charset="utf-8">
<title>System Zdalnej Rejestracji</title>
<base href="/">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" type="image/x-icon" href="favicon.ico">
<link rel="preconnect" href="https://fonts.gstatic.com">
<style type="text/css">@font-face{font-family:'Material Icons';font-style:normal;font-weight:400;src:url(https://fonts.gstatic.com/s/materialicons/v139/flUhRq ▶
<style>*,:after,:before{box-sizing:border-box;}@media (prefers-reduced-motion:no-preference){:root{scroll-behavior:smooth;}}body{margin:0;font-family:var(--bs-f ▶
<body>
<!--[if lt IE 11]>
<div style="padding: 1em; border-bottom: 1px solid #0052a5;">Serwis Ministerstwa Spraw Zagranicznych Rzeczypospolitej Polskiej</div>
<div style="font-size: 1.5em; margin: 4em auto; max-width: 1024px; text-align: center;">
<br />
<p>Dear User,</p><br />
<p>Your browser's version is not supported by the application.</p><br /><br />
Please, actualize Your browser or use another one
</div>
<![endif]-->
<app-root></app-root>
<script src="runtime.31b3be7ffe3f39288917.js" defer></script><script src="polyfills.da272157cf92c2e29a93.js" defer></script><script src="main.375c8bb1538d8323b9 ▶
</body></html>
"""
As you can see, the response contains these lines:
<p>Dear User,</p><br /> <p>Your browser's version is not supported by the application.</p><br /><br /> Please, actualize Your browser or use another one
UPDATE:
I also tried the HttpBrowser class, but the result is the same:
$browser = new \Symfony\Component\BrowserKit\HttpBrowser(HttpClient::create());
$browser->request('GET', 'https://secure.e-konsulat.gov.pl');
$response = $browser->getResponse();
dd($response);
Do you know how to solve this issue?
I should mention that this is my first experience with web scraping, so detailed answers are appreciated.
Thanks in advance
2 Answers
Usually this is caused by software (in this case the receiving web server) validating your browser to check whether it can handle the website. A possible solution is to mimic the request of a state-of-the-art browser. This can be done by sending a User-Agent header with your request.
With HttpClient, you can try setting a User-Agent entry in the headers option of the request. With BrowserKit, you can try setting the HTTP_USER_AGENT server parameter on the browser before making the request.
The more closely you mimic a normal request made by a regular visitor, the more likely you are to receive a decent response. Another example can be found here.
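The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of both variants. The User-Agent string is an example value rather than one taken from the answer, and a changed header may well not be enough for a page that is rendered entirely by JavaScript.

```php
<?php
require "./vendor/autoload.php";

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// Example value only (an assumption): copy the string your own browser sends.
$userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    . '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Variant 1: HttpClient, with the header set as a default option.
$client = HttpClient::create([
    'max_redirects' => 7,
    'headers' => ['User-Agent' => $userAgent],
]);
$content = $client
    ->request('GET', 'https://secure.e-konsulat.gov.pl')
    ->getContent();

// Variant 2: BrowserKit, which turns HTTP_* server parameters into request headers.
$browser = new HttpBrowser(HttpClient::create());
$browser->setServerParameter('HTTP_USER_AGENT', $userAgent);
$browser->request('GET', 'https://secure.e-konsulat.gov.pl');
$html = $browser->getResponse()->getContent();
```

Both variants still return only the raw HTML sent by the server; neither executes the scripts that the page references.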
If you need to execute JavaScript, use an actual web browser. I recommend the chrome-php/chrome project.
One interesting note about the specific website you’re scraping: it is generated entirely with JavaScript, and everything is in the runtime.31b3be7ffe3f39288917.js file. The JavaScript is also slow, taking about 2-3 seconds to actually generate the page, so the page is generated long after the DOMContentLoaded event is fired. A scraper therefore has to wait for rendering to finish instead of reading the HTML as soon as that event fires.
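The answer’s own code is cut off at this point, so what follows is a hedged sketch of the approach, not the original snippet. It assumes chrome-php/chrome is installed through Composer and that a Chrome or Chromium binary can be found on the machine. The fixed 3-second pause is an assumption derived from the 2-3 second estimate above.

```php
<?php
require "./vendor/autoload.php";

use HeadlessChromium\BrowserFactory;
use HeadlessChromium\Page;

// With no argument the factory tries to locate Chrome itself;
// pass a binary name or path if auto-detection fails on your system.
$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser();

try {
    $page = $browser->createPage();
    $page->navigate('https://secure.e-konsulat.gov.pl')
        ->waitForNavigation(Page::DOM_CONTENT_LOADED);

    // The app renders well after DOMContentLoaded, so give it time.
    // 3 seconds is a guess; tune it for your connection.
    sleep(3);

    // Serialized DOM after the scripts have run, not the raw server response.
    $html = $page->getHtml();
    echo $html;
} finally {
    // Always shut the headless browser process down.
    $browser->close();
}
```

A fixed sleep is the simplest option but also the most fragile one. Polling with $page->evaluate() until the app-root element has child nodes would be a more robust alternative.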