I use the PHP cURL code from this answer: https://stackoverflow.com/a/46834320/12616388. When I run the script on localhost, I get the desired output. When I run it from my web server, I get a captcha asking me to verify that I am not a bot. I am new to this topic and would like to know the cause. My code:

$request = array();
//$request[] = 'host:www.amazon.com';
$request[] = 'Connection: keep-alive';
$request[] = 'Pragma: no-cache';
$request[] = 'Cache-Control: no-cache';
$request[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8';
$request[] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0';//Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36';
$request[] = 'DNT: 1';
$request[] = 'Accept-Encoding: gzip, deflate';
$request[] = 'Accept-Language: en-US,en;q=0.8';

$url = 'https://www.amazon.de/Wenn-Dunkeln-Sterne-funkeln-Puste-Licht-Buch/dp/3480236529/ref=sr_1_3?keywords=buch&qid=1670662644&sr=8-3';
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, $request);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
$output = curl_exec($ch);

EDIT:
I slightly modified the code (a random user agent string and multiple cURL requests in a loop), but the problem is the same: on localhost everything works, while on the web server I get the captchas.

$user_agents = array(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0',
);
foreach ($products as $key => $value) {
    $request = array();
    $request[] = 'Connection: keep-alive';
    $request[] = 'Pragma: no-cache';
    $request[] = 'Cache-Control: no-cache';
    $request[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8';
    $request[] = 'User-Agent: ' . $user_agents[array_rand($user_agents)];
    $request[] = 'DNT: 1';
    $request[] = 'Accept-Encoding: gzip, deflate';
    $request[] = 'Accept-Language: en-US,en;q=0.8';
    $url = $value['url'];
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_POST, false);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $request);
    curl_setopt($ch, CURLOPT_ENCODING, "");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    $output = curl_exec($ch);
    ...
}

2 Answers


  1. Since the captcha is only triggered when the script runs on the server, it probably tracks IP addresses. Any chance it’s a reCAPTCHA?

    Whatever the captcha is, one thing that could help is solving the captcha from the webserver’s IP address.

    If the webserver has a desktop environment, connect via VNC (or whatever you usually use for connecting), open a browser and solve the captcha.

    If it does not, try setting up a VPN server on the webserver (this one seems easy enough), connect to the VPN from your computer (and thus get the same IP address as your webserver), open a browser and solve the captcha.

    Another option is creating a proxy server, which will achieve a similar result to a VPN.

    Sadly, you’ll have to repeat this from time to time, because that is exactly what a captcha is for: preventing automated scraping of websites by bots.

  2. To fix this, you can try to include additional headers or cookies in your cURL request to make it look more like a request from a real user. For example, you could set the User-Agent header to identify a browser and operating system, and the Cookie header to send the cookies that a real browser would typically send.

    For example:

    $ch = curl_init();
    
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    
    // Include additional headers
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
      'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
      'Cookie: __cfduid=<cookie-data-goes-here>'
    ));
    
    $response = curl_exec($ch);
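
The two suggestions above (sending requests through a proxy and sending cookies the way a browser would) can be combined in one cURL setup. The following is only a minimal sketch under stated assumptions, not a confirmed fix: the helper names `buildCurlOptions` and `looksLikeCaptcha`, the proxy address `127.0.0.1:8080`, the `example.com` URL and the cookie file path are all placeholders that do not come from the question. `CURLOPT_PROXY`, `CURLOPT_COOKIEJAR` and `CURLOPT_COOKIEFILE` are standard PHP cURL options. The captcha check is a rough string heuristic, so it may well miss or misreport pages.

```php
<?php
// Hypothetical helper: builds a cURL option array that keeps cookies between
// requests and, optionally, routes the request through a proxy.
function buildCurlOptions(string $url, string $cookieFile, ?string $proxy = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '',          // accept every encoding cURL supports
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies from earlier requests are read from here
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 10,
    ];

    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;      // placeholder format: 'host:port'
    }

    return $options;
}

// Hypothetical heuristic: treats any response body that mentions "captcha"
// as a captcha page. This is an assumption, not documented site behaviour.
function looksLikeCaptcha(string $html): bool
{
    return stripos($html, 'captcha') !== false;
}

// Placeholder values for illustration only.
$options = buildCurlOptions(
    'https://example.com/',
    sys_get_temp_dir() . '/cookies.txt',
    '127.0.0.1:8080'
);

// Actual request, left commented out so the sketch runs without network access:
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $output = curl_exec($ch);
// curl_close($ch);
// if ($output !== false && looksLikeCaptcha($output)) {
//     // The captcha has to be solved again from this IP address.
// }
```

Returning the options as an array keeps the request setup in one place, so the same settings can be applied to every handle in a loop with `curl_setopt_array()`.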
    