
I am developing an application which crawls data from another website. That website is protected by a login, but I have an account there. My application should log in to that website and return the content of the protected web page. I managed to get this to work in Python using the requests package.

Now I want to accomplish the same thing in PHP using cURL. Unfortunately, so far I haven’t been able to get it working, and I would like your help.

Before you can log in, the website requires a verification token. So, you first have to obtain the token, and then log in afterwards. Here is my (working!) Python code:

import requests

url = "https://www.mywebsite.com/login.php"
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}
s = requests.Session()

# Get token
r1 = s.get(url, headers = headers)
cacheToken = ExtractTokenFromText(r1.text) # some function defined by me

# Login
data = {'username': 'myusername', 
         'password': 'mypassword', 
         '__RequestVerificationToken': cacheToken}
r2 = s.post(url, headers = headers, data = data)
my_content = r2.text
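
The helper ExtractTokenFromText is not shown in the question. As a minimal sketch only, assuming the token sits in a hidden `<input>` named `__RequestVerificationToken` in the login page's HTML (the class `_TokenParser`, the `field_name` parameter, and the sample markup are hypothetical), it could look like this with the standard library:

```python
from html.parser import HTMLParser


class _TokenParser(HTMLParser):
    """Collects the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "input" and self.token is None:
            attributes = dict(attrs)
            if attributes.get("name") == self.field_name:
                self.token = attributes.get("value")


def ExtractTokenFromText(html, field_name="__RequestVerificationToken"):
    """Hypothetical helper: return the hidden field's value, or None if absent."""
    parser = _TokenParser(field_name)
    parser.feed(html)
    return parser.token


# Made-up markup standing in for the real login page
sample = '<form><input type="hidden" name="__RequestVerificationToken" value="abc123" /></form>'
print(ExtractTokenFromText(sample))
```

Returning None when the field is missing makes it easy to notice that the page layout differs from this assumption before sending the POST.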

Now I am trying to implement the same functionality in PHP using the cURL library. My PHP code is:

$url = "https://www.mywebsite.com/login.php";
$ch = curl_init();
    
// Get token
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
$r1 = curl_exec($ch);

function curlResponseHeaderCallback($ch, $headerLine) {
    global $cache_token;
    $cache_token = ExtractTokenFromHeader($headerLine); // some function defined by me
    return strlen($headerLine); // Needed by curl
}

// Login
$post_data = array('username' => $myusername,
                 'password' => $mypassword,
                 '__RequestVerificationToken' => $cache_token);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data));
$r2 = curl_exec($ch);
$my_content = $r2;

The PHP script correctly receives the $cache_token, so the GET request is executed correctly. Unfortunately, the POST request is not working: the server responds with HTTP 400 Bad Request and the following error message:

"Your antiforgery token is invalid."

I tried many things to fix the problem, but none of them worked:

  • Adding a user agent: curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36");
  • Adding curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); to the second request.
  • Adding curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); with one of the following three values:
      $cookiefile = getcwd() . '/cookie.txt';
      $cookiefile = __DIR__ . '/cookie.txt';
      $cookiefile = 'cookie.txt';
  • Adding curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile); (with the same values as above)
  • Adding curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
  • Enabling error reporting to find the problem
  • Using their API directly (it only works if you pay them)

I would like to stress that the Python version works, but the PHP version doesn’t. (So I’m sure there are no missing or hidden parameters, CAPTCHAs to deal with, etc.)
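
One difference between the two versions worth checking is cookie handling: requests.Session stores cookies from the GET and sends them with the POST, while a cURL handle only does so once its cookie engine has been enabled (for example through CURLOPT_COOKIEFILE or CURLOPT_COOKIEJAR). The sketch below illustrates why that can matter. It runs a toy server on localhost that accepts a form token only together with the cookie issued by the same GET. The real site's scheme is unknown, so `ToyLoginHandler`, the `af` cookie, and `login()` are all made up for illustration.

```python
import http.cookiejar
import secrets
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyLoginHandler(BaseHTTPRequestHandler):
    """Toy server: the form token is only valid with the cookie set by the GET."""

    def do_GET(self):
        token = secrets.token_hex(8)
        body = f'<input name="__RequestVerificationToken" value="{token}">'.encode()
        self.send_response(200)
        self.send_header("Set-Cookie", f"af={token}")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        sent = form.get("__RequestVerificationToken", [""])[0]
        # Accept only if the cookie from the GET came back with the token
        ok = sent != "" and f"af={sent}" in self.headers.get("Cookie", "")
        body = b"welcome" if ok else b"Your antiforgery token is invalid."
        self.send_response(200 if ok else 400)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def login(url, keep_cookies):
    """GET the token, then POST it back; returns the HTTP status of the POST."""
    handlers = [urllib.request.ProxyHandler({})]  # talk to localhost directly
    if keep_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    opener = urllib.request.build_opener(*handlers)
    html = opener.open(url).read().decode()
    token = html.split('value="')[1].split('"')[0]
    data = urllib.parse.urlencode({"__RequestVerificationToken": token}).encode()
    try:
        return opener.open(url, data=data).status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), ToyLoginHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/login"

with_cookies = login(url, keep_cookies=True)
without_cookies = login(url, keep_cookies=False)
server.shutdown()
server.server_close()
print(with_cookies, without_cookies)
```

If the real site behaves like this toy server, a POST that carries the right token but not the matching cookie is rejected in the same way as the PHP request above.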

My question is similar to this and this question, and to a lesser extent to this and this question, but their solutions either don’t work for me or there are no answers at all…

2 Answers


  1. Chosen as BEST ANSWER

    So, thanks to @Steven Penny and this great YouTube video I finally managed to get it working. Key differences:

    • Completely different way of getting the token from the first GET: not using CURLOPT_HEADERFUNCTION, but parsing it directly from the response body in $r1
    • Added curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); for both calls
    • Added curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); for the second call only

    My final script:

    $url = "https://www.mywebsite.com/login.php";
    $cookiefile = 'cookie.txt';
    $ch = curl_init();
        
    // Get token
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
    $r1 = curl_exec($ch);
    
    // Parse the login page and read the token from the hidden form field
    $dom = new DOMDocument;
    $dom->loadHTML($r1);
    $token = '';
    foreach ($dom->getElementsByTagName('input') as $input) {
        if ($input->getAttribute('name') === '__RequestVerificationToken') {
            $token = $input->getAttribute('value');
        }
    }
    
    // Login
    $post_data = array('username' => $myusername,
                     'password' => $mypassword,
                     '__RequestVerificationToken' => $token);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $r2 = curl_exec($ch);
    $my_content = $r2;
    

  2. It will depend on the site, of course, but I think I had a similar situation with GitHub. I have a GitHub login, and I wanted to programmatically access some info that wasn’t available through the API. To log in to GitHub:

    <?php
    $get_r = curl_init('https://github.com/login');
    curl_setopt($get_r, CURLOPT_RETURNTRANSFER, true);
    
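    # Get response cookie from login page. The cookie file is written when the
    # handle is closed. Since PHP 8.0 "curl_close" has no effect, so the handle
    # is also unset to make sure the file exists before the next request.
    curl_setopt($get_r, CURLOPT_COOKIEJAR, 'github.txt');
    $log_s = curl_exec($get_r);
    curl_close($get_r);
    unset($get_r);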
    
    # Get authenticity token
    preg_match('/name="authenticity_token" value="([^"]*)"/', $log_s, $auth_a);
    $post_m['authenticity_token'] = $auth_a[1];
    
    # Set username
    $post_m['login'] = getenv('USERNAME');
    
    # Set password
    $post_m['password'] = getenv('PASSWORD');
    
    $post_r = curl_init('https://github.com/session');
    curl_setopt($post_r, CURLOPT_COOKIEFILE, 'github.txt');
    curl_setopt($post_r, CURLOPT_POSTFIELDS, $post_m);
    curl_exec($post_r);
    

    After that, you can make GET requests with CURLOPT_COOKIEFILE set to github.txt.
