So I am trying to query the following URL: http://mil.sagepub.com/content/17/2/227.short
Here’s the situation: On a browser such as Chrome or Safari it will:
- 307 to https://mil.sagepub.com/content/17/2/227.short and then
- 301 to https://journals.sagepub.com/doi/abs/10.1177/03058298880170020901 - which returns 200
On cURL, it will:
- 307 to https://mil.sagepub.com/content/17/2/227.short
- which returns 503
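To compare the two redirect chains programmatically, the raw output of `curl_exec()` can be split into one entry per hop. The following is only a sketch: `parseRedirectChain()` is a hypothetical helper name, and it assumes the request was made with `CURLOPT_HEADER` and `CURLOPT_FOLLOWLOCATION` enabled, so that the header block of every hop is prepended to the final body.

```php
<?php
// Hypothetical helper (not part of the original code): turns the raw
// curl_exec() output into a list of hops with status code and Location.
// Assumes CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION were both enabled.
function parseRedirectChain( string $raw ): array {
    $hops = [];
    // Normalise line endings so header blocks are separated by a blank line
    $raw = str_replace( "\r\n", "\n", $raw );
    foreach ( explode( "\n\n", $raw ) as $block ) {
        // A header block starts with a status line such as "HTTP/2 307"
        if ( !preg_match( '#^HTTP/\S+\s+(\d{3})#', $block, $statusMatch ) ) {
            break; // The first block that is not a header block is the body
        }
        $location = null;
        if ( preg_match( '/^location:\s*(.+)$/mi', $block, $locationMatch ) ) {
            $location = trim( $locationMatch[1] );
        }
        $hops[] = [ 'status' => (int)$statusMatch[1], 'location' => $location ];
    }
    return $hops;
}
```

For the cURL case described above, the result would contain a 307 entry pointing at the https URL, followed by a 503 entry with no `location`.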
So naturally, I go to Chrome and copy the request to https://mil.sagepub.com/content/17/2/227.short as a bash cURL command. I paste it into bash, and I get a 503. I try copying the Safari request to the same page as a bash cURL command, and also get a 503. So seemingly, two cURL requests formatted to perfectly imitate the browser requests both return a 503.
In my PHP code, I tried experimenting with different cURL options, but it also only returns a 503. So I have 3 different OSs and PHP’s cURL library getting 503 responses, while web browsers get a 200 OK response.
Here is the outgoing request my PHP code tried to send with cURL:
GET /content/17/2/227.short HTTP/2
Host: mil.sagepub.com
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
authority: mil.sagepub.com
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
upgrade-insecure-requests: 1
cache-control: max-age=0
connection: keep-alive
keep-alive: 300
accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
accept-language: en-US,en;q=0.9,de;q=0.8
dnt: 1
sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
The method that sets all of the curl options and generates the above request header is as below:
$url = "https://mil.sagepub.com/content/17/2/227.short"
$full = true
$tor = false
$httpVersion = CURL_HTTP_VERSION_2_0 // HTTP/1.1 doesn't seem to work in this page
$this->userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
$this->curlTimeoutFull = 60
protected function getCurlOptions( $url, $full = false, $tor = false, $httpVersion = CURL_HTTP_VERSION_NONE ) {
    $requestType = $this->getRequestType( $url );
    if ( $requestType == "MMS" ) {
        $url = str_ireplace( "mms://", "rtsp://", $url );
    }
    $options = [
        CURLOPT_URL => $url,
        CURLOPT_HEADER => 1,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT => $this->curlTimeoutNoBody,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_COOKIEJAR => sys_get_temp_dir() . "checkifdead.cookies.dat",
        CURLOPT_HTTP_VERSION => $httpVersion,
        CURLINFO_HEADER_OUT => 1
    ];
    if ( $requestType == "RTSP" || $requestType == "MMS" ) {
        $header = [];
        $options[CURLOPT_USERAGENT] = $this->mediaAgent;
    } else {
        // Properly handle HTTP version
        // Emulate a web browser request but make it accept more than a web browser
        if ( in_array( $httpVersion, [CURL_HTTP_VERSION_1_0, CURL_HTTP_VERSION_1_1, CURL_HTTP_VERSION_NONE] ) ) {
            $header = [
                // @codingStandardsIgnoreStart Line exceeds 100 characters
                'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                // @codingStandardsIgnoreEnd
                'Accept-Encoding: gzip, deflate, br',
                'Upgrade-Insecure-Requests: 1',
                'Cache-Control: max-age=0',
                'Connection: keep-alive',
                'Keep-Alive: 300',
                'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
                'Accept-Language: en-US,en;q=0.9,de;q=0.8',
                'Pragma: '
            ];
        } elseif ( in_array( $httpVersion, [CURL_HTTP_VERSION_2, CURL_HTTP_VERSION_2_0, CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE, CURL_HTTP_VERSION_2TLS] ) ) {
            $parsedURL = $this->parseURL( $url );
            $header = [
                'authority: ' . $parsedURL['host'],
                //':method: get',
                //':path: ' . $parsedURL['path'],
                //':scheme: ' . strtolower( $parsedURL['scheme'] ),
                // @codingStandardsIgnoreStart Line exceeds 100 characters
                'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                // @codingStandardsIgnoreEnd
                'accept-encoding: gzip, deflate, br',
                'upgrade-insecure-requests: 1',
                'cache-control: max-age=0',
                'connection: keep-alive',
                'keep-alive: 300',
                'accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
                'accept-language: en-US,en;q=0.9,de;q=0.8',
                'dnt: 1'
            ];
            if ( $requestType == "HTTPS" ) {
                $header[] = 'sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"';
                $header[] = 'sec-ch-ua-mobile: ?0';
                $header[] = 'sec-ch-ua-platform: "' . $this->getRequestPlatform() . '"';
                $header[] = 'sec-fetch-dest: document';
                $header[] = 'sec-fetch-mode: navigate';
                $header[] = 'sec-fetch-site: none';
                $header[] = 'sec-fetch-user: ?1';
            }
        }
        if ( $this->customUserAgent === false ) {
            $options[CURLOPT_USERAGENT] = $this->userAgent;
        } else {
            $options[CURLOPT_USERAGENT] = $this->customUserAgent;
        }
    }
    if ( $requestType == 'FTP' ) {
        $options[CURLOPT_FTP_USE_EPRT] = 1;
        $options[CURLOPT_FTP_USE_EPSV] = 1;
        $options[CURLOPT_FTPSSLAUTH] = CURLFTPAUTH_DEFAULT;
        $options[CURLOPT_FTP_FILEMETHOD] = CURLFTPMETHOD_SINGLECWD;
        if ( $full ) {
            // Set CURLOPT_USERPWD for anonymous FTP login
            $options[CURLOPT_USERPWD] = "anonymous:[email protected]";
        }
    }
    if ( $full ) {
        // Extend timeout since we are requesting the full body
        $options[CURLOPT_TIMEOUT] = $this->curlTimeoutFull;
        $options[CURLOPT_HTTPHEADER] = $header;
        if ( $requestType != "MMS" && $requestType != "RTSP" ) {
            $options[CURLOPT_ENCODING] = 'gzip, deflate, br';
        }
        $options[CURLOPT_USERAGENT] = $this->userAgent;
    } else {
        $options[CURLOPT_NOBODY] = 1;
    }
    if ( $tor && self::$torEnabled ) {
        $options[CURLOPT_PROXY] = self::$socks5Host . ":" . self::$socks5Port;
        $options[CURLOPT_PROXYTYPE] = CURLPROXY_SOCKS5_HOSTNAME;
        $options[CURLOPT_HTTPPROXYTUNNEL] = true;
    } else {
        $options[CURLOPT_PROXYTYPE] = CURLPROXY_HTTP;
    }
    return $options;
}
My question is, what am I missing here?
3 Answers
Unfortunately, this appears to be Cloudflare using TLS fingerprinting to distinguish cURL requests from actual browsers. There is likely no way to work around this. Please correct me if I'm wrong here.
UPDATE
Last night I did not have time to look at the HTML. Now that I have, the result is disappointing: it is the HTML for the Cloudflare "Checking Security" message.
I retrieved the response headers, and it is a 503 response.
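A link checker could flag this situation instead of treating the 503 as a dead link. The sketch below is a heuristic only. `looksLikeCloudflareChallenge()` is a hypothetical helper, and the default body markers are assumptions about the challenge page wording that may be outdated. The `server: cloudflare` and `cf-ray` response headers are what Cloudflare normally sends, but a genuine origin outage served through Cloudflare carries them too, which is why the body is checked as well.

```php
<?php
// Hypothetical heuristic (not from the original post): guesses whether a
// response is a Cloudflare challenge page rather than a real server error.
// $headers is expected as [ 'header-name' => 'value' ].
function looksLikeCloudflareChallenge(
    int $status,
    array $headers,
    string $body,
    // Assumed marker strings; adjust them to what the page actually contains
    array $markers = [ 'Just a moment', 'Checking your browser' ]
): bool {
    $headers = array_change_key_case( $headers, CASE_LOWER );
    $server = strtolower( $headers['server'] ?? '' );
    $behindCloudflare = strpos( $server, 'cloudflare' ) !== false
        || isset( $headers['cf-ray'] );
    if ( !$behindCloudflare || !in_array( $status, [ 403, 503 ], true ) ) {
        return false;
    }
    foreach ( $markers as $marker ) {
        if ( stripos( $body, $marker ) !== false ) {
            return true;
        }
    }
    return false;
}
```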
I do not have security certificates set up for my curl and don’t really have the time to install them now.
This one is going to be a challenge. I wish I had more time now because I like a challenge. I may work on this in the near future (weeks/months).
If you are going to continue you may need some of these curl options.
Start with
CURLOPT_CAINFO / CURLOPT_CAPATH
These are the SSL security error codes.
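As a rough illustration of the `CURLOPT_CAINFO` / `CURLOPT_CAPATH` options named above, the sketch below builds the TLS-related part of an options array with certificate verification switched on. `tlsVerificationOptions()` is a hypothetical helper, and the bundle path in the usage note is a placeholder. It requires PHP's curl extension. Note that this only affects certificate verification and is not expected to change how the TLS handshake is fingerprinted.

```php
<?php
// Hypothetical helper (not from the original post): returns cURL options
// that enable certificate verification against a given CA bundle or directory.
function tlsVerificationOptions( ?string $caBundle = null, ?string $caDir = null ): array {
    $options = [
        CURLOPT_SSL_VERIFYPEER => true, // verify the certificate chain
        CURLOPT_SSL_VERIFYHOST => 2,    // verify that the certificate matches the host
    ];
    if ( $caBundle !== null ) {
        $options[CURLOPT_CAINFO] = $caBundle; // single PEM file with CA certificates
    }
    if ( $caDir !== null ) {
        $options[CURLOPT_CAPATH] = $caDir;    // directory holding CA certificates
    }
    return $options;
}
```

It could be merged over an existing options array with the array union operator, for example `$options = tlsVerificationOptions( '/path/to/cacert.pem' ) + $options;`, where the left-hand values take precedence.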
Response Headers
End of Update
I tried the curl below and it appeared to work.
When I used this header it did not work.
Notice the header() below.
This is the response
Using
header("Content-Type: text/html; UTF-8");
I got this:
If the issue here actually is TLS fingerprinting, then using an HTTP proxy like mitmproxy, HTTP Toolkit or NaïveProxy might be a possible workaround. For details, see the discussion here and here. Another option could be to use curl-impersonate, as pointed out here.
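A minimal sketch of the proxy idea in PHP, assuming an intercepting proxy such as mitmproxy is already running locally (mitmproxy listens on port 8080 by default). `withLocalProxy()` is a hypothetical helper and requires PHP's curl extension. The idea is that the proxy, not cURL, performs the TLS handshake with the target server, so the server sees the proxy's fingerprint. Whether that fingerprint passes the check is not guaranteed.

```php
<?php
// Hypothetical helper (not from the original post): routes an existing cURL
// options array through a local HTTP proxy.
function withLocalProxy(
    array $options,
    string $host = '127.0.0.1',
    int $port = 8080,          // mitmproxy's default listening port
    ?string $proxyCaFile = null
): array {
    $options[CURLOPT_PROXY] = $host . ':' . $port;
    $options[CURLOPT_PROXYTYPE] = CURLPROXY_HTTP;
    if ( $proxyCaFile !== null ) {
        // An intercepting proxy re-signs HTTPS traffic with its own CA, so
        // cURL has to trust that CA for certificate verification to succeed.
        $options[CURLOPT_CAINFO] = $proxyCaFile;
        $options[CURLOPT_SSL_VERIFYPEER] = true;
    }
    return $options;
}
```

The result can be passed to `curl_setopt_array()` as usual; the path to the proxy's CA certificate has to be supplied by the caller.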