
I am trying to scrape the h1 element from the HTML body of a particular website:

<?php
    error_reporting(E_ALL);
    ini_set('display_errors', 1);
    header('Content-Type: text/plain; charset=utf-8');
    header('Access-Control-Allow-Origin: *');
    header('Access-Control-Allow-Methods: POST, GET, OPTIONS');

    if(isset($_POST["url"])){
        $user_agent = "Mozilla/5.0 (Macintosh; 
        Intel Mac OS X 10_14_4) AppleWebKit/537.36 
        (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"; 
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3600);
        curl_setopt($ch, CURLOPT_TIMEOUT, 3600);
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_VERBOSE, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        $html=curl_exec($ch);
        if (!curl_errno($ch)){
            $resultStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            if ($resultStatus == 200) {
                @$DOM = new DOMDocument;
                @$DOM->loadHTML('<?xml encoding="UTF-8">' . $html);
                echo $DOM->getElementsByTagName('h1')[0]->textContent;
            }
            else
                echo "Error: ".$resultStatus;
        }
        else
            echo "No h1 found ".curl_error($ch)
    }
?>

Specifically, I want the h1 element of this page:

https://neindiabroadcast.com/2023/03/24/bharat-gaurav-train-flagged-off-from-guwahati-for-arunachal-pradesh/

But I keep getting the following error:

No h1 found Failed to connect to neindiabroadcast.com port 443 after 15402 ms: Connection timed out

I tried increasing the connection timeout and the execution timeout to 3600 seconds, but the result is the same. How do I resolve this issue?

EDIT #1: I’ve discovered that the error only occurs on my live server. When I run the code on my local server, the data is fetched successfully.

2 Answers


  1. I tested your code. Apart from a few errors ($url is never assigned, and a semicolon is missing after the last echo), it works fine. Try this version:

    <?php
        error_reporting(E_ALL);
        ini_set('display_errors', 1);
        // header('Content-Type: text/plain; charset=utf-8');
        header('Access-Control-Allow-Origin: *');
        header('Access-Control-Allow-Methods: POST, GET, OPTIONS');

        if (isset($_GET['url'])) {
            // Read the target URL from the query string (the original never assigned $url).
            $url = $_GET['url'];
            // Keep the user agent on one line so no line breaks end up in the header.
            $user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36";
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
            // Disables certificate verification; avoid this in production.
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($ch, CURLOPT_VERBOSE, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            $html = curl_exec($ch);

            if (!curl_errno($ch)) {
                $resultStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                if ($resultStatus == 200) {
                    $DOM = new DOMDocument;
                    // Suppress warnings caused by malformed real-world HTML.
                    @$DOM->loadHTML('<?xml encoding="UTF-8">' . $html);
                    // item(0) returns null when the page has no h1.
                    $h1 = $DOM->getElementsByTagName('h1')->item(0);
                    echo $h1 !== null ? $h1->textContent : "No h1 found";
                } else {
                    echo "Error: " . $resultStatus;
                }
            } else {
                // A cURL error means the request itself failed, not that the h1 is missing.
                echo "Request failed: " . curl_error($ch);
            }
        }
    ?>
    
    <form>
        <input type="text" name="url">
        <button type="submit">Submit </button>
    </form>
    
  2. The timeout could have a number of causes:

    • Network configuration: the machine running the code cannot reach the requested domain. If the destination site is hosted on the same server (or the same network) as the script, the server may need to be told how to resolve that domain name.
    • (More likely) The requested URL is behind a firewall (for example Cloudflare) that drops the packets, especially when they come from an automated tool such as your script, which it may treat as a bot.

    I’d suggest using the curl command-line tool with the -vvv (high verbosity) option to request the URL that timed out, on the same machine that runs the PHP script. If the result is the same (a timeout, as in PHP), the problem is not in your code but in the underlying network or system configuration.
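
The same check can be approximated from PHP when shell access to the live server is not available. The following is a minimal sketch, not a definitive diagnostic: `diagnose_connection` and the keys of the array it returns are hypothetical names introduced here, and the 15-second default timeout is an arbitrary assumption. It captures cURL's verbose trace in memory and reports the error and timing fields from `curl_getinfo`, so the output on the live server can be compared with the output on the local one.

```php
<?php
// Hypothetical helper (not part of the original question or answers):
// request a URL and return cURL's error, timing data, and verbose trace.
function diagnose_connection(string $url, int $connectTimeout = 15): array
{
    // In-memory stream that receives the trace CURLOPT_VERBOSE would print to stderr.
    $log = fopen('php://temp', 'w+');

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => $connectTimeout,      // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $connectTimeout * 2,  // seconds allowed for the whole request
        CURLOPT_VERBOSE        => true,
        CURLOPT_STDERR         => $log,
    ]);
    curl_exec($ch);

    $report = [
        'errno'           => curl_errno($ch),   // 0 means the transfer succeeded
        'error'           => curl_error($ch),
        'http_code'       => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'primary_ip'      => curl_getinfo($ch, CURLINFO_PRIMARY_IP),
        'dns_seconds'     => curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME),
        'connect_seconds' => curl_getinfo($ch, CURLINFO_CONNECT_TIME),
    ];

    // Read back everything cURL wrote to the verbose log.
    rewind($log);
    $report['verbose'] = stream_get_contents($log);
    fclose($log);

    return $report;
}
```

Running `print_r(diagnose_connection($url));` with the failing URL on both machines gives two reports to compare. If the trace on the live server stops at the connection attempt while the local one continues to the TLS handshake, that would point to the network path or a firewall rather than the script.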
