
I want to create a universal website crawler using PHP.

Using my web application, a user will input any URL, specify what they need to get from the given site, and click a Start button.

My web application will then begin pulling data from the source website.

I am loading the page in an iframe and, using jQuery, getting the class and tag names of the specific area the user selects.

But when I load an external website like eBay or Amazon, it does not work, as these sites block framing. Is there any way to resolve this issue so I can load any site in an iframe? Or is there an alternative way to achieve what I want?

I am actually inspired by Mozenda, a tool developed in .NET: http://www.mozenda.com/video01-overview/.

It loads a site in a browser control, which is almost the same thing I want to do.

3 Answers


  1. Take a look at using the file_get_contents function in PHP.

    You may have better success in retrieving the HTML for a given site like this:

    $html = file_get_contents('http://www.ebay.com');
    
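    Once you have the HTML server-side, you can parse it and pull out the area the user asked for. Below is a minimal sketch using DOMDocument and DOMXPath; the class name 's-item__title' is only a placeholder for whatever selector your user supplies, not something from the original answer:

    <?php
        // Fetch the page server-side; @ suppresses warnings if the request fails.
        $html = @file_get_contents('http://www.ebay.com');
        if ($html === false) {
            die('Could not fetch the page.');
        }

        // Real-world HTML is rarely well-formed, so suppress parser warnings.
        $doc = new DOMDocument();
        @$doc->loadHTML($html);

        // Find every element carrying the user-supplied class name.
        $xpath = new DOMXPath($doc);
        $class = 's-item__title';   // placeholder for the user's input
        $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' " . $class . " ')]";
        foreach ($xpath->query($query) as $node) {
            echo trim($node->textContent) . "\n";
        }
    ?>
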
  2. You can sub in whatever element you’re looking for in the second foreach loop of the following script. As-is, the script will crawl up to 100 domains starting from cnn’s homepage and write the results to a text file named “cnnLinks.txt” in the same folder as the script.

    Just change the $pre, $base, and $post variables to whatever URL you want to crawl! I separated them like that so you can switch between common websites faster; see the sketch after the script for an example.

    <?php
        set_time_limit(0);
        $pre = "http://www.";
        $base = "cnn";
        $post = ".com";
        $domain = $pre.$base.$post;
        $content = "google-analytics.com/ga.js";
        $content_tag = "script";
        $output_file = "cnnLinks.txt";
        $max_urls_to_check = 100;
        $rounds = 0;
        $domain_stack = array();
        $max_size_domain_stack = 1000;
        $checked_domains = array();
        while ($domain != "" && $rounds < $max_urls_to_check) {
            $doc = new DOMDocument();
            @$doc->loadHTMLFile($domain);
            // Check whether the current page contains the target content.
            $found = false;
            foreach ($doc->getElementsByTagName($content_tag) as $tag) {
                if (strpos($tag->nodeValue, $content) !== false) {
                    $found = true;
                    break;
                }
            }
            $checked_domains[$domain] = $found;
            // Queue up external domains linked from this page.
            foreach ($doc->getElementsByTagName('a') as $link) {
                $href = $link->getAttribute('href');
                if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
                    $href_array = explode("/", $href);
                    if (count($domain_stack) < $max_size_domain_stack &&
                        !isset($checked_domains["http://".$href_array[2]])) {
                        array_push($domain_stack, "http://".$href_array[2]);
                    }
                }
            }
            // Move on to the next unchecked domain, if any remain.
            $domain_stack = array_unique($domain_stack);
            $domain = isset($domain_stack[0]) ? $domain_stack[0] : "";
            unset($domain_stack[0]);
            $domain_stack = array_values($domain_stack);
            $rounds++;
        }

        // Write out every domain on which the target content was found.
        $found_domains = "";
        foreach ($checked_domains as $key => $value) {
            if ($value) {
                $found_domains .= $key."\n";
            }
        }
        file_put_contents($output_file, $found_domains);
    ?>
    
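    For instance, here is a hedged sketch of the configuration block changed to start from a different site and search for a different string; the site and string below are only examples, not part of the original answer:

    <?php
        // Example configuration swap: start from bbc.com and look for a
        // different string inside <script> tags; the rest of the script
        // above stays exactly the same.
        $pre = "http://www.";
        $base = "bbc";                  // assumed example site
        $post = ".com";
        $domain = $pre.$base.$post;
        $content = "doubleclick.net";   // assumed string to search for
        $content_tag = "script";
        $output_file = "bbcLinks.txt";
    ?>
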
  3. You can’t crawl a site on the client-side if the target website is returning the "X-Frame-Options: SAMEORIGIN" response header (see @mc10’s duplicate link in the question comments). You must crawl the target site using server-side functionality.

    The following solution might be suitable if wget has all of the options that you need. wget -r will recursively crawl a site and download its documents. It has many useful options, such as translating absolute embedded URLs into relative, local ones.

    Note: wget must be installed on your system for this to work. I don't know which operating system you're running, but on Ubuntu you can install it with sudo apt-get install wget.

    See wget --help for additional options.

    <?php
    
        $website_url = $_GET['user_input_url'];
    
        //doesn't work for ipv6 addresses
        //http://php.net/manual/en/function.filter-var.php
        if( filter_var($website_url, FILTER_VALIDATE_URL) !== false ){
    
            $command = "wget -r " + escapeshellarg( $website_url );
            system( $command );
    
            //iterate through downloaded files and folders
    
        }else{
            //handle invalid url        
    
        }
    
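    As a rough follow-on, not part of the original answer: by default wget -r saves the downloaded documents under a directory named after the host, so, continuing from the validated $website_url above, you could walk the files with something like this:

    <?php
        // Assumes wget -r saved the site under a folder named after the host,
        // and that $website_url has already been validated above.
        $host = parse_url($website_url, PHP_URL_HOST);
        if (!empty($host) && is_dir($host)) {
            $files = new RecursiveIteratorIterator(
                new RecursiveDirectoryIterator($host, FilesystemIterator::SKIP_DOTS)
            );
            foreach ($files as $file) {
                if ($file->isFile()) {
                    echo $file->getPathname() . "\n";   // parse each downloaded document here
                }
            }
        }
    ?>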