I’m new to working with XML files and I’m stuck on an issue.

I have a MySQL query which fetches url data, nearly 5000 rows (1 row contains 1 url).
So I’ve implemented a cron which fetches 1000 rows at a time from MySQL with pagination. I need to do some validations on the urls and should append the valid urls to an XML file.

Here is my code:

public function urlcheck()
    {
        $xFile = $this->base_path."sitemap/path/urls.xml";
        $page = 0;
        $cache_key = 'valid_urls';
        $page = $this->cache->redis->get($cache_key);
        if(!$page){
            $page=0;
        }

        $xFile = simplexml_load_file($xFile);

        $this->load->model('productnew/productnew_es6_m');
        $urls= $this->db->query("SELECT url FROM product_data where `active` = 1 limit ".$page.",1000")->result();

        $dom = new DOMDocument('1.0','UTF-8');
        $dom->formatOutput = true;      
        $root = $dom->createElement('urlset');
        $root->setAttribute('xsi:schemaLocation', 'http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd');
        $root->setAttribute('xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance');
        $root->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
        $dom->appendChild($root);
        

        foreach($urls as $val)
        {   
            // validations here 
            $url = $dom->createElement('url');
            $root->appendChild($url);

            $lastmod = $dom->createElement('lastmod', date("Y-m-d"));
            $url->appendChild($lastmod);

            $page++;
        }

        $dom->saveXML();
        $dom->save($xFile) or die('XML Create Error');
        
        if(sizeof($urls) == 0){
            $page = 0;
        }
        print_r($page);
        $this->cache->redis->save($cache_key, $page, 432000);
        // echo '<xmp>'. $dom->saveXML() .'</xmp>';
        // $dom->saveXML();
        // $dom->save($xFile) or die('XML Create Error');
        
    }

After my first cron execution, 300 valid urls out of 1000 urls are saved to the XML file.
Now let’s say that in my second cron execution I have 200 valid urls out of 1000.

My expected result is that these 200 are appended to the existing XML file, so that it contains 500 valid urls in total, and that the XML file is refreshed (started over) after all 5000 urls have been processed, as I mentioned above.

But after executing the cron every time, old url data is being replaced with latest ones.

How do I save the url values without overwriting the XML?
Thank you in advance!

Answers


  1. As per the comment above, you are opening the file with one API (SimpleXML) but saving a new document with DOMDocument, thus overwriting the previous work. Without SimpleXML, perhaps you can try something like this, though it is untested.

    public function urlcheck(){
        
        $file=$this->base_path."sitemap/path/urls.xml";
        $cache_key='valid_urls';
        $page=$this->cache->redis->get($cache_key);
        
        if(!$page)$page=0;
        
        $dom=new DOMDocument('1.0','UTF-8');
        $dom->preserveWhiteSpace = false;
        $dom->formatOutput = true;
        
        # load the existing document (if any) so that new nodes are appended to it
        if( file_exists( $file ) )$dom->load( $file );
        
        # a DOMNodeList object is never `empty`, so test its length instead
        $col=$dom->getElementsByTagName('urlset');
        if( $col->length > 0 )$root=$col->item(0);
        else{
            $root=$dom->createElement('urlset');
            $dom->appendChild( $root );
            
            $root->setAttribute('xsi:schemaLocation','http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd');
            $root->setAttribute('xmlns:xsi','http://www.w3.org/2001/XMLSchema-instance');
            $root->setAttribute('xmlns','http://www.sitemaps.org/schemas/sitemap/0.9');     
        }
        
        # does a `page` node exist - if so use the value as the $page variable
        $col=$dom->getElementsByTagName('page');
        if( $col->length > 0 )$page=intval( $col->item(0)->nodeValue );
        
        
        $this->load->model('productnew/productnew_es6_m');
        $urls=$this->db->query("SELECT `url` FROM `product_data` where `active` = 1 limit ".$page.",1000")->result();
        
        foreach( $urls as $val ){
            $url = $dom->createElement('url');
            $root->appendChild($url);
    
            $lastmod = $dom->createElement('lastmod', date("Y-m-d"));
            $url->appendChild($lastmod);
    
            $page++;
        }
        
        
        $node=$dom->createElement( 'page', $page );
        $root->insertBefore( $node, $root->firstChild );
        
        
        if( empty( $urls ) )$page=0;
        $dom->save( $file );
        $this->cache->redis->save( $cache_key, $page, 432000 );
    }
    
  2. Appending to the document looks fine, but you don’t open the file you want to append to from disk. Therefore on each page you start with 0 urls in the XML and append to the empty root node.

    But after executing the cron every time, old url data is being replaced with latest ones.

    This is exactly the behaviour you describe, and it sounds like you don’t load the XML file in the first place; you just write it.

    So perhaps the question is how to open an XML file; by your description, the appending already looks good.

    Let’s review, by reversing the introduction sentences of your question:

    I need to do some validations on the urls and should append the valid urls in an xml file.

    so i’ve implemented a cron which fetches 1000 rows at time from mysql with pagination.

    I have a mysql query which fetches url data nearly 5000 rows (1 row contains 1 url).

    On pages 2-5 the file to append each set of 1000 urls to is already on disk from the earlier runs, so you need to open it and append. If, however, a file is already on disk on page 1, it belongs to a previous cycle of pages 1-5, and there you start with a new document instead.

    So it looks like you have written the code only for when you’re on the first page – to create a new document (and append to it).

    And despite your question, appending does work; you wrote it yourself:

    old url data is being replaced with latest ones.

    The only thing that does not work is opening the file on pages 2 – 5.

    So let’s rephrase the question: How to open an XML file?

    But first of all, the variable $page is not meant to stand for page as in page 1 – 5 above. It’s just a variable with a questionable name and $page stands for the number of URLs processed so far in the cycle and not for the page in the pagination.

    Regardless of its name, I’ll use it for its value for this answer.

    So now let’s open the existing document for appending when $page is not 0:

    ...
    
    $dom = new DOMDocument('1.0','UTF-8');
    $dom->formatOutput = true;
    
    if ($page !== 0) {
        $dom->load(dom_import_simplexml($xFile)->ownerDocument->documentURI);
    }
    
    
    $col=$dom->getElementsByTagName('urlset');
    
    ...
    

    Only on the first run will you have the described behaviour that the file is created anew, and in that case it’s fine (on the first run $page === 0).

    In any other case $page is not 0 and the file is opened from disk.

    I’ve left the other parts of your code alone so that this example is only introducing this 3-line if-clause.
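
    For comparison, here is a minimal, self-contained sketch of the same load-or-create idea written with DOMDocument only. The function name appendUrlsToSitemap, its parameters and the added loc element are assumptions for illustration and are not part of your code; pass true as the third argument where your code has $page === 0.

    ```php
    <?php
    // Hypothetical helper (not from the question): open the sitemap if it is
    // already on disk, otherwise start a new one, then append and save.
    function appendUrlsToSitemap(string $file, array $validUrls, bool $startFresh = false): int
    {
        $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

        $dom = new DOMDocument('1.0', 'UTF-8');
        $dom->preserveWhiteSpace = false; // lets formatOutput re-indent a loaded file
        $dom->formatOutput = true;

        if (!$startFresh && is_file($file) && $dom->load($file)) {
            // Later runs of a cycle: keep what earlier runs have written.
            $root = $dom->documentElement;
        } else {
            // First run of a cycle (or no usable file): create an empty urlset.
            $root = $dom->createElementNS($ns, 'urlset');
            $dom->appendChild($root);
        }

        foreach ($validUrls as $address) {
            $url = $dom->createElementNS($ns, 'url');
            $loc = $dom->createElementNS($ns, 'loc');
            // A text node escapes characters such as & in query strings.
            $loc->appendChild($dom->createTextNode($address));
            $url->appendChild($loc);
            $url->appendChild($dom->createElementNS($ns, 'lastmod', date('Y-m-d')));
            $root->appendChild($url);
        }

        if ($dom->save($file) === false) {
            throw new RuntimeException("Could not write $file");
        }

        // Total number of url entries now stored in the file.
        return $dom->getElementsByTagName('url')->length;
    }
    ```

    Calling it once with two urls and then again with one more url on the same file should leave three url entries in the file, because the second call loads the existing document before appending.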

    The load($file) method of DOMDocument is documented in the PHP manual, just in case you have missed it so far.

    Try not to re-use the same variable names if you want to come up to speed. Here I had to recycle a whole SimpleXMLElement and import it into DOM only to obtain the original XML file path to open the document, because the path was no longer available as a plain string although it once was under the variable $xFile. But that is just a comment in the margin.

    And as you’re already using Redis, you perhaps may want to queue the URLs into it and process from there, then you’ll likely not need the database paging. See Lists of the Redis Data-Types.

    You can then also put the good URLs in there in a second list.

    With two lists you can even constantly check the progress in Redis directly.

    And when finally done, you can write the whole file at once in one transaction out of the good URLs in Redis.
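
    A rough sketch of that two-list idea follows. It assumes a direct phpredis client (an object offering lPop, rPush and lRange) rather than the cache wrapper used in your code; the key names urls:pending and urls:good, the function names and the $isValid callback are placeholders I made up.

    ```php
    <?php
    // Assumption: $redis behaves like a connected phpredis Redis instance.
    // Moves up to $batch urls from the pending list, keeping the valid ones.
    function processPending($redis, callable $isValid, int $batch = 1000): int
    {
        $done = 0;
        // lPop returns false once the pending list is empty.
        while ($done < $batch && ($url = $redis->lPop('urls:pending')) !== false) {
            if ($isValid($url)) {
                $redis->rPush('urls:good', $url); // second list: validated urls
            }
            $done++;
        }
        return $done;
    }

    // Reads every validated url so the file can be written in one go.
    function goodUrls($redis): array
    {
        return $redis->lRange('urls:good', 0, -1);
    }
    ```

    Each cron run would call processPending() once; when it returns 0 the pending list is drained and the result of goodUrls() can be written to the XML file in a single pass.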

    If you want to throw some more (minimal) tech on it, take a look at Beanstalkd.
