Not able to extract the link in node "<enc:enclosure rdf:resource..." from xml file with PHP

SamuElias
February 20, 2023

I'd like to parse an XML string (an RSS feed) with PHP and display the image link stored in this node: "<enc:enclosure rdf:resource="https://www.science.org/… .jpg".
It looks to me as if two different namespaces are involved (enc for the element, rdf for the attribute), and so far I have found no similar question or example that gets this working.
The simplified code below shows what works as expected, and that the link in the node "<enc:enclosure rdf:resource="https://www.science.org/… .jpg" is not displayed.

<?php
$xml_string = '<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:enc="http://purl.oclc.org/net/rss_2.0/enc/" xmlns:cc="http://web.resource.org/cc/" xmlns="http://purl.org/rss/1.0/">
<item>
      <title><![CDATA[Canada moves to ban funding for ‘risky’ foreign collaborations]]></title>
      <link>https://www.science.org/content/article/canada-moves-ban-funding-risky-foreign-collaborations</link>
      <description><![CDATA[China is seen as main target in rejecting joint projects with certain foreign entities]]></description>
      <enc:enclosure rdf:resource="https://www.science.org/do/10.1126/science.adh2317/rss/_20230217_nid_canada_china.jpg" enc:length="165061" enc:type="image/jpeg" />
      <dc:title><![CDATA[Canada moves to ban funding for ‘risky’ foreign collaborations]]></dc:title>
      <dc:identifier>doi:10.1126/science.adh2317</dc:identifier>
      <dc:date>2023-02-17T05:55:00Z</dc:date>
      <dc:creator>Jeffrey Mervis</dc:creator>
      <prism:publicationName><![CDATA[Canada moves to ban funding for ‘risky’ foreign collaborations]]></prism:publicationName>
      <prism:coverDate>2023-02-17T05:55:00Z</prism:coverDate>
      <prism:coverDisplayDate>2023-02-17T05:55:00Z</prism:coverDisplayDate>
      <prism:doi>10.1126/science.adh2317</prism:doi>
      <prism:url>https://www.science.org/content/article/canada-moves-ban-funding-risky-foreign-collaborations</prism:url>
</item></rdf:RDF>';
$xml = simplexml_load_string($xml_string);

foreach ($xml->item as $item) {
    if ($item->children('http://purl.oclc.org/net/rss_2.0/enc/')) {
        foreach ($item->children('http://purl.oclc.org/net/rss_2.0/enc/') as $eintrag1) {
            echo '<pre>'; print_r($eintrag1); echo '</pre>'; // is working
            echo 'Length: ' . $eintrag1['length'] . '<br />'; // is working
            $eintrag2 = $eintrag1->children('http://www.w3.org/1999/02/22-rdf-syntax-ns#');
            echo '<pre>'; print_r($eintrag2); echo '</pre>'; // is working
            echo 'Resource: ' . $eintrag2['resource'] . '<br />'; // NOT working!!! Only empty output, but it's the link I would like to extract!
        }
    }
}
?>

It looks simple, and I thought my limited PHP skills were enough for problems like this, but none of my approaches (e.g. DOM, SimpleXML, XPath) gave me the desired result.
If someone finds the time to help me find the answer, I would appreciate it very much. Thanks in advance.

Answers

- Kazz
- February 20, 2023 at 5:19 pm
SimpleXML's namespace handling is easy to get wrong; try DOM + DOMXPath instead:
```
$dom = new DOMDocument;
$dom->loadXML(file_get_contents('https://www.science.org/rss/news_current.xml'));
$dxp = new DOMXPath($dom);
foreach($dxp->query('//enc:enclosure') as $enclosure) {
    echo 'Resource: ' . $enclosure->getAttribute('rdf:resource') . '<br />';
}
```

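SimpleXML does some implicit namespace switching. You already used the explicit syntax for the children; you can do the same for the attributes by passing the namespace URI to attributes(). The rdf:resource value is an attribute of the enclosure element, not a child element, which is why children() returns nothing for it.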

However, I suggest defining a constant or variable that holds all the namespaces you are using. This will make your code a lot more readable. The keys can be different from the prefixes in the document.

$xmlns = [
    'enc' => 'http://purl.oclc.org/net/rss_2.0/enc/',
    'rdf' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    // defined in the XML as namespace for elements without a prefix
    'rss' => 'http://purl.org/rss/1.0/',
];

$rdf = simplexml_load_string(getXMLString());

foreach ($rdf->children($xmlns['rss'])->item as $item) {
    foreach ($item->children($xmlns['enc'])->enclosure as $enclosure) {
        echo 'Length: ' . $enclosure->attributes($xmlns['enc'])['length'] . "\n";
        echo 'Resource: ' . $enclosure->attributes($xmlns['rdf'])['resource'] . "\n";
    } 
}

function getXMLString() {
    return <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:enc="http://purl.oclc.org/net/rss_2.0/enc/" xmlns:cc="http://web.resource.org/cc/" xmlns="http://purl.org/rss/1.0/">
<item>
      <title><![CDATA[Canada moves to ban funding for ‘risky’ foreign collaborations]]></title>
      <link>https://www.science.org/content/article/canada-moves-ban-funding-risky-foreign-collaborations</link>
      <description><![CDATA[China is seen as main target in rejecting joint projects with certain foreign entities]]></description>
      <enc:enclosure rdf:resource="https://www.science.org/do/10.1126/science.adh2317/rss/_20230217_nid_canada_china.jpg" enc:length="165061" enc:type="image/jpeg" />
      <dc:title><![CDATA[Canada moves to ban funding for ‘risky’ foreign collaborations]]></dc:title>
      <dc:identifier>doi:10.1126/science.adh2317</dc:identifier>
      <dc:date>2023-02-17T05:55:00Z</dc:date>
      <dc:creator>Jeffrey Mervis</dc:creator>
      <prism:publicationName><![CDATA[Canada moves to ban funding for ‘risky’ foreign collaborations]]></prism:publicationName>
      <prism:coverDate>2023-02-17T05:55:00Z</prism:coverDate>
      <prism:coverDisplayDate>2023-02-17T05:55:00Z</prism:coverDisplayDate>
      <prism:doi>10.1126/science.adh2317</prism:doi>
      <prism:url>https://www.science.org/content/article/canada-moves-ban-funding-risky-foreign-collaborations</prism:url>
</item></rdf:RDF>
XML;
}

Output:

Length: 165061
Resource: https://www.science.org/do/10.1126/science.adh2317/rss/_20230217_nid_canada_china.jpg
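
Since the question also mentions XPath, here is a minimal sketch of the same lookup using SimpleXML's own XPath support. The helper name enclosureResources(), the aliases e and r, and the shortened sample feed with its example.org URL are made up for illustration; only the two namespace URIs come from the feed in the question.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the
// rdf:resource value of every enc:enclosure element in the given XML string.
function enclosureResources(string $xmlString): array
{
    $rdf = simplexml_load_string($xmlString);
    if ($rdf === false) {
        return [];
    }

    // The aliases "e" and "r" are our own choice; XPath matches on the
    // namespace URI, so they do not have to equal the document's prefixes.
    $rdf->registerXPathNamespace('e', 'http://purl.oclc.org/net/rss_2.0/enc/');
    $rdf->registerXPathNamespace('r', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

    // xpath() may return false or null on error, so fall back to an empty list.
    $attributes = $rdf->xpath('//e:enclosure/@r:resource') ?: [];

    $resources = [];
    foreach ($attributes as $attribute) {
        // Each match is an attribute node; casting it to string yields its value.
        $resources[] = (string) $attribute;
    }
    return $resources;
}

// Shortened, made-up sample feed with the same structure as the one in the question.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:enc="http://purl.oclc.org/net/rss_2.0/enc/" xmlns="http://purl.org/rss/1.0/">
<item>
  <title>Example item</title>
  <enc:enclosure rdf:resource="https://example.org/image.jpg" enc:length="123" enc:type="image/jpeg" />
</item>
</rdf:RDF>
XML;

foreach (enclosureResources($sample) as $resource) {
    echo 'Resource: ' . $resource . "\n";
}
```

This keeps everything in SimpleXML, at the cost of registering the namespaces on the element before each query.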

DOM

DOM is more explicit and has a set of namespace aware methods with the suffix NS (for example getAttributeNS()).
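
As a rough sketch of those NS methods without any XPath: the helper name enclosureResourcesDom(), the constants, and the shortened sample feed are assumptions for illustration. The sample deliberately uses the prefixes e and r instead of enc and rdf to show that the lookup depends on the namespace URI only.

```php
<?php
// Namespace URIs taken from the feed in the question.
const NS_ENC = 'http://purl.oclc.org/net/rss_2.0/enc/';
const NS_RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

// Hypothetical helper (not part of the original answer): collects the
// rdf:resource attribute of every enclosure element using the *NS methods.
function enclosureResourcesDom(string $xmlString): array
{
    $document = new DOMDocument();
    if (!$document->loadXML($xmlString)) {
        return [];
    }

    $resources = [];
    // Match on namespace URI plus local name, independent of the prefix.
    foreach ($document->getElementsByTagNameNS(NS_ENC, 'enclosure') as $enclosure) {
        if ($enclosure->hasAttributeNS(NS_RDF, 'resource')) {
            $resources[] = $enclosure->getAttributeNS(NS_RDF, 'resource');
        }
    }
    return $resources;
}

// Made-up sample feed; note the prefixes "e" and "r" differ from the real feed.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<r:RDF xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:e="http://purl.oclc.org/net/rss_2.0/enc/" xmlns="http://purl.org/rss/1.0/">
<item>
  <title>Example item</title>
  <e:enclosure r:resource="https://example.org/image.jpg" e:length="123" e:type="image/jpeg" />
</item>
</r:RDF>
XML;

foreach (enclosureResourcesDom($sample) as $resource) {
    echo 'Resource: ' . $resource . "\n";
}
```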

DOMXPath::evaluate() accepts XPath expressions that return either node lists or scalar values (strings, numbers), so a single expression can fetch a node and cast it:

$xmlns = [
    'enc' => 'http://purl.oclc.org/net/rss_2.0/enc/',
    'rdf' => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    // defined in the XML as namespace for elements without a prefix
    'rss' => 'http://purl.org/rss/1.0/',
];

$document = new DOMDocument;
$document->loadXML(getXMLString());
$xpath = new DOMXPath($document);
// register the namespaces 
foreach ($xmlns as $alias => $uri) {
    $xpath->registerNamespace($alias, $uri);
}
$items = [];
foreach($xpath->evaluate('//rss:item') as $itemNode) {
    $items[] = [
        // fetch "{http://purl.org/rss/1.0/}title" as string
        'title' => $xpath->evaluate('string(rss:title)', $itemNode),
        'enclosure' => [
            // fetch the attribute values
            'resource' => $xpath->evaluate('string(enc:enclosure/@rdf:resource)', $itemNode),
            'length' => $xpath->evaluate('number(enc:enclosure/@enc:length)', $itemNode)
        ],
    ];
}
var_dump($items);

Output:

array(1) {
  [0]=>
  array(2) {
    ["title"]=>
    string(66) "Canada moves to ban funding for ‘risky’ foreign collaborations"
    ["enclosure"]=>
    array(2) {
      ["resource"]=>
      string(85) "https://www.science.org/do/10.1126/science.adh2317/rss/_20230217_nid_canada_china.jpg"
      ["length"]=>
      float(165061)
    }
  }
}
