
I have been given an unusual task: create XML without any CDATA sections from a multidimensional array, under these constraints:

  1. CDATA must not be used.
  2. The characters <, >, ', ", and & must be encoded.

But when I set nodeValue after converting ' and " to &apos; and &quot;, the escaped value is not retained in the XML node; the entities are unescaped again.

    private function convertElement(DOMElement $element, $value)
    {...
        $value = htmlspecialchars($value ?? '', ENT_XML1 | ENT_QUOTES);
        $element->nodeValue = $value;

The element does not retain the encoded value: &apos; and &quot; are decoded back to ' and ".

dd($element->nodeValue);
// &lt;p itemprop='description'&gt;'Test single quotes'&lt;/p&gt;
// Should have retained
// &lt;p itemprop=&quot;description&quot;&gt;'Test single quotes'&lt;/p&gt;

And if I use appendChild(), the values are double-escaped.

//value = Test Ampersand &
 $textNode = $this->document->createTextNode($value);  
 $element->appendChild($textNode); 

//Test Ampersand &amp;amp;

My issue seems to be related to the behavior described in "DOMElement nodeValue inconsistant get vs set".
I'd appreciate any workaround or suggestions.
Thanks!

2 Answers


  1. Please try this:

    function convertElement(DOMElement $element, $value) {
        $encodedValue = htmlspecialchars($value ?? '', ENT_XML1 | ENT_QUOTES);
        $textNode = $element->ownerDocument->createTextNode($encodedValue);
        while ($element->hasChildNodes()) {
            $element->removeChild($element->firstChild);
        }
        $element->appendChild($textNode);
        echo $element->ownerDocument->saveXML($element);
    }
    $doc = new DOMDocument();
    $doc->loadXML('<root><element><![CDATA[<p itemprop="description">\'Test single quotes\'</p>]]></element></root>');
    
    $element = $doc->getElementsByTagName('element')->item(0);
    $cdataContent = $element->nodeValue;
    convertElement($element, $cdataContent);
    echo $doc->saveXML();
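
    One caveat, shown as a minimal sketch: createTextNode() stores its argument as literal text, and saveXML() escapes that text when serializing. Pre-escaping with htmlspecialchars() therefore escapes the ampersand twice, which is the double-escaping described in the question. The element names and the sample string below are illustrative only.

    ```php
    <?php
    $doc = new DOMDocument();
    $root = $doc->appendChild($doc->createElement('root'));
    $value = 'Test Ampersand &';

    // Raw value: the serializer escapes the ampersand exactly once.
    $raw = $root->appendChild($doc->createElement('element'));
    $raw->appendChild($doc->createTextNode($value));
    $rawXml = $doc->saveXML($raw);

    // Pre-escaped value: the "&" of "&amp;" is escaped a second time.
    $pre = $root->appendChild($doc->createElement('element'));
    $pre->appendChild($doc->createTextNode(htmlspecialchars($value, ENT_XML1 | ENT_QUOTES)));
    $preXml = $doc->saveXML($pre);

    echo $rawXml, "\n"; // <element>Test Ampersand &amp;</element>
    echo $preXml, "\n"; // <element>Test Ampersand &amp;amp;</element>
    ```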
    
  2. CDATASection nodes handle encodings differently, so it is understandable to avoid them. They exist mostly for backwards compatibility and human readability.

    Quotes only need to be escaped inside an attribute value that is delimited by the same quote character. The DOM serializer avoids unnecessary escaping, so your only option would be to write your own serializer.
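
    As a rough sketch of what such a hand-rolled serialization could look like (an assumption, not part of the DOM API): XMLWriter::writeRaw() emits a string verbatim, so text that was pre-escaped with htmlspecialchars() keeps its &quot; and &apos; entities. The helper name escapeAll and the element names are hypothetical, and writeRaw() performs no checks, so the caller is responsible for well-formedness.

    ```php
    <?php
    // Hypothetical helper: escape all five XML special characters.
    function escapeAll(string $value): string
    {
        return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }

    $value = '<p itemprop="description">\'Test & quotes\'</p>';

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('root');
    $writer->startElement('element');
    // writeRaw() does not escape again, so the entities survive as written.
    $writer->writeRaw(escapeAll($value));
    $writer->endElement();
    $writer->endElement();
    $writer->endDocument();

    $xml = $writer->outputMemory();
    echo $xml;
    ```

    Any XML parser reading this output decodes the entities back to the original characters, so consumers see the same text as they would with the DOM serializer's output.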

    The nodeValue property implementation in PHP does not match the DOM standard and is imho broken. It will only partially escape input.

    The original DOM standard required you to create and add a text node. Current DOM (and PHP) has the textContent property.

    Here is an example:

    $data = "\n, <, >, ', \n, \", &";
    
    $document = new DOMDocument();
    $root = $document->appendChild($document->createElement('foo'));
    
    // create a child element with a text node child
    $root
      ->appendChild($document->createElement('bar'))
      ->appendChild($document->createTextNode($data));
      
    // add a child element and set textContent
    $root
      ->appendChild($document->createElement('bar'))
      ->textContent = $data;
      
    // add a child element and set an attribute
    $root
      ->appendChild($document->createElement('bar'))
      ->setAttribute('a', $data);
      
    $document->formatOutput = true;
    echo $document->saveXML();
    

    Output:

    <?xml version="1.0"?>
    <foo>
      <bar>
    , &lt;, &gt;, ', 
    , ", &amp;</bar>
      <bar>
    , &lt;, &gt;, ', 
    , ", &amp;</bar>
      <bar a="&#10;, &lt;, &gt;, ', &#10;, &quot;, &amp;"/>
    </foo>
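
    The unescaped quotes in the text nodes are still well-formed XML. As a quick check, assuming the same $data string as in the example, parsing the serialized document back should return the original value for both the text content and the attribute:

    ```php
    <?php
    $data = "\n, <, >, ', \n, \", &";

    $document = new DOMDocument();
    $root = $document->appendChild($document->createElement('foo'));
    $bar = $root->appendChild($document->createElement('bar'));
    $bar->textContent = $data;
    $bar->setAttribute('a', $data);

    // Parse the serialized XML back into a fresh document.
    $parsed = new DOMDocument();
    $parsed->loadXML($document->saveXML());
    $parsedBar = $parsed->getElementsByTagName('bar')->item(0);

    var_dump($parsedBar->textContent === $data);       // bool(true)
    var_dump($parsedBar->getAttribute('a') === $data); // bool(true)
    ```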
    