In PHP, what is the fastest and simplest way to strip all HTML tags from a string except those in an allowed list, while also removing every HTML attribute from the tags that are kept?

The built-in function strip_tags would almost do the job, but it keeps the attributes of the tags in the allowed list.
I don't know whether regular expressions are the best approach, and I also don't know whether parsing the string would be too resource-intensive.
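
To make the problem concrete, here is a minimal sketch of the behavior described above. The input string and the allowed list `'<b>'` are made up for illustration.

```php
<?php

// Illustrative input; the allowed list '<b>' is an assumption for this demo.
$input = '<div class="x"><b id="12" onclick="alert(1)">world</b></div>';

// strip_tags() removes the <div> but leaves the allowed <b> tag untouched,
// including all of its attributes.
$output = strip_tags($input, '<b>');

echo $output, PHP_EOL;
// <b id="12" onclick="alert(1)">world</b>
```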

2 Answers


  1. A regular expression might fail if an attribute value contains a >.
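
    A small sketch of that failure mode. The pattern is a naive one invented for this demo (it is not taken from either answer), and the input is made up:

    ```php
    <?php

    // Naive pattern (an assumption for this demo): keep the tag name,
    // drop everything up to the next ">".
    $pattern = '/<(\w+)[^>]*>/';

    $input = '<b title="a > b">world</b>';

    // [^>]* stops at the first ">", which here sits inside the attribute
    // value, so the tail of the attribute leaks into the text.
    $output = preg_replace($pattern, '<$1>', $input);

    echo $output, PHP_EOL;
    // <b> b">world</b>
    ```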

    A safer way is to use DOMDocument, but note that the input should be valid HTML and that the output may be normalized (in the output below, the bare text 777 comes back wrapped in a <p>).

    <?php
    
    $htmlString = '<span>777</span><div class="hello">hello <b id="12">world</b></div>';
    $stripped = strip_tags($htmlString, '<div><b>');
    
    $dom = new DOMDocument;              // init new DOMDocument
    $dom->loadHTML($stripped);           // load the HTML
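    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//@*');      // select every attribute in the document
    foreach ($nodes as $node) {
        // remove each attribute from the element that owns it
        $node->parentNode->removeAttribute($node->nodeName);
    }
    
    // loadHTML wraps the fragment in <html><body>, so serialize only the children of <body>
    $cleanHtmlString = '';
    foreach ($dom->documentElement->firstChild->childNodes as $node) {
        $cleanHtmlString .= $dom->saveHTML($node);
    }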
    
    echo $cleanHtmlString;
    

    Output:

    <p>777</p>
    <div>hello <b>world</b>
    </div>
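
    If the extra <p> around bare text is unwanted, one possible variation is to wrap the fragment in a container element before loading it, then serialize only that container's children. This is a sketch under assumptions: the function name `stripTagsAndAttributes` is hypothetical, the wrapping `<div>` is an addition not present in the code above, and exact whitespace in the result can vary with the libxml version.

    ```php
    <?php

    // Hypothetical helper: strip disallowed tags, then drop every attribute.
    function stripTagsAndAttributes(string $html, string $allowedTags): string
    {
        $stripped = strip_tags($html, $allowedTags);

        // Collect parser warnings internally instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        // Assumption: a wrapper <div> keeps bare text from being placed in a <p>.
        $dom->loadHTML('<div>' . $stripped . '</div>');
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Remove every attribute from its owning element.
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//@*') as $attr) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }

        // The first <div> in document order is the wrapper added above.
        $wrapper = $dom->getElementsByTagName('div')->item(0);
        $result = '';
        foreach ($wrapper->childNodes as $child) {
            $result .= $dom->saveHTML($child);
        }
        return $result;
    }

    $clean = stripTagsAndAttributes(
        '<span>777</span><div class="hello">hello <b id="12">world</b></div>',
        '<div><b>'
    );
    echo $clean, PHP_EOL;
    ```

    The wrapper is only a convenience; it assumes the stripped fragment is balanced enough that the parser does not close the wrapper early.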
    
  2. First of all, strip_tags does not prevent XSS attacks, so from a security perspective I would not recommend it.

    However, here is an example of the solution I suggested in the comments. The trick is to use a special character to escape your allowed tags. This makes for a straightforward solution, as you can just use strip_tags.

    <?php
    
    $string = '<b class="hello">Hello, </b><a>world!</a>';
    
    // Map each allowed tag to a placeholder that strip_tags will not touch
    $allowed = array(
        'b' => chr(1) . 'b_open',
        '/b' => chr(1) . 'b_close',
        'i' => chr(1) . 'i_open',
        '/i' => chr(1) . 'i_close',
    );
    
    // Remove your special character from the input to prevent it from being injected
    $result = str_replace(chr(1), '', $string);
    
    // Escape the valid tags; (?![\w-]) keeps "b" from also matching <br> or <body>
    foreach ($allowed as $tag => $replacement) {
        $result = preg_replace('/<' . preg_quote($tag, '/') . '(?![\w-])[^>]*>/i', $replacement, $result);
    }
    
    // Call strip_tags
    $result = strip_tags($result);
    
    // Replace back
    foreach ($allowed as $tag => $replacement) {
        $result = str_replace($replacement, '<' . $tag . '>', $result);
    }
    
    echo($result);
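
    The placeholder map can also be generated from a plain list of tag names. A minimal sketch, assuming a hypothetical helper name `stripTagsKeepBare`; the `(?![\w-])` lookahead and the trailing marker on each placeholder are additions for this sketch, so that `i` does not match `<img>` and placeholders cannot run into following text.

    ```php
    <?php

    // Hypothetical helper: keep only the listed tags, without attributes.
    function stripTagsKeepBare(string $html, array $tags): string
    {
        $marker = chr(1);

        // Remove the marker from the input so it cannot be injected.
        $result = str_replace($marker, '', $html);

        foreach ($tags as $tag) {
            $name = preg_quote($tag, '/');
            // Opening tag, with or without attributes; the lookahead stops
            // "i" from matching <img> or <iframe>.
            $result = preg_replace(
                '/<' . $name . '(?![\w-])[^>]*>/i',
                $marker . $tag . '_open' . $marker,
                $result
            );
            // Closing tag.
            $result = preg_replace(
                '/<\/' . $name . '\s*>/i',
                $marker . $tag . '_close' . $marker,
                $result
            );
        }

        // Everything still written as a tag is not allowed, so strip it.
        $result = strip_tags($result);

        // Turn the placeholders back into bare tags.
        foreach ($tags as $tag) {
            $result = str_replace($marker . $tag . '_open' . $marker, '<' . $tag . '>', $result);
            $result = str_replace($marker . $tag . '_close' . $marker, '</' . $tag . '>', $result);
        }

        return $result;
    }

    $demo = stripTagsKeepBare('<b class="hello">Hello, </b><a>world!</a><img src="x">', ['b', 'i']);
    echo $demo, PHP_EOL;
    // <b>Hello, </b>world!
    ```

    Like any pattern built on `[^>]*`, this still mishandles an attribute value that contains a literal >, so it suits input where that cannot happen.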
    