
In PHP I need to highlight multiple given words in a string, for example by wrapping each match in an <em> tag.
But if a word ends in +, I cannot do it.

I understand that the problem below is that + is not a word character, which breaks the trailing \b word-boundary match. How can I write this so that it matches and wraps all given words, even when a given word ends in +?

$my_text = 'test c+ and javascript etc but NOT javascripter';

$words_to_highlight = array('javascript', 'c+');


foreach ($words_to_highlight as $word) {

    // escape the + so it is treated as a literal character
    $search_pattern = str_replace('+', '\+', $word);

    // with \b on both sides, "c+" is not matched
    echo "\n".preg_replace("/\b(".$search_pattern.")\b/i", '<em>$1</em>', $my_text);

    // works if I remove the \b, but I don't want to match "javascript" inside "javascripter"
    echo "\n".preg_replace("/(".$search_pattern.")/i", '<em>$1</em>', $my_text);

}

Output is:

test c+ and <em>javascript</em> etc but NOT javascripter
test c+ and <em>javascript</em> etc but NOT <em>javascript</em>er

test c+ and javascript etc but NOT javascripter
test <em>c+</em> and javascript etc but NOT javascripter

The result I want is:

test <em>c+</em> and <em>javascript</em> etc but NOT javascripter

2 Answers


  1. Instead of word boundaries, you can use whitespace boundaries in the form of lookarounds: (?<!\S) asserts that there is no non-whitespace character directly to the left, and (?!\S) asserts the same to the right.

    For escaping characters which are part of the regex syntax, you can use preg_quote.
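
    As a small illustration of what preg_quote does, the sketch below escapes a few sample words. The word 'a/b' is a hypothetical addition (it is not in the question) that shows why passing the pattern delimiter as the second argument can matter.

    ```php
    <?php
    // preg_quote() escapes regex metacharacters; the optional second argument
    // names the pattern delimiter so that it gets escaped as well.
    // 'a/b' is a made-up word, used only to show the delimiter being escaped.
    $words = array('javascript', 'c+', 'a/b');

    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);

    print_r($quoted);
    ```

    The quoted values are javascript, c\+ and a\/b, so each one can be placed inside a /.../ pattern as literal text.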

    To do the replacement with a single pattern that matches all the words, you can build the regex dynamically with a non-capturing group that lists all the alternatives, separated by a pipe character |

    The final pattern would look like this:

    (?<!\S)(?:javascript|c\+)(?!\S)
    

    See the matches in a regex demo and a PHP demo.

    Because the pattern matches nothing besides the word itself, you don’t need a capture group; the replacement can refer to the full match as $0.

    For example:

    $my_text = 'test c+ and javascript etc but NOT javascripter';
    $words_to_highlight = array('javascript', 'c+');
    
    $pattern = sprintf(
        "/(?<!\S)(?:%s)(?!\S)/i",
        implode('|', array_map("preg_quote", $words_to_highlight))
    );
    
    echo preg_replace($pattern, '<em>$0</em>', $my_text);
    

    Output

    test <em>c+</em> and <em>javascript</em> etc but NOT javascripter
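
    If this is needed in more than one place, the same idea could be wrapped in a function. The following is only a sketch: the name highlight_words is hypothetical, and compared with the snippet above it additionally passes '/' to preg_quote (so a word containing a slash cannot end the pattern early) and returns the text unchanged when the word list is empty.

    ```php
    <?php
    // Hypothetical helper: wraps each listed word in <em> when it is delimited
    // by whitespace or by the start/end of the string.
    function highlight_words($text, array $words)
    {
        if (count($words) === 0) {
            return $text; // an empty alternation would match empty strings
        }

        // Escape regex metacharacters and the "/" delimiter in every word.
        $alternatives = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $words);

        $pattern = '/(?<!\S)(?:' . implode('|', $alternatives) . ')(?!\S)/i';

        return preg_replace($pattern, '<em>$0</em>', $text);
    }

    echo highlight_words(
        'test c+ and javascript etc but NOT javascripter',
        array('javascript', 'c+')
    );
    ```

    For the question's input this prints the same line as the output above.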
    
  2. Another idea is to write a small helper function that generates a regex pattern from each word and adds a word boundary only where the word starts (^) or ends ($) with a word character.

    function w_to_regex ($w) {
      // insert \b at the start/end only where the word has a word character there
      return '/'.preg_replace('/^\b|$\b/', '\b', preg_quote($w,'/')).'/i';
    }
    

    In addition, preg_quote() is used to escape regex metacharacters. The resulting patterns look like this:

    • /\bjavascript\b/i
    • /\bc\+/i

    Add the u flag if the text is UTF-8. You can use array_map to generate the patterns and pass them to preg_replace.

    $my_text = preg_replace(array_map('w_to_regex', $words_to_highlight), '<em>$0</em>', $my_text);
    

    See a PHP demo at tio.run – the result will be like this:
    test <em>c+</em> and <em>javascript</em> etc but NOT javascripter

    Note that one difference between this answer and the 4th bird’s is that this one would also highlight c+ inside c++, which may or may not be desired. Either way, you now have a few different options and ideas.
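
    To make that difference concrete, here is a rough comparison of the two kinds of pattern. The sample sentence is made up for illustration, and the patterns are written out by hand rather than generated.

    ```php
    <?php
    // Made-up sample text with "c++" and with a comma right after "javascript".
    $text = 'c+ or c++, plus javascript, javascripter';

    // Whitespace boundaries: a match must be surrounded by whitespace
    // or by the start/end of the string.
    $whitespace = '/(?<!\S)(?:javascript|c\+)(?!\S)/i';

    // Conditional word boundaries: \b only next to word characters.
    $word_boundary = array('/\bjavascript\b/i', '/\bc\+/i');

    $by_whitespace    = preg_replace($whitespace, '<em>$0</em>', $text);
    $by_word_boundary = preg_replace($word_boundary, '<em>$0</em>', $text);

    echo $by_whitespace, "\n";
    echo $by_word_boundary, "\n";
    ```

    The whitespace version highlights only the standalone c+ here, because "javascript," is followed by a comma and "c++," by another +. The word-boundary version highlights the standalone c+, the c+ inside c++ and the javascript before the comma. Also keep in mind that with an array of patterns each pattern runs on the result of the previous one, so a word such as "em" could match inside tags that were already inserted.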
