In PHP I need to highlight multiple given words in a string, for example by wrapping each match in an <em> tag. But if a word ends in +, I cannot do it.
I understand that the problem below is that the plus sign is not a word character, so it breaks the \b word-boundary match. But how can I write this so that it matches and wraps all given words, even if a given word ends in +?
$my_text = 'test c+ and javascript etc but NOT javascripter';
$words_to_highlight = array('javascript', 'c+');
foreach ($words_to_highlight as $word) {
    $search_pattern = str_replace('+', '\+', $word);
    // this doesn't match "c+", so nothing is replaced
    echo "\n".preg_replace("/\b(".$search_pattern.")\b/i", '<em>$1</em>', $my_text);
    // works if I remove the \b, but I don't want to match "javascript" inside "javascripter"
    echo "\n".preg_replace("/(".$search_pattern.")/i", '<em>$1</em>', $my_text);
}
Output is:
test c+ and <em>javascript</em> etc but NOT javascripter
test c+ and <em>javascript</em> etc but NOT <em>javascript</em>er
test c+ and javascript etc but NOT javascripter
test <em>c+</em> and javascript etc but NOT javascripter
The result I want is:
test <em>c+</em> and <em>javascript</em> etc but NOT javascripter
2 Answers
Instead of using word boundaries, you can use whitespace boundaries in the form of lookarounds that assert there is no non-whitespace character to the left, (?<!\S), and to the right, (?!\S).
For escaping characters which are part of the regex syntax, you can use preg_quote.
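As a small illustration of the escaping step (a sketch, not part of the original answer): passing the pattern delimiter as the second argument makes preg_quote escape it as well. The `/` delimiter and the sample strings are assumptions chosen for the example.

```php
<?php
// The second argument names the regex delimiter so that it gets escaped too.
$escaped = preg_quote('c+', '/');
echo $escaped, "\n"; // prints: c\+

// A word that contains the delimiter and a dot (sample input, for illustration only).
echo preg_quote('a/b.c', '/'), "\n"; // prints: a\/b\.c
```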
To do the replacement with a single pattern that matches all the words, you can dynamically create the regex with a non-capturing group listing all the alternatives separated by a pipe character, |.
The final pattern would look like this:
(?<!\S)(?:javascript|c\+)(?!\S)
See the matches in a regex demo and a PHP demo.
As you are not matching other text, you don’t need a capture group and you can use the full match in the replacement, denoted by $0.
For example:
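A minimal sketch of the approach described above, reusing the variable names from the question. The function name highlight_words() and the `/` delimiter are assumptions made for this illustration, not something the answer prescribes.

```php
<?php
// highlight_words() is a hypothetical helper name used only for this sketch.
function highlight_words(string $text, array $words): string
{
    // Escape regex metacharacters (and the "/" delimiter) in every word,
    // then join the words into one alternation.
    $alternatives = implode('|', array_map(
        function ($word) {
            return preg_quote($word, '/');
        },
        $words
    ));

    // Whitespace boundaries: no non-whitespace character directly
    // before or after the match.
    $pattern = '/(?<!\S)(?:' . $alternatives . ')(?!\S)/i';

    // $0 is the full match, so no capture group is needed.
    return preg_replace($pattern, '<em>$0</em>', $text);
}

$my_text = 'test c+ and javascript etc but NOT javascripter';
$words_to_highlight = array('javascript', 'c+');

echo highlight_words($my_text, $words_to_highlight), "\n";
```

Because every word is tested in a single pass, text inserted by one replacement is never re-scanned for the other words.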
Output
test <em>c+</em> and <em>javascript</em> etc but NOT javascripter
Another idea is to make a little helper function that generates a regex pattern from each word and only adds a word boundary where there is a word character at the start (^) or the end ($) of the word. Further, preg_quote() is used to escape regex characters. The patterns will look like:
/\bjavascript\b/i
/\bc\+/i
Add the u flag if it’s UTF-8. You can use array_map to generate the patterns and replace. See a PHP demo at tio.run; the result will be like this:
test <em>c+</em> and <em>javascript</em> etc but NOT javascripter
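The helper described above could be sketched roughly as follows. The name word_pattern() and its exact structure are assumptions for illustration; the original demo may be written differently.

```php
<?php
// word_pattern() is a hypothetical name for the helper described above.
function word_pattern(string $word): string
{
    $pattern = preg_quote($word, '/');

    // Add \b only on a side where the word starts or ends with a word character.
    if (preg_match('/^\w/', $word)) {
        $pattern = '\b' . $pattern;
    }
    if (preg_match('/\w$/', $word)) {
        $pattern = $pattern . '\b';
    }

    return '/' . $pattern . '/i';
}

$my_text = 'test c+ and javascript etc but NOT javascripter';
$words_to_highlight = array('javascript', 'c+');

// Yields the two patterns /\bjavascript\b/i and /\bc\+/i
$patterns = array_map('word_pattern', $words_to_highlight);

// preg_replace accepts an array of patterns and applies them one after another.
$highlighted = preg_replace($patterns, '<em>$0</em>', $my_text);
echo $highlighted, "\n";
```

Since the patterns are applied in sequence, a later word could in principle match inside markup added by an earlier replacement (a word such as "em", for instance), so this sketch assumes the word list does not collide with the tag name.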
Note that one of the differences between my answer and the 4th bird’s is that this one would also highlight, e.g., c+ in c++, which may or may not be desired. Either way, you have some different options and ideas.
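To make that difference concrete, here is a short comparison of the two kinds of pattern on a sample string. The input 'c+ and c++' is made up for this illustration.

```php
<?php
$text = 'c+ and c++';

// Whitespace boundaries: the "c+" inside "c++" is followed by "+", so it is skipped.
$strict = preg_replace('/(?<!\S)c\+(?!\S)/i', '<em>$0</em>', $text);
echo $strict, "\n"; // prints: <em>c+</em> and c++

// Conditional word boundary: only the leading \b applies, so "c+" in "c++" matches.
$loose = preg_replace('/\bc\+/i', '<em>$0</em>', $text);
echo $loose, "\n"; // prints: <em>c+</em> and <em>c+</em>+
```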