I am using the following code to search for and highlight accented text. The problem is that it strips the accents from the text while highlighting. Is there any way to keep the accents?
echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

function highlightTerm($text, $keyword) {
    $text = iconv('utf-8', 'ISO-8859-1//IGNORE', Normalizer::normalize($text, Normalizer::FORM_D));
    $words = explode(" ", $keyword);
    $p = implode('|', array_map('preg_quote', $words));
    return preg_replace(
        "/($p)/ui",
        '<span style="background:yellow;">$1</span>',
        $text
    );
}
3 Answers
A simple replace will not work for this. You have to split the text into words and compare the normalized words. You should use DOM to iterate and replace the text nodes. This avoids replacing the terms inside other node types (attributes, comments, …) and takes care of escaping.
Splitting could be done with a regular expression; however, there is a specific tool for it in the ext/intl extension called IntlBreakIterator. The extension has a Collator for string comparison, too.
Here is an example for whole words:
Output:
Extending this to partial matches is possible, but it can get complex. You would have to simplify the current word (and keep track of the position) until it matches a term, then build the output fragment.
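As a rough illustration of the whole-word idea described in this answer, here is a minimal sketch. It assumes ext/intl is installed, works on plain text rather than a DOM tree (it escapes the output with htmlspecialchars() instead), and uses a hypothetical helper name, highlightWholeWords(), and an 'en_US' default locale that are not from the original post. A Collator set to primary strength compares base letters only, so accents and case are ignored.

```php
<?php
// Sketch only: highlightWholeWords() is a hypothetical helper, not the answer's original code.
function highlightWholeWords(string $text, array $terms, string $locale = 'en_US'): string
{
    // PRIMARY strength compares base letters only, ignoring accents and case.
    $collator = new Collator($locale);
    $collator->setStrength(Collator::PRIMARY);

    // Split the text at word boundaries.
    $breaker = IntlBreakIterator::createWordInstance($locale);
    $breaker->setText($text);

    $html = '';
    // The parts iterator yields every segment: words, spaces and punctuation.
    foreach ($breaker->getPartsIterator() as $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $isMatch = false;
        foreach ($terms as $term) {
            if ($collator->compare($part, $term) === 0) {
                $isMatch = true;
                break;
            }
        }
        $html .= $isMatch
            ? '<span style="background:yellow;">' . $escaped . '</span>'
            : $escaped;
    }
    return $html;
}

echo highlightWholeWords('Would you like a café, Mister Kàpêk?', ['cafe', 'kapek']);
```

Because every segment is compared as a whole, a partial term such as "caf" does not match "café" here. For HTML input, the same comparison would be applied to each text node found via DOM, as the answer recommends.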
Here is a not-so-pretty approach to isolate the search terms in the normalized input string, then perform multibyte-safe surgery on the original string based on the offsets of the matches and the lengths of substrings.
I replaced your pattern delimiters with a symbol that preg_quote() will escape by default.
The replacements must be done in reverse so that the offset and length calculations are not skewed.
Normally this sort of task calls for preg_replace_callback(), but because the search is on the normalized string and the replacement is on the original string, the replacement step must be separated from the matching step.
I used strtr() to brute-force the normalization because I am not sure of the most reliable way to normalize accented characters. Feel free to replace that subprocess.
Code: (Demo)
With:
Output:
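A minimal sketch of the offset-based approach described in this answer might look like the following. The function name highlightByOffsets() and the deliberately small accent map are illustrative assumptions, not the answer's original code. The map is strictly one character to one character, so character positions in the normalized string line up with those in the original; ext/mbstring is assumed for the multibyte-safe slicing.

```php
<?php
// Sketch only: the function name and the tiny accent map are illustrative assumptions.
function highlightByOffsets(string $text, string $keywords): string
{
    // One-to-one character map, so character positions stay aligned after normalization.
    $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ä' => 'A',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
    ];
    $normalized = strtr($text, $map);

    $words = preg_split('/\s+/', $keywords, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return $text;
    }
    // Quote each term for the "#" delimiter explicitly.
    $quoted = array_map(fn(string $w): string => preg_quote($w, '#'), $words);
    $pattern = '#' . implode('|', $quoted) . '#ui';

    // Match on the normalized copy and record where each match starts.
    if (!preg_match_all($pattern, $normalized, $matches, PREG_OFFSET_CAPTURE)) {
        return $text;
    }

    // Work backwards so earlier offsets are not shifted by the inserted markup.
    foreach (array_reverse($matches[0]) as [$match, $byteOffset]) {
        // PCRE reports byte offsets; convert them to character offsets.
        $start = mb_strlen(substr($normalized, 0, $byteOffset), 'UTF-8');
        $length = mb_strlen($match, 'UTF-8');
        $text = mb_substr($text, 0, $start, 'UTF-8')
            . '<span style="background:yellow;">'
            . mb_substr($text, $start, $length, 'UTF-8')
            . '</span>'
            . mb_substr($text, $start + $length, null, 'UTF-8');
    }
    return $text;
}

echo highlightByOffsets('Would you like a café, Mister Kàpêk?', 'kape caf');
```

With the question's input, the spans wrap "caf" and "Kàpê" and the accents in the original text are preserved. Any accented letter missing from the map is simply left as is and will not match its unaccented form.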
Instead of normalizing the text, you can use the tedious approach of creating a dynamic, accent-agnostic regex pattern and then directly perform replacements on the input string.
The regex map (based on the second code block of this answer):
Code: (Demo)
Output:
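To illustrate the accent-agnostic pattern idea from this answer, here is a minimal sketch. The function name highlightAccentAgnostic() and the short letter map are assumptions made for illustration; a real map would cover many more letters. Each base letter in a search term is expanded into a character class of its accented variants, and the resulting pattern is applied directly to the original text, so no normalization of the input is needed. ext/mbstring is assumed for lowercasing the terms.

```php
<?php
// Sketch only: the function name and the short letter map are illustrative assumptions.
function highlightAccentAgnostic(string $text, string $keywords): string
{
    // Base letter => character class of accented variants (far from complete).
    $map = [
        'a' => '[aàáâãäå]',
        'c' => '[cç]',
        'e' => '[eèéêë]',
        'i' => '[iìíîï]',
        'o' => '[oòóôõö]',
        'u' => '[uùúûü]',
    ];

    // Lowercase the terms so the lowercase map keys apply; the "i" flag restores case-insensitivity.
    $words = preg_split('/\s+/', mb_strtolower($keywords, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return $text;
    }

    // Quote first, then expand letters; strtr() does not rescan text it has already replaced.
    $alternatives = array_map(
        fn(string $word): string => strtr(preg_quote($word, '#'), $map),
        $words
    );
    $pattern = '#' . implode('|', $alternatives) . '#ui';

    // Fall back to the untouched text if the pattern fails (for example on invalid UTF-8).
    return preg_replace($pattern, '<span style="background:yellow;">$0</span>', $text) ?? $text;
}

echo highlightAccentAgnostic('Would you like a café, Mister Kàpêk?', 'kape caf');
```

Because the match runs on the original string, the replacement can stay inside a single preg_replace() call and no offset bookkeeping is required. The trade-off is the size of the map and of the generated pattern.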