Keep accented characters while highlighting text (wrapping in <span> tags) - PHP

user934820
November 12, 2022
3 Answers

I am using the following code to search for and highlight accented text. The problem is that it strips the accents from the text while highlighting. Is there any way to keep the accents?

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

function highlightTerm($text, $keyword) {
    $text = iconv('utf-8', 'ISO-8859-1//IGNORE', Normalizer::normalize($text, Normalizer::FORM_D));
    $words = explode(" ", $keyword);
    $p = implode('|', array_map('preg_quote', $words));
    return preg_replace(
        "/($p)/ui", 
        '<span style="background:yellow;">$1</span>', 
        $text
    );
}

Answers

A simple replace will not work for this. You have to split the text into words and compare the normalized words. You should use DOM to iterate and replace the text nodes. This avoids replacing the terms inside other node types (attributes, comments, …) and takes care of escaping.

Splitting could be done with a regular expression; however, there is a specific tool for it in the ext/intl extension called IntlBreakIterator. The extension also provides a Collator for string comparison.
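To make those two tools concrete, here is a minimal sketch (it assumes ext/intl is installed; the sample strings are only for illustration). At primary strength the collator ignores both accents and case, and the word breaker returns words, whitespace and punctuation as separate parts.

```php
<?php
// Requires ext/intl.
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);

// compare() returns 0 when the strings are equal at the chosen strength
$sameBase = $collator->compare('Kàpêk', 'kapek') === 0;

$breaker = IntlBreakIterator::createWordInstance('en_US');
$breaker->setText('a café, Mister');
// each part is a word, a run of whitespace, or a punctuation mark
$parts = iterator_to_array($breaker->getPartsIterator(), false);

var_dump($sameBase);
print_r($parts);
```

Because the parts cover the whole input, concatenating them gives back the original text, which is what allows the text node to be rebuilt piece by piece.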

Here is an example for whole words:

$html = <<<'HTML'
<div>
Would you like a café, Mister Kàpêk?
</div>
HTML;

// prepare the text breaker
$breaker = IntlBreakIterator::createWordInstance('en_US');
// prepare the compare
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);

// wrap terms for easy use
$terms = new Terms(
    function($word) use ($collator) {
        return $collator->getSortKey($word);
    },
    'cafe',
    'kapek'
);

// load HTML fragment into DOM
$document = new DOMDocument();
$document->loadHTML(
    "<?xml encoding='UTF-8'?>\n$html"
);
$xpath = new DOMXPath($document);

// iterate text nodes
foreach ($xpath->evaluate('//text()') as $textNode) {
    // feed text into word breaker
    $breaker->setText($textNode->textContent);
    // prepare a fragment for new nodes
    $fragment = $document->createDocumentFragment();
    $replace = false; 
    // iterate words
    foreach ($breaker->getPartsIterator() as $word) {
        // find word in terms
        $index = $terms->indexOf($word) + 1;
        if ($index > 0) {
            $replace = true;
            // wrap in a "span" element
            $span = $document->createElement('span');
            $span->textContent = $word;
            $span->setAttribute('class', 'term');
            $span->setAttribute('data-term-index', $index);
            $fragment->appendChild($span);
        } else {
            $fragment->appendChild($document->createTextNode($word));
        }
    }
    if ($replace) {
        // replace original text node with new fragment
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

// DOMDocument::loadHTML() will have wrapped the HTML to 
// create a whole document
$result = '';
foreach ($xpath->evaluate('//body/node()') as $node) {
    $result .= $document->saveHTML($node);
}
echo $result;

class Terms {

    private $_normalize;    
    private $_hashes;
    
    public function __construct(
        callable $normalize, 
        string ...$terms
    ) {
        $this->_normalize = $normalize;
        $this->_hashes = array_flip(
            array_map(
                function(string $term): string { 
                   $normalize = $this->_normalize;
                   return $normalize($term);
                },
                $terms
            )
        );
    }
    
    public function indexOf(string $word): int {
       $normalize = $this->_normalize;
       $hash = $normalize($word);
       return $this->_hashes[$hash] ?? -1;
    }
}

Output:

<div>
Would you like a <span class="term" data-term-index="1">café</span>, Mister <span class="term" data-term-index="2">Kàpêk</span>?
</div>

Extending this to partial matches is possible, but it can get complex. You would have to simplify the current word (and keep track of the position) until it matches a term, then build the output fragment.
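One way the position tracking could look, as a rough sketch rather than a finished solution: fold the word one character at a time and remember which original character each folded character came from. The helper names foldWithMap() and findPartial() are made up for this illustration, the term is assumed to be already lowercase and unaccented, and the word is assumed to be in NFC form (precomposed accents). It needs ext/intl and ext/mbstring.

```php
<?php
// Hypothetical helper: fold accents and case, recording for every folded
// character the index of the original character it came from.
function foldWithMap(string $text): array
{
    $folded = '';
    $map = [];
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $i => $char) {
        // decompose, drop combining marks, lowercase
        $base = Normalizer::normalize($char, Normalizer::FORM_D);
        $base = mb_strtolower(preg_replace('/\p{Mn}+/u', '', $base));
        $folded .= $base;
        for ($k = mb_strlen($base); $k > 0; $k--) {
            $map[] = $i;
        }
    }
    return [$folded, $map];
}

// Hypothetical helper: locate an already-folded term inside a word and
// return [start, length] in characters of the original word, or null.
function findPartial(string $word, string $term): ?array
{
    [$folded, $map] = foldWithMap($word);
    $pos = ($term === '') ? false : mb_strpos($folded, $term);
    if ($pos === false) {
        return null;
    }
    $start = $map[$pos];
    $end = $map[$pos + mb_strlen($term) - 1];
    return [$start, $end - $start + 1];
}

[$start, $length] = findPartial('Kàpêk', 'kape');
// prints the accented slice of the original word that matched
echo mb_substr('Kàpêk', $start, $length), "\n";
```

With the start and length known, the word can be split into up to three pieces (before, match, after), with only the middle piece wrapped in the span element.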

Here is a not-so-pretty approach to isolate the search terms in the normalized input string, then perform multibyte-safe surgery on the original string based on the offsets of the matches and the lengths of substrings.

I replaced your pattern delimiters with a symbol that preg_quote() will escape by default.
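A quick illustration of why the delimiter matters: without a second argument, preg_quote() leaves a forward slash alone, while # is escaped by default (since PHP 7.3).

```php
<?php
$raw = 'a/b#c';

// default call: "/" is untouched, "#" is escaped
$default = preg_quote($raw);

// passing the delimiter explicitly escapes it as well
$withSlash = preg_quote($raw, '/');

echo $default, "\n", $withSlash, "\n";
```

So either switch to # as the delimiter, as done here, or keep / and pass it as the second argument.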

The replacements must be done in reverse so that the offset and length calculations are not skewed.

Normally this sort of task calls for preg_replace_callback(), but because the search is on the normalized string and the replacement is on the original string, the replacement step must be separated from the matching step.
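The reverse-order point is easier to see on a plain ASCII string. This is only a small sketch with hard-coded offsets; the bracket markers stand in for the span tags.

```php
<?php
$text = 'one two three';
// [offset, length] pairs in the order a left-to-right search would find them
$matches = [[0, 3], [8, 5]];

// working from the last match backwards keeps the earlier offsets valid,
// because inserted characters only shift text that comes after them
foreach (array_reverse($matches) as [$offset, $length]) {
    $wrapped = '[' . substr($text, $offset, $length) . ']';
    $text = substr_replace($text, $wrapped, $offset, $length);
}

echo $text, "\n"; // [one] two [three]
```

Going front to back instead, the first replacement would add two characters and the stored offset 8 would no longer point at "three".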

I used strtr() to brute-force the normalization because I am not sure what the most reliable way to normalize accented characters is. Feel free to replace that subprocess.
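If ext/intl is available, one possible replacement for the map is a transliterator that decomposes each character, deletes the combining marks and recomposes the rest. This is a sketch with a made-up helper name, stripAccents(). It only covers accents that decompose into a base letter plus a mark, so letters such as ø or ł and the Cyrillic and Hebrew entries from the map are left unchanged.

```php
<?php
// Requires ext/intl. stripAccents() is a hypothetical helper name.
function stripAccents(string $text): string
{
    // decompose, delete nonspacing marks, recompose what is left
    $result = transliterator_transliterate(
        'NFD; [:Nonspacing Mark:] Remove; NFC',
        $text
    );
    // fall back to the input if the transliterator could not be created
    return $result === false ? $text : $result;
}

echo stripAccents('café, Mister Kàpêk'), "\n"; // cafe, Mister Kapek
```

For precomposed input this swaps one character for one character, which keeps the character offsets of the normalized string aligned with the original.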

Code:

define(
    'ACCENT_MAP',
    [
        "ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
        "А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
        "Б" => "B", "ב" => "B", "Þ" => "B",
        "Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
        "Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
        "È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
        "Ф" => "F", "Ƒ" => "F",
        "Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
        "ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
        "I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
        "Й" => "J", "Ĵ" => "J",
        "ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
        "Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
        "מ" => "M", "М" => "M", "ם" => "M",
        "Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ŉ" => "N", "Ň" => "N",
        "Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
        "פ" => "P", "ף" => "P", "П" => "P",
        "ק" => "Q",
        "Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
        "Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
        "Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
        "Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
        "В" => "V", "ו" => "V",
        "Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
        "Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
        "а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
        "б" => "b", "ב" => "b", "þ" => "b",
        "ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
        "Ч" => "ch", "ч" => "ch",
        "д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
        "è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
        "ф" => "f", "ƒ" => "f",
        "ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
        "ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
        "i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
        "й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
        "ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
        "ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
        "מ" => "m", "м" => "m", "ם" => "m",
        "ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ŉ" => "n", "ň" => "n",
        "ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
        "פ" => "p", "ף" => "p", "п" => "p",
        "ק" => "q",
        "ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
        "ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
        "т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
        "ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
        "в" => "v", "ו" => "v",
        "ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
        "ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
        "™" => "tm",
        "@" => "at",
        "Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
        "ĳ" => "ij", "Ĳ" => "ij",
        "я" => "ja", "Я" => "ja",
        "Э" => "je", "э" => "je",
        "ё" => "jo", "Ё" => "jo",
        "ю" => "ju", "Ю" => "ju",
        "œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
        "щ" => "sch", "Щ" => "sch",
        "ш" => "sh", "Ш" => "sh",
        "ß" => "ss",
        "Ü" => "ue",
        "Ж" => "zh", "ж" => "zh",
    ]);

With:

function highlightTerm($text, $keyword) {
    $unaccented = strtr($text, ACCENT_MAP);
    $words = explode(" ", $keyword);
    $regex = implode('|', array_map('preg_quote', $words));
    if (preg_match_all("#$regex#ui", $unaccented, $m, PREG_OFFSET_CAPTURE)) {
        foreach (array_reverse($m[0]) as [$match, $byteOffset]) {

            // PREG_OFFSET_CAPTURE reports byte offsets, so convert to a
            // character offset before using mb_substr()
            $offset = mb_strlen(substr($unaccented, 0, $byteOffset));

            // normalized length in characters; this lines up with the
            // original only where the map swaps one character for one
            $length = mb_strlen($match);

            // new multibyte-safe substring
            $tag = '<span style="background:yellow;">'
                . mb_substr($text, $offset, $length)
                . '</span>';

            // actual multibyte-safe replacement on original text
            $text = mb_substr($text, 0, $offset)
                . $tag
                . mb_substr($text, $offset + $length);
        }
    }
    return $text;
}

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

Output:

Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?

Instead of normalizing the text, you can use the tedious approach of creating a dynamic, accent-agnostic regex pattern and then directly perform replacements on the input string.

The regex map (based on the second code block of this answer):

define(
    'ACCENT_MAP',
    [
        "A" => "[AАĂǍĄÀÃÁÆÂÅǺĀא]",
        "B" => "[BБבÞ]",
        "C" => "[CĈĆÇЦצĊČץ]",
        "D" => "[DДĎĐדÐ]",
        "E" => "[EÈĘÉËÊЕĒĖĚĔЄƏע]",
        "F" => "[FФƑ]",
        "G" => "[GĞĠĢĜГגҐ]",
        "H" => "[HחĦХĤה]",
        "I" => "[IIÏÎÍÌĮĬIИĨǏיЇĪІ]",
        "J" => "[JЙĴ]",
        "K" => "[KĸכĶКך]",
        "L" => "[LŁĿЛĻĹĽל]",
        "M" => "[MמМם]",
        "N" => "[NÑŃНŅןŊנŉŇ]",
        "O" => "[OØÓÒÔÕОŐŎŌǾǑƠ]",
        "P" => "[PפףП]",
        "Q" => "[Qק]",
        "R" => "[RŔŘŖרР]",
        "S" => "[SŞŚȘŠСŜס]",
        "T" => "[TТȚטŦתŤŢ]",
        "U" => "[UÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ]",
        "V" => "[VВו]",
        "Y" => "[YÝЫŶŸ]",
        "Z" => "[ZŹŽŻЗז]",
        "a" => "[aаăǎąàãáæâåǻāא]",
        "b" => "[bбבþ]",
        "c" => "[cĉćçцצċčץ]",
        "d" => "[dдďđדð]",
        "e" => "[eèęéëêеēėěĕєəע]",
        "f" => "[fфƒ]",
        "g" => "[gğġģĝгגґ]",
        "h" => "[hחħхĥה]",
        "i" => "[iiïîíìįĭıиĩǐיїīі]",
        "j" => "[jйĵ]",
        "k" => "[kĸכķкך]",
        "l" => "[lłŀлļĺľל]",
        "m" => "[mמмם]",
        "n" => "[nñńнņןŋנŉň]",
        "o" => "[oøóòôõоőŏōǿǒơ]",
        "p" => "[pפףп]",
        "q" => "[qק]",
        "r" => "[rŕřŗרр]",
        "s" => "[sşśșšсŝס]",
        "t" => "[tтțטŧתťţ]",
        "u" => "[uùûúūуũưǔųŭůűǖǜǚǘ]",
        "v" => "[vвו]",
        "y" => "[yýыŷÿ]",
        "z" => "[zźžżзזſ]",
        "ae" => "(?:ae|[ÄǼäæǽ])",
        "ch" => "(?:ch|[Чч])",
        "ij" => "(?:ij|[ĳĲ])",
        "ja" => "(?:ja|[яЯ])",
        "je" => "(?:je|[Ээ])",
        "jo" => "(?:jo|[ёЁ])",
        "ju" => "(?:ju|[юЮ])",
        "oe" => "(?:oe|[œŒöÖ])",
        "sch" => "(?:sch|[щЩ])",
        "sh" => "(?:sh|[шШ])",
        "ss" => "(?:ss|[ß])",
        "ue" => "(?:ue|[Ü])",
        "zh" => "(?:zh|[Жж])"
    ]);

Code:

function highlightTerm($text, $keyword) {
    $regex = implode(
        '|',
        array_map(
            fn($w) => strtr(preg_quote($w), ACCENT_MAP),
            explode(" ", $keyword)
        )
    );
    return preg_replace(
               "#$regex#ui",
               '<span style="background:yellow;">$0</span>',
               $text
           );
}

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

Output:

Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?
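If maintaining a character map is not appealing, a variation on the same idea is to let Unicode do the work: decompose the text so every accent becomes a separate combining mark, allow optional marks after each letter of the search term, then recompose the result. This is only a sketch under a few assumptions: ext/intl is available, the search terms themselves are unaccented, and highlightDecomposed() is a name invented for this example. Like the Normalizer-based approaches, it does not cover letters that have no decomposition (ø, ł) or cross-script matches.

```php
<?php
// Requires ext/intl. highlightDecomposed() is a hypothetical name.
function highlightDecomposed(string $text, string $keyword): string
{
    // split accented letters into base letter + combining marks
    $nfd = Normalizer::normalize($text, Normalizer::FORM_D);

    $alternatives = [];
    foreach (preg_split('/\s+/', $keyword, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        // every letter may be followed by any number of combining marks
        $alternatives[] = implode('', array_map(
            fn($c) => preg_quote($c, '#') . '\p{Mn}*',
            $letters
        ));
    }
    if (!$alternatives) {
        return $text; // nothing to search for
    }

    $result = preg_replace(
        '#' . implode('|', $alternatives) . '#ui',
        '<span style="background:yellow;">$0</span>',
        $nfd
    );

    // put base letters and marks back together
    return Normalizer::normalize($result, Normalizer::FORM_C);
}

echo highlightDecomposed("Would you like a café, Mister Kàpêk?", "kape caf");
```

For the sample input this should produce the same markup as the output shown above.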
