Javascript - Highlight substring in Arabic text and ignore diacritics

MuhammadShahryar
June 9, 2024
2 Answers

Let’s say we have these Arabic strings in JavaScript:

const resultText = 'اَلْيَوْمُ جَمِيلٌ وَالشَّمْسُ';
const searchText = 'والشم';

The searchText variable is dynamic and may have a different value at runtime, so we need to build a regex from it that matches searchText inside resultText. The difficulty is that, besides building the regex from a dynamic variable, we also want the match to ignore certain characters, namely the Arabic diacritics. After the replacement, the result should look like this:

"اَلْيَوْمُ جَمِيلٌ <span class="highlighted">وَالشَّم</span>سُ"

Basically, we want to wrap the matched searchText in an HTML span tag while ignoring diacritics during matching, because searchText comes without diacritics and resultText has them. If we first removed all the diacritics from resultText, the match would be easy, but we want to keep the diacritics in resultText and still match successfully, so the diacritics have to be ignored during matching.

So far we have managed to wrap the matched word in HTML, but ignoring the diacritics is still missing:

const searchText = 'والشم';

const resultText = 'اَلْيَوْمُ جَمِيلٌ وَالشَّمْسُ';

const regex1 = new RegExp(searchText, 'gi');

const finalText = resultText.replace(regex1, '<span class="highlighted">$&</span>');

As a hint, the regex below is used to strip all the diacritics from a string:

'وَالشَّمْسُ'.normalize('NFD').replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');

The pattern above covers all the diacritic characters. How can we use or modify it inside the replacing regex, together with the dynamic variable, as described above?

Tags: javascript regex

Answers

- Jonty
- June 2, 2024 at 8:27 pm
- 0 votes
To wrap the matched word in HTML tags while ignoring certain characters (like the period in this case), you can build the regular expression so that the ignored character may optionally appear between the characters of the search text. Here’s how you can do it:
```
const searchText = 'thing';
const result = 'some th.in.g';

// Escape regex metacharacters in a string
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build the pattern character by character, allowing any number of
// periods between two consecutive characters of the search text
const pattern = Array.from(searchText).map(escapeRegExp).join('\\.*');
const regex = new RegExp(pattern, 'gi');

// Wrap the matched word in HTML span tags
const finalText = result.replace(regex, '<span class="highlighted">$&</span>');

console.log(finalText); // some <span class="highlighted">th.in.g</span>
```

- bobblebubble
- June 9, 2024 at 12:18 pm
- 0 votes
Besides the diacritic marks, one of your samples (in the comments) shows that even characters without such marks have variants. ا (alef) is a different character from أ (alef with hamza above). To match either, you will need to identify all such characters that can occur and replace each occurrence in your searchText with a character class, for example replace أ with [اأإآ].
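A minimal sketch of that substitution step on its own. The names `VARIANT_GROUPS` and `expandVariants` are hypothetical, the two groups are only the ones mentioned in this answer (not a complete inventory of Arabic letter variants), and the search text is assumed to contain no regex metacharacters:

```javascript
// Hypothetical variant table: letters within one string should match each other.
// Written as escapes so the exact code points are unambiguous.
const VARIANT_GROUPS = [
  '\u0627\u0623\u0625\u0622', // ا أ إ آ (alef forms)
  '\u0624\u0626',             // ؤ ئ (hamza carriers)
];

// Replace every letter that belongs to a group with a character class
// for the whole group; all other characters are kept unchanged.
// Assumes `search` has no regex metacharacters left in it.
function expandVariants(search) {
  return Array.from(search, (ch) => {
    const group = VARIANT_GROUPS.find((g) => g.includes(ch));
    return group ? `[${group}]` : ch;
  }).join('');
}

// Search typed with a plain alef ...
const variantPattern = new RegExp(expandVariants('\u0627\u0639\u0648\u0630'), 'u');

// ... still matches text written with alef with hamza above.
console.log(variantPattern.test('\u0623\u0639\u0648\u0630')); // true
console.log(variantPattern.test('\u0627\u0639\u0648\u0630')); // true
```

Working per character like this avoids running one global replace per group over a pattern that already contains inserted regex syntax.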

To get started, I would do something like this (experimental, I have no experience with Arabic text).
// highlight arabic substring in text
function highlightArabic(search, text, verbose = false) {
  // check input for at least one letter or digit
  if (!/[\p{L}\p{N}]/u.test(search) || !/[\p{L}\p{N}]/u.test(text)) {
    return text;
  }

  // 1. normalize and remove special characters
  search = search.normalize('NFD').replace(/[^\p{L}\p{N} ]+/gu, "");
  if (verbose) { console.log('s => ' + search); }

  // 2. add optional unicode marks between characters
  search = search.replace(/.{0}/gu, '\\p{M}*');

  // 3. replace characters with diacritics to variants
  ['[اأإآ]', '[ؤئ]'].forEach((v) => {
    search = search.replace(new RegExp(v, "gu"), v);
  });
  if (verbose) { console.log('p => ' + search); }

  let p = new RegExp(search, 'gui'); // i-flag if latin chars
  return text.replace(p, '<span class="highlighted">$&</span>');
}

// test
let s = 'والشم';
let txt = 'اَلْيَوْمُ جَمِيلٌ وَالشَّمْسُ';
console.log(highlightArabic(s, txt, true));

s = 'قل اعوذ';
txt = 'قُلْ أَعُوذُ بِرَبِّ النَّاسِ';
console.log(highlightArabic(s, txt, true));
Note that I used .{0} to add \p{M}* between all characters. If search often contains Latin numbers and letters, you could target only the Arabic characters with (?=[ء-ي٠-٩])|(?<=[ء-ي٠-٩]).
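A rough sketch of that narrower variant, untested against real-world Arabic text. The names `ARABIC_BOUNDARY` and `addOptionalMarks` are hypothetical, the search text is assumed to be already stripped of diacritics and regex metacharacters, and lookbehind support (ES2018) is required:

```javascript
// Empty positions directly before or after an Arabic letter or Arabic-Indic digit.
// The escapes spell out the ranges ء-ي (U+0621-U+064A) and ٠-٩ (U+0660-U+0669).
const ARABIC_BOUNDARY =
  /(?=[\u0621-\u064A\u0660-\u0669])|(?<=[\u0621-\u064A\u0660-\u0669])/gu;

// Hypothetical helper: inserts \p{M}* only next to Arabic characters.
// Assumes `search` has no diacritics or regex metacharacters left in it.
function addOptionalMarks(search) {
  return search.replace(ARABIC_BOUNDARY, '\\p{M}*');
}

// Latin-only input is left untouched.
console.log(addOptionalMarks('abc')); // abc

// Arabic input gets optional marks around every letter.
const marksPattern = new RegExp(addOptionalMarks('والشم'), 'u');
console.log(marksPattern.test('اَلْيَوْمُ جَمِيلٌ وَالشَّمْسُ')); // true
```

Compared with `.{0}`, this keeps the generated pattern shorter for mixed Latin and Arabic input, at the cost of a hard-coded character range.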

