JavaScript regex treats Swedish characters as special characters and matches incorrectly

MohamadHammash
May 24, 2023
109 views
3 votes
3 Answers

I am currently working on a JavaScript feature that highlights search results. Specifically, when searching for a word such as ‘sea’ within a sentence such as ‘the sea causes me nausea in this season’, I want the word ‘sea’ and any word where it acts as a prefix, like ‘season’, to be highlighted. However, I do not want to highlight occurrences of ‘sea’ when it appears as a suffix, as in ‘nausea’, or in the middle of a word, as in ‘disease’.

To achieve this, I am using the regular expression /\bsea/gmi, which works perfectly with English characters. However, it fails to produce the desired results when applied to Swedish characters like ‘ä’, ‘å’, and ‘ö’. For example, if the search word is ‘gen’, the suffix ‘gen’ in the word ‘vägen’ is incorrectly highlighted. It seems that the regular expression treats these characters as special characters or something similar. I even tried adding the Unicode flag u, but that didn’t help either.

Since my expertise lies mainly in C#, I’m not familiar with how JavaScript behaves in this context. I would greatly appreciate any insights or guidance on how JavaScript handles these situations or how to work around this problem.

Answers

- HazikArshad
- May 24, 2023 at 7:35 am
- 0 votes
You can change your regular expression to handle Swedish characters as follows:
```
const searchTerm = 'sea';
const sentence = 'the sea causes me nausea in this season vägen';

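// Backslashes are doubled so the RegExp receives \b and \w
// (inside a template literal, "\b" on its own is a backspace character).
const pattern = new RegExp(`\\b${searchTerm}|\\b${searchTerm}[äåöÄÅÖ]\\w*`, 'gmi');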
const highlightedSentence = sentence.replace(pattern, (match) => `<mark>${match}</mark>`);

console.log(highlightedSentence);
```
- \b${searchTerm}[äåöÄÅÖ]\w* matches the search term ‘sea’ at a word boundary, followed by a Swedish character and any further word characters
- The g, m, and i flags make the search global, multiline, and case-insensitive
- The mark tag is used to highlight the text
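
One detail worth checking when a pattern is built from a string: the string parser handles backslashes before the regex engine does. The sketch below is only an illustration (the variable names are made up for it) of how '\b' in a string becomes a backspace character, while '\\b' or String.raw reaches the engine as a word boundary.

```javascript
// In a string or template literal, "\b" is the backspace character (U+0008).
const fromSingleBackslash = new RegExp('\bsea', 'i');

// Doubling the backslash passes a literal \b to the regex engine.
const fromDoubleBackslash = new RegExp('\\bsea', 'i');

// String.raw keeps backslashes as written, so no doubling is needed.
const fromRawString = new RegExp(String.raw`\bsea`, 'i');

const text = 'the sea';
const singleMatches = fromSingleBackslash.test(text); // looks for a backspace before "sea"
const doubleMatches = fromDoubleBackslash.test(text);
const rawMatches = fromRawString.test(text);

console.log(singleMatches, doubleMatches, rawMatches);
```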

- markalex
- May 24, 2023 at 8:14 am
- 0 votes
JavaScript’s regex engine doesn’t change the behavior of \b depending on the presence of the u flag. But luckily you can imitate a Unicode-aware \b using Unicode property classes.
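
A quick way to see this, as a minimal sketch using the word from the question: \b is defined in terms of \w, which covers only ASCII letters, digits, and the underscore, so ‘ä’ counts as a non-word character and a boundary appears right after it, with or without the u flag.

```javascript
// "ä" is not in \w, so there is a word boundary between "ä" and "g".
const word = 'vägen';

const matchesWithoutU = /\bgen/gi.test(word);
const matchesWithU = /\bgen/giu.test(word);

// Both results are the same: the u flag does not make \b Unicode-aware.
console.log(matchesWithoutU, matchesWithU);
```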

In this exact case your regex would look like this: /(?<![\p{L}\p{N}_])gen/gmiu.

Here we check (using a negative lookbehind) that gen is not immediately preceded by any of:
- \p{L}: a letter (in any language),
- \p{N}: a number (in any language),
- _.
Basically, [\p{L}\p{N}_] is an alternative to \w that takes Unicode into account. Please note that this is the default behavior in some other regex engines, for example PCRE.
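
Applied to the highlighting use case, the lookbehind could be wired up roughly as follows. This is a sketch, not a definitive implementation: escapeRegExp and highlightPrefix are hypothetical helper names, and the search term is escaped on the assumption that it comes from user input.

```javascript
// Hypothetical helper: escape regex syntax characters in user input.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap `term` in <mark> tags wherever it starts a word,
// where "word character" means any Unicode letter, number, or underscore.
function highlightPrefix(sentence, term) {
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}_])${escapeRegExp(term)}`,
    'giu' // the u flag is required for \p{...}
  );
  return sentence.replace(pattern, (match) => `<mark>${match}</mark>`);
}

const swedish = highlightPrefix('vägen till genen', 'gen');
const english = highlightPrefix('the sea causes me nausea in this season', 'sea');

console.log(swedish);
console.log(english);
```

The lookbehind consumes no characters, so only the search term itself ends up inside the mark tag.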

Demo here.

And in the general case, \b can be replaced with /(?<![\p{L}\p{N}_])(?=[\p{L}\p{N}_])|(?<=[\p{L}\p{N}_])(?![\p{L}\p{N}_])/gmu.

Demo here.
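
To reuse the general replacement in several patterns, it can be kept as a string fragment and interpolated. The sketch below assumes that approach; WORD and UNICODE_BOUNDARY are hypothetical names, and the fragment is wrapped in a non-capturing group so the alternation does not leak into the surrounding pattern.

```javascript
// Unicode-aware counterpart of \w: any letter, any number, or underscore.
const WORD = '[\\p{L}\\p{N}_]';

// Unicode-aware counterpart of \b: a position where a word starts or ends.
const UNICODE_BOUNDARY =
  `(?:(?<!${WORD})(?=${WORD})|(?<=${WORD})(?!${WORD}))`;

// Whole-word search for "väg", using the boundary on both sides.
const wholeWord = new RegExp(
  `${UNICODE_BOUNDARY}väg${UNICODE_BOUNDARY}`,
  'giu'
);

// Only the standalone occurrences should match, not "vägen" or "omväg".
const hits = 'väg, vägen, omväg, Väg'.match(wholeWord);
console.log(hits);
```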

- ByteBrawler0
- May 24, 2023 at 8:52 am
- 0 votes
In JavaScript, strings are Unicode, but the \w and \b shorthands in regular expressions only recognize ASCII letters, digits, and the underscore, even with the u flag. As a result, the word boundary \b does not work as expected with non-ASCII characters such as the Swedish characters ‘ä’, ‘å’, and ‘ö’.

To handle this situation and ensure proper word boundary matching with Swedish characters, you can use a library like XRegExp (https://xregexp.com/). XRegExp provides an augmented, extensible regular expression syntax with additional features and fixes for some of the inconsistencies in JavaScript’s native regular expressions.

Here’s how you can modify your code to use XRegExp:
1. First, include the XRegExp library in your HTML file by adding the following script tag in the head section:
  
  <script src="https://unpkg.com/xregexp/xregexp-all.js"></script>
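2. Use the XRegExp library to create a regular expression pattern that treats Swedish characters as letters. XRegExp leaves \b itself ASCII-only, so the boundary is expressed with the \p{L} and \p{N} tokens that its Unicode addons (included in xregexp-all.js) provide. Replace your existing regular expression with the following:
  
  var pattern = XRegExp('(?<![\\p{L}\\p{N}_])sea', 'gmi');
  In this example, we’re using XRegExp to create a regular expression pattern that matches ‘sea’ only when it is not preceded by a letter, number, or underscore. The backslashes are doubled because the pattern is written as a string, and the lookbehind (?<!...) needs a JavaScript engine that supports it natively. The ‘gmi’ flags are used to perform a global search (find all matches) while ignoring case and treating the string as multiple lines.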
By using XRegExp, you should be able to achieve the desired highlighting behavior, including proper handling of Swedish characters like ‘ä’, ‘å’, and ‘ö’.
