I’m trying to write a method to count the number of words when the content is in Chinese or Japanese. It should exclude special characters, punctuation, and whitespace.
I tried creating a regex for each locale and finding the words based on it. I also looked for existing regexes online, but none of them seem to work. My approach:
function countWords(text, locale) {
  let wordCount = 0;
  // Set the match pattern based on the locale.
  // Note: in a string literal, backslashes must be doubled —
  // '\b' is a backspace character, '\\b' is the regex word boundary.
  let wordBoundary = '\\b';
  if (locale === 'ja') {
    // Runs of Japanese characters (hiragana, katakana, kanji, and the long-vowel mark)
    wordBoundary = '[\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Han}ー]+';
  } else if (locale === 'zh') {
    // Runs of Chinese (Han) characters
    wordBoundary = '[\\p{Script=Han}]+';
  }
  const regex = new RegExp(wordBoundary, 'gu');
  const matches = text.matchAll(regex);
  for (const match of matches) {
    wordCount++;
  }
  return wordCount;
}
I thought this should work, but when I compare the word count from MS Word against this logic, the numbers come out different.
2 Answers
Well, I did a similar sort of thing in Python.
Instead of completely depending on regular expressions, you can use existing language processing libraries that provide better word segmentation algorithms specifically designed for Chinese and Japanese. Here are a couple of popular libraries you can consider:
For Chinese: Jieba (结巴分词) is a widely used Chinese text segmentation library for Python. It provides efficient word segmentation for Chinese text. You can integrate Jieba into your JavaScript code using tools like Emscripten or WebAssembly to leverage its word segmentation capabilities.
For Japanese: MeCab (めかぶ) is a popular Japanese morphological analyzer and part-of-speech tagger. It can efficiently segment Japanese text into words. Similarly to Jieba, you can try using tools like Emscripten or WebAssembly to use MeCab within your JavaScript code.
Here’s an example of how you can modify your code to use the Jieba library for Chinese word segmentation:
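A minimal sketch, assuming the nodejieba Node.js binding (its cut method returns an array of segmented words). Since nodejieba requires a native install (npm install nodejieba), the calls to it are shown as comments, and a trivial whitespace-splitting stub segmenter is used to illustrate the counting logic:

```javascript
// Sketch: countWords delegates segmentation to a pluggable segmenter function,
// then drops segments that contain no letters or digits (punctuation, symbols).
function countWords(text, segment) {
  return segment(text).filter((w) => /\p{L}|\p{N}/u.test(w)).length;
}

// With nodejieba (hypothetical setup, requires `npm install nodejieba`):
// const nodejieba = require("nodejieba");
// countWords("南京市长江大桥", (t) => nodejieba.cut(t));

// Stub segmenter for illustration only: splits on whitespace.
const stub = (t) => t.split(/\s+/);
console.log(countWords("hello, world !", stub)); // → 2 ("!" is filtered out)
```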
Please note that integrating Jieba or MeCab into JavaScript might require additional setup steps, such as compiling the libraries for the web or using pre-compiled versions specifically built for JavaScript environments.
A possible word-count approach could be based on a text-segmentation array which is the result of calling an Intl.Segmenter instance’s segment method. Each segmented item features properties like e.g. … thus, in order to get the total word count, one could reduce the array of text-segment items by validating each item’s isWordLike value. Note: as of now Firefox still does not support/implement Intl.Segmenter.
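The approach described above can be sketched as follows (this relies only on the built-in Intl.Segmenter, available in Chromium-based browsers, Safari, and Node.js 16+):

```javascript
// Segment the text with a locale-aware word segmenter, then count only the
// segments flagged as word-like (punctuation and whitespace come back with
// isWordLike set to false).
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // segment() returns an iterable of { segment, index, isWordLike } items.
  return [...segmenter.segment(text)].reduce(
    (count, { isWordLike }) => count + (isWordLike ? 1 : 0),
    0
  );
}

console.log(countWords("Hello, world!", "en")); // → 2
console.log(countWords("これはテストです。", "ja")); // counts kana/kanji words, not punctuation
console.log(countWords("我喜欢编程。", "zh"));
```

Note that the exact counts for Chinese and Japanese depend on the ICU segmentation data shipped with the JavaScript engine, so they may still differ slightly from MS Word’s counts.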