I am trying to calculate the word count of an HTML page using Cheerio in the following Gulp task:
gulp.task("html", () => {
  try {
    return gulp.src(folder.src + "html/**/*.html")
      .pipe(preprocess())
      .pipe(htmlmin({
        collapseWhitespace: true,
        removeComments: true,
        minifyCSS: true,
        minifyJS: true
      }))
      .pipe(cheerio(function ($, file) {
        try {
          // Text of every top-level body element except header, footer and script
          const text = $("body").children(":not(header, footer, script)").text();
          // Split on runs of whitespace and drop empty strings
          const words = text.split(/\s+/).filter(word => word.length > 0);
          const wordCount = words.length;
          const filename = file.relative;
          if (filename == "clients.html") {
            console.log(words);
          }
          console.log(wordCount, filename);
        } catch (error) {
          console.error(error);
        }
      }))
      .pipe(gulp.dest(folder.build));
  } catch (error) {
    console.warn("Error parsing HTML: ", error);
  }
});
While this works, I am encountering an issue where words are not being split properly. For example, I get results like:
[
'Our', 'Valued', 'ClientsAt',
'Zubizi', 'Web', 'Solutions,'
]
The issue seems to be with "ClientsAt", which should be two separate words ("Clients" and "At") but comes out as one.
Here’s the HTML snippet I’m working with:
<h1 class="text-3xl font-bold mb-3">Our Valued Clients</h1>
<p class="text-lg mb-5">
At Zubizi Web Solutions,
...
As you can see, the text "ClientsAt" is not properly split. How can I fix this issue?
FYI: the clients.html page is at https://zubizi.com/clients.html
Note: I am attempting to perform keyword analysis during the compilation process.
2 Answers
(Caveat: I have neither run your code nor tested the below fully)
It seems likely that you are experiencing the same behavior reported in this question: How to get the text of multiple elements split by delimiter using jQuery?
If that assessment is accurate, code like the below, which is adapted from that question’s accepted answer, may be of use.
Basically, the idea is to intersperse a space between text from different elements; under your current approach, that content is being joined without a delimiter by the call to .text().
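As a minimal sketch of that idea (untested against the actual page): collect the text of each element separately, join the pieces with a space, and only then split into words. The helper name splitWords and the sample strings are hypothetical, chosen for illustration. The commented Cheerio call assumes the standard .map() and .get() methods and is one possible way to gather per-element text, not necessarily the approach from the linked answer.

```javascript
// Hypothetical helper (not from the original posts): takes one text string
// per element and joins them with a space, so that text from adjacent
// elements cannot fuse into a single word.
function splitWords(elementTexts) {
  return elementTexts
    .join(" ")
    .split(/\s+/)
    .filter((word) => word.length > 0);
}

// Inside the Gulp task, the per-element strings could be collected with
// Cheerio's .map() and .get() instead of a single .text() call:
//   const texts = $("body")
//     .children(":not(header, footer, script)")
//     .map((i, el) => $(el).text())
//     .get();

// Stand-in for what the <h1> and <p> might yield once whitespace is collapsed.
const texts = ["Our Valued Clients", "At Zubizi Web Solutions,"];
const words = splitWords(texts);
console.log(words.length, words);
```

Joining with a space is harmless when whitespace is already present between elements, because the split on /\s+/ treats any run of whitespace as one separator.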
I’m not able to reproduce the problem. Proof by construction:
Output:
Even if the Gulp htmlmin step runs before this, it shouldn’t change the result (you can minify the HTML string and you’ll still get the same words as output, assuming the minifier isn’t broken).