
I am trying to calculate the word count of an HTML page using Cheerio in the following Gulp task:

gulp.task("html", () => {
    try {
        return gulp.src(folder.src + "html/**/*.html")
            .pipe(preprocess())
            .pipe(htmlmin({
                collapseWhitespace: true,
                removeComments: true,
                minifyCSS: true,
                minifyJS: true
            }))
            .pipe(cheerio(function ($, file) {
                try {
                    const text = $("body").children(":not(header, footer, script)").text();

                    const words = text.split(/\s+/).filter(word => word.length > 0);
                    const wordCount = words.length;

                    const filename = file.relative;

                    if (filename === "clients.html") {
                        console.log(words);
                    }

                    console.log(wordCount, filename);
                } catch (error) {
                    console.error(error);
                }
            }))
            .pipe(gulp.dest(folder.build));
    } catch (error) {
        console.warn("Error parsing HTML: ", error);
    }
});

While this works, I am encountering an issue where words are not being split properly. For example, I get results like:

[
  'Our',           'Valued',            'ClientsAt',
  'Zubizi',        'Web',               'Solutions,'
]

The problem is the token "ClientsAt": it should be two separate words ("Clients" and "At"), but the end of the heading and the start of the paragraph are being joined.

Here’s the HTML snippet I’m working with:


            <h1 class="text-3xl font-bold mb-3">Our Valued Clients</h1>
            <p class="text-lg mb-5">
                At Zubizi Web Solutions,
                ...

As you can see, the text "ClientsAt" is not properly split. How can I fix this issue?

FYI: the clients.html page is at https://zubizi.com/clients.html

Note: I am attempting to perform keyword analysis during the compilation process.

2 Answers


  1. (Caveat: I have neither run your code nor tested the below fully)

    It seems likely that you are experiencing the same behavior reported in this question: How to get the text of multiple elements split by delimiter using jQuery?

    If that assessment is accurate, code like the below, which is adapted from that question’s accepted answer, may be of use:

    const text = $("body")
      .children(":not(header, footer, script)")
      .map(function () {
        return $(this).text();
      })
      .get()
      .join(" ");
    

    Basically, the idea is to intersperse a space between text from different elements; under your current approach, that content is being joined without a delimiter by the call to .text().
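
    To illustrate the delimiter point without involving Cheerio at all, here is a minimal plain-JavaScript sketch. The `elementTexts` array and the `toWords` helper are hypothetical stand-ins: the array assumes each element's text arrives with no leading or trailing whitespace, which is the situation in which adjacent elements would fuse.

```javascript
// Hypothetical per-element text, standing in for what $(el).text() might
// return for the <h1> and the <p> when neither carries edge whitespace.
const elementTexts = ["Our Valued Clients", "At Zubizi Web Solutions,"];

// Split on runs of whitespace and drop any empty strings.
const toWords = (text) => text.split(/\s+/).filter((word) => word.length > 0);

// Joined with no delimiter: the words at the element boundary fuse together.
const fused = toWords(elementTexts.join(""));

// Joined with a space: each element's words stay separate.
const separated = toWords(elementTexts.join(" "));

console.log(fused);
console.log(separated);
```

    Under that assumption, the first array contains the fused token "ClientsAt" and the second contains "Clients" and "At" as separate entries, which is the difference the .map(...).get().join(" ") version above is meant to produce.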

  2. I’m not able to reproduce the problem. Proof by construction:

    const html = `
    <h1 class="text-3xl font-bold mb-3">Our Valued Clients</h1>
    <p class="text-lg mb-5">
        At Zubizi Web Solutions,
    ...`;
    const $ = cheerio.load(html);
    const text = $("body").children(":not(header, footer, script)").text();
    const words = text.split(/\s+/).filter(word => word.length > 0);
    console.log(words);
    <!-- Warning: don't use this in production, just for demos; use jQuery in the browser instead of cheerio -->
    <script src="https://bundle.run/[email protected]"></script>

    Output:

    [
      "Our",
      "Valued",
      "Clients",
      "At",
      "Zubizi",
      "Web",
      "Solutions,",
      "..."
    ]
    

    Even if the Gulp htmlmin step runs before this, it shouldn’t change the result (you can minify the HTML string and you’ll still get the same words as output, assuming the minifier isn’t broken).
