Out of curiosity, I decided to scrape some data from the site (hotel name, price per night, rating) for myself and ran into a problem: I get nothing in the output. I have also tried rewriting this with other libraries, but I was told this one is better.

const cheerio = require("cheerio"); 
let fs = require('fs');
const base = "https://ostrovok.ru/hotel/russia/adler/";

(async () => {
  let url = "?page=1";
  const data = [];

  for (let i = 0; i < 176; i++) {
    try {
      console.log(base + url);
      const res = await fetch(base + url);

      if (!res.ok) {
        break;
      }

      const $ = cheerio.load(await res.text());
      const chunk = [...$("")].map(e =>
        $(e).text().trim()
      );
      data.push(chunk);
      url = $("#__next > div > div:nth-child(2) > div > div > div.Layout_content__9ap_g > div:nth-child(3) > div > div.HotelCard_headerArea__hlQPk > div > div.HotelCard_mainInfo__pNKYU > div.HotelCard_wrapTitle__t742O > h2 > a").attr("TEXT");
    }
    catch (err) {
      console.error(err);
      break;
    }
  }

  console.log(JSON.stringify(data, null, 2));

  fs.writeFile('numbers.txt', data.join('\n'), function(err) {
    if (err) {
        console.log(err);
    }
});

})();

I was expecting to see a list of data, but I got [].

2 Answers


  1. You pass an empty selector:

    $("")
    

    …that will not select anything.

    You should specify which elements you want to select. For instance, if you want the hotel names, then maybe:

    $(".HotelCard_title__cpfvk")
    

    Or a combination of hotel names and prices:

    $(".HotelCard_title__cpfvk,.HotelCard_ratePriceValue__s3HvW")
    

    Note that this website is internationalised, so you may need to pass a parameter to select the language you want. How to do that depends on the third-party website.

  2. base + url never advances to the next page, because the loop index i is not used in the URL. Try interpolating it: `${base}?page=${i}`.
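
    As a small sketch of building the page URL, the built-in URL class can set the page query parameter instead of concatenating strings. The helper name pageUrl is hypothetical, and the sketch assumes the site paginates through a page parameter, as the question's URL suggests.

    ```javascript
    // Hypothetical helper: returns the listing URL for a given page number.
    // Assumes pagination is driven by a "page" query parameter.
    function pageUrl(base, page) {
      const url = new URL(base);
      // set() replaces an existing "page" value or adds one if missing.
      url.searchParams.set("page", String(page));
      return url.href;
    }

    const base = "https://ostrovok.ru/hotel/russia/adler/";
    for (let i = 1; i <= 3; i++) {
      console.log(pageUrl(base, i));
    }
    ```

    Compared with plain interpolation, this also behaves sensibly if the base URL already carries other query parameters.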

    .attr("TEXT") looks incorrect. I assume you want all 20 hotel names on each page, so use [...$("...")].map(e => $(e).text()) to collect each name as a separate array element.

    As for the selector, long, browser-generated ultra-rigid selectors are prone to error. If any assumption in that chain changes, the whole thing breaks. Safer to use ".HotelCard_title__cpfvk", which is all that’s needed to identify the element you want, and nothing more or less.

    !res.ok isn’t enough to determine when the pagination ends. Break when the result list is empty.

    Putting it together:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    const {writeFile} = require("node:fs/promises");
    
    const url = "<Your URL>";
    
    (async () => {
      const data = [];
    
      for (let i = 1; i <= 1000; i++) {
        const res = await fetch(`${url}?page=${i}`);
    
        if (!res.ok) {
          break;
        }
        
        const $ = cheerio.load(await res.text());
        const chunk = [...$(".HotelCard_title__cpfvk")]
          .map(e => $(e).text());
    
        if (!chunk.length) {
          break;
        }
    
        data.push(...chunk);
      }
    
      console.log(data);
      await writeFile("numbers.txt", JSON.stringify(data));
    })();
    

    This takes a while to run, so you could parallelize requests (at the risk of angering the server), or simply add some logs to ensure each chunk is coming through OK.
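
    One way to parallelize without flooding the server is to fetch pages in small batches, as in the rough sketch below. The names scrapeInBatches and fetchPage, and the batchSize, maxPages and delayMs options, are hypothetical; fetchPage is assumed to be a function you supply that resolves to one page's array of results (for example, the fetch and cheerio steps from the script above).

    ```javascript
    // Hypothetical sketch: fetch pages in small concurrent batches.
    // fetchPage(page) is an assumed callback resolving to that page's results.
    async function scrapeInBatches(
      fetchPage,
      {batchSize = 5, maxPages = 1000, delayMs = 500} = {}
    ) {
      const data = [];

      for (let start = 1; start <= maxPages; start += batchSize) {
        const pages = [];
        for (let p = start; p < start + batchSize && p <= maxPages; p++) {
          pages.push(p);
        }

        // Requests within a batch run concurrently; batches run in sequence.
        const chunks = await Promise.all(pages.map(p => fetchPage(p)));

        for (const [idx, chunk] of chunks.entries()) {
          console.log(`page ${pages[idx]}: ${chunk.length} items`);
          if (!chunk.length) {
            return data; // the first empty page marks the end of the listing
          }
          data.push(...chunk);
        }

        // Pause between batches to go easier on the server.
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }

      return data;
    }
    ```

    Because a whole batch is requested before any page is inspected, up to batchSize - 1 requests past the last page may be wasted; a small batch size keeps that cost low. The per-page log line doubles as the progress check mentioned above.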

    To get the other fields you want, you can modify the script as follows:

    const chunk = [...$('[data-testid="serp-hotelcard"]')]
      .map(e => ({
        name: $(e).find('[class*="HotelCard_title"]').text(),
        price: $(e).find('[class*="HotelCard_ratePriceValue"]').text(),
        rating: $(e).find('[class*="TripAdvisor_tripAdvisor_value"]')
          .first()
          .attr("class")
          ?.split(/\s+/)
          .find(e => e.includes("TripAdvisor_tripAdvisor_value"))
          .match(/_value_(\d+)_/)[1]
          .split("")
          .join("."),
      }));
    

    Note that I’ve loosened some selectors to use substrings, avoiding a situation where the generated-looking substring "cpfvk" changes in ".HotelCard_title__cpfvk".
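
    The rating chain above throws if the class name ever stops matching the expected pattern. A guarded variant might look like the sketch below. The helper name ratingFromClass is hypothetical, and the sample class names are made up for illustration; the only assumption taken from the script is that the rating digits are embedded in the class name after "TripAdvisor_tripAdvisor_value_".

    ```javascript
    // Hypothetical helper: pull a rating such as "4.5" out of a class attribute.
    // Assumes the digits appear as "TripAdvisor_tripAdvisor_value_<digits>_...".
    function ratingFromClass(classAttr) {
      const match = (classAttr ?? "").match(
        /TripAdvisor_tripAdvisor_value_(\d+)_/
      );
      // "45" becomes "4.5"; null when the pattern is absent.
      return match ? match[1].split("").join(".") : null;
    }

    // Made-up class attributes, for illustration only.
    console.log(
      ratingFromClass("TripAdvisor_root__x1 TripAdvisor_tripAdvisor_value_45__abcde")
    );
    console.log(ratingFromClass("HotelCard_title__cpfvk"));
    console.log(ratingFromClass(undefined));
    ```

    In the scraper, this could be called with the result of .attr("class"), so hotels without a rating yield null instead of crashing the run.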

    Disclosure: I’m the author of the linked blog post.
