Out of curiosity, I decided to collect data from the site (name, price per night, rating) for myself, and ran into a problem: I get nothing in the output. I have tried rewriting this with other libraries, but I was told this one is better.
const cheerio = require("cheerio");
let fs = require('fs');
const base = "https://ostrovok.ru/hotel/russia/adler/";
(async () => {
  let url = "?page=1";
  const data = [];
  for (let i = 0; i < 176; i++) {
    try {
      console.log(base + url);
      const res = await fetch(base + url);
      if (!res.ok) {
        break;
      }
      const $ = cheerio.load(await res.text());
      const chunk = [...$("")].map(e =>
        $(e).text().trim()
      );
      data.push(chunk);
      url = $("#__next > div > div:nth-child(2) > div > div > div.Layout_content__9ap_g > div:nth-child(3) > div > div.HotelCard_headerArea__hlQPk > div > div.HotelCard_mainInfo__pNKYU > div.HotelCard_wrapTitle__t742O > h2 > a").attr("TEXT");
    }
    catch (err) {
      console.error(err);
      break;
    }
  }
  console.log(JSON.stringify(data, null, 2));
  fs.writeFile('numbers.txt', data.join('\n'), function(err) {
    if (err) {
      console.log(err);
    }
  });
})();
I was expecting to see a list of data, but I got [].
2 Answers
You pass an empty selector, $(""), which will not select anything.
You should specify which elements you want to select: for instance the hotel names, or a combination of hotel names and prices.
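As a minimal sketch of what a non-empty selector looks like in practice, the helper below collects the trimmed text of every element a selector matches. The name textsOf is made up for illustration and is not part of cheerio; $ is assumed to be the function returned by cheerio.load, as in the question's code.

```javascript
// Collect the trimmed text of every element matched by `selector`.
// `$` is the function returned by cheerio.load(html).
// `textsOf` is an illustrative helper name, not a cheerio API.
const textsOf = ($, selector) =>
  [...$(selector)].map(e => $(e).text().trim());
```

For hotel names, textsOf($, ".HotelCard_title__cpfvk") should work as long as that class name (the one mentioned in the other answer) is still what the site uses. Passing a comma-separated selector that also names the price element returns names and prices together, normally in document order. The price class is not shown in this thread, so it has to be read from the page's markup.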
Note that this website has internationalisation, so you may need to pass a parameter to get the language of your choice. But that depends on that third-party website…
- base + url always uses ?page=1. Try interpolating the index variable instead: ${base}?page=${i}
- .attr("TEXT") looks incorrect. I assume you want all 20 hotel names on each page, so use [...$("...")].map(e => $(e).text()) to collect each name as a separate array element.
- As for the selector, long, browser-generated ultra-rigid selectors are prone to error. If any assumption in that chain changes, the whole thing breaks. It is safer to use ".HotelCard_title__cpfvk", which is all that’s needed to identify the element you want, and nothing more or less.
- !res.ok isn’t enough to determine when the pagination ends. Break when the result list is empty.
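A minimal sketch combining these points might look like the following. It assumes Node 18+ (for the global fetch), that pages are numbered from 1 as in the question's ?page=1, and that ".HotelCard_title__cpfvk" is still the class the site uses. The function names scrapeNames and loadPage are made up for this sketch, and the page loader is passed in as a parameter so the loop can be tried without hitting the network.

```javascript
const base = "https://ostrovok.ru/hotel/russia/adler/";

// Extract every hotel name from a loaded cheerio document.
const extractNames = $ =>
  [...$(".HotelCard_title__cpfvk")].map(e => $(e).text().trim());

// Walk the pages in order and stop at the first page with no results.
// `loadPage(i)` must resolve to a loaded cheerio document for page i.
async function scrapeNames(loadPage, maxPages = 176) {
  const names = [];
  for (let i = 1; i <= maxPages; i++) {
    const chunk = extractNames(await loadPage(i));
    if (chunk.length === 0) break; // an empty list marks the end of pagination
    names.push(...chunk); // one flat array, one name per element
  }
  return names;
}

// Real loader: fetch one page and parse it with cheerio.
// cheerio is required here so the helpers above can be used without it.
async function loadPage(i) {
  const cheerio = require("cheerio");
  const res = await fetch(`${base}?page=${i}`); // interpolate the page index
  if (!res.ok) throw Error(`HTTP ${res.status} for page ${i}`);
  return cheerio.load(await res.text());
}
```

To run it, call scrapeNames(loadPage) and write the resolved array out, for example with fs.writeFile("numbers.txt", names.join("\n"), callback) as in the question.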
The full scrape takes a while to run, so you could parallelize requests (at the risk of angering the server), or simply add some logs to ensure each chunk is coming through OK.
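One hedged way to parallelize without flooding the server is to send requests in small batches and log each batch. The sketch below assumes the last page number is known up front (the question hard-codes 176). The name inBatches and the task callback are hypothetical, and the batch size is something to tune conservatively.

```javascript
// Run `task(page)` for pages 1..lastPage, at most `batchSize` at a time.
// Results are returned in page order.
async function inBatches(lastPage, batchSize, task) {
  const results = [];
  for (let start = 1; start <= lastPage; start += batchSize) {
    const end = Math.min(start + batchSize - 1, lastPage);
    const pages = [];
    for (let p = start; p <= end; p++) pages.push(p);
    console.log(`fetching pages ${start}-${end}`); // progress log per batch
    // Promise.all keeps the order of its input, so page order is preserved.
    results.push(...await Promise.all(pages.map(p => task(p))));
  }
  return results;
}
```

Here task would be a function that fetches and parses one page and resolves to that page's list of names. Because the batches run one after another, a failed request rejects the whole call at that batch rather than after every page has been requested.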
The script can be modified to collect the other fields you want as well.
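A rough sketch of such a modification follows, reading one record per hotel card and matching classes by prefix rather than by their full generated name. Only "HotelCard_title" comes from this thread; the card, price and rating prefixes are hypothetical placeholders that need to be checked against the page's actual markup, and byPrefix and extractHotels are names invented for the sketch.

```javascript
// Match a class by its stable prefix so that a changed generated suffix
// (the "cpfvk" part) does not break the selector. The second branch covers
// elements where the class is not the first one in the attribute.
const byPrefix = prefix => `[class^="${prefix}"], [class*=" ${prefix}"]`;

// Read one record per hotel card from a loaded cheerio document `$`.
// ASSUMPTION: every prefix except "HotelCard_title__" is a hypothetical
// placeholder, not a class name confirmed for this site.
function extractHotels($, prefixes = {
  card: "HotelCard_card__",
  name: "HotelCard_title__",
  price: "HotelCard_price__",
  rating: "HotelCard_rating__",
}) {
  const { card, ...fields } = prefixes;
  return [...$(byPrefix(card))].map(el => {
    const record = {};
    for (const [key, prefix] of Object.entries(fields)) {
      // Take the first match inside this card only, so fields stay paired.
      record[key] = $(el).find(byPrefix(prefix)).first().text().trim();
    }
    return record;
  });
}
```

Calling extractHotels($) on each page's document gives an array of { name, price, rating } objects, which can be appended to the results in place of the plain list of names. A field whose selector matches nothing comes back as an empty string, which is a quick way to spot a wrong prefix.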
Note that I’ve loosened some selectors to use substrings, avoiding a situation where the generated-looking substring "cpfvk" changes in ".HotelCard_title__cpfvk".
Disclosure: I’m the author of the linked blog post.