
I am trying to extract the img src from the following XML tag inside an item.

I am calling cheerio.load on my response data like so:

const $ = cheerio.load(response.data, { xmlMode: true });
$("item").each((i, item) => {
  // the queries shown further down run here, against `item`
});

Each item contains this specific tag, which I want to extract the img src from:

<figure class="wp-block-image size-large">
<img decoding="async" loading="lazy" width="800" height="572" src="http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-800x572.jpeg" alt="" class="wp-image-43535" srcset="http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-800x572.jpeg 800w, http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-350x250.jpeg 350w, http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-768x549.jpeg 768w, http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2.jpeg 1024w" sizes="(max-width: 800px) 100vw, 800px" />
</figure>

I have tried the following Cheerio queries, but I keep getting either undefined or something other than what I want:

$(item).find("figure").find("img").attr("src")
$(item).find("img").attr("src")
$(item).find("figure").children().find("img").attr("src")
$(item).find("figure").first().find("img").attr("src")

This is the RSS feed I am trying to extract the figure from:

http://wmcmuaythai.org/feed/

2 Answers


  1. You can use the $("img", item) selector to find the img tag within the item element, then call .attr("src") to read its src attribute:

    const $ = cheerio.load(response.data, { xmlMode: true });
    
    $("item").each((i, item) => {
      const imgSrc = $("img", item).attr("src");
      console.log(imgSrc);
    });
    
  2. I’m not too familiar with XML but the tags you want look like they’re inside CDATA. I’ve had success in the past by loading the CDATA text into Cheerio, then traversing that inner structure.

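    I also don’t know how to select content:encoded (the elements containing the CDATA), since Cheerio thinks : is a pseudoselector rather than part of the tag name. The following approach is therefore a bit crude: it loads the text of every element into Cheerio and keeps whichever ones turn out to contain img tags.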

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    fetch("<Your URL>")
      .then(res => {
        if (!res.ok) {
          throw new Error(res.statusText);
        }
    
        return res.text();
      })
      .then(xml => {
        const $ = cheerio.load(xml, {xmlMode: true});
        // Re-parse the text of every element (CDATA included) as HTML,
        // keep the documents that contain images, and collect each src.
        const result = [...$("*")]
          .map(el => cheerio.load($(el).text()))
          .filter($inner => $inner("img").length)
          .flatMap($inner => [...$inner("img")].map(img => $inner(img).attr("src")));
        const unique = new Set(result);
        console.log(unique);
        console.log(result.length); // => 204
        console.log(unique.size); // => 51
      })
      .catch(err => console.error(err));
    

    As you can see, this picks up some duplicate images so you may wish to refine the selectors a bit further or unflatten the map to maintain the groupings, depending on whatever your expected result is.
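
    A possibly less crude variant, offered only as a sketch: escaping the colon with a backslash (content\\:encoded inside a JavaScript string) should let the selector match the namespaced tag, so each item's CDATA can be loaded on its own and the groupings are kept. This assumes the figure markup lives in each item's content:encoded element, which I have not confirmed for every item in this feed. The function name extractImageSources is made up for illustration.

    ```javascript
    const cheerio = require("cheerio");

    // Hypothetical helper: returns one array of img src values per <item>.
    function extractImageSources(xml) {
      const $ = cheerio.load(xml, { xmlMode: true });

      return $("item")
        .toArray()
        .map(item => {
          // The backslash escapes ":" so it is read as part of the tag
          // name instead of the start of a pseudo-selector.
          const encoded = $(item).find("content\\:encoded").text();

          // The CDATA text is plain HTML, so parse it in the default
          // (non-XML) mode and query it like any other document.
          const $content = cheerio.load(encoded);
          return $content("img")
            .toArray()
            .map(img => $content(img).attr("src"));
        });
    }
    ```

    Calling extractImageSources(response.data) would then give an array per item, and an item without images yields an empty array rather than undefined.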
