
How to find first image or first video thumbnail in a webpage?
My code (below) retrieves an image, but it is not always the first visible image, and it doesn't fetch video thumbnails properly either.

Sample URLs:

  1. https://www.bbc.com/news/articles/c0ve1zn9wlyo
  2. https://www.bbc.com/news/articles/c0r8edwj9wlo

The above are sample webpages, but I am looking for a generic logic that is able to get the first top image which covers an article.

How can I make my code work more accurately?

Testing code:

public String fetchTopImage() throws Exception {
    // Select the top image element
    Element tImageElem = document.select("img").first();

    String tImage = null;
    if (tImageElem != null) {
        tImage = tImageElem.absUrl("src");
        log.trace("top image url is {}", tImage);
    }   
    else {
        log.trace("top image not found in {}", url);
    }   

    return tImage;
}

3 Answers


  1. public String fetchTopImage() throws Exception {
    String tImage = null;

    // Prefer the poster (thumbnail) of the first <video> that declares one
    Element videoElem = document.select("video[poster]").first();
    if (videoElem != null) {
        // absUrl returns "" (never null) when the URL cannot be resolved
        String poster = videoElem.absUrl("poster");
        if (!poster.isEmpty()) {
            log.trace("Top video thumbnail is {}", poster);
            return poster;
        }
    }

    // Jsoup does not render the page, so it has no ":visible" selector;
    // take the first <img> whose src resolves to a non-empty URL instead
    for (Element imgElem : document.select("img[src]")) {
        String src = imgElem.absUrl("src");
        if (!src.isEmpty()) {
            tImage = src;
            log.trace("Top image url is {}", tImage);
            break;
        }
    }

    if (tImage == null) {
        log.trace("No top image or video thumbnail found in {}", url);
    }

    return tImage;
    }
    

    Try this

  2. Your sample pages look like they are built with Next.js and styled-components,
    which means the video or image may be inserted dynamically by JavaScript.

    Possible Solution:
    Crawling always has to be fitted to the specific site, so you should tailor your parser to find exactly what you want. For example, the video is not easy to parse because its thumbnail is inside an <iframe> which is itself inside another <iframe>.

    I tested in Chrome with JavaScript disabled, and the <video> doesn't show.
    I think the video is inserted by JavaScript after the page has loaded.
    In this case, you cannot parse the video using Jsoup, because Jsoup does not execute JavaScript.

    In the second example, the code will pick up gray.png. If you want to skip it, check whether the srcset attribute exists, or else find another way to parse the page properly.
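
The srcset check suggested above could be sketched roughly as follows. This is only an illustration, not part of Jsoup: `SrcsetPicker` and `widest` are hypothetical names, the input is assumed to be the raw value of an image's srcset attribute, and candidates are assumed to be separated by a comma followed by whitespace.

```java
import java.util.Optional;

/** Hypothetical helper (not part of Jsoup): picks an image URL out of a srcset value. */
final class SrcsetPicker {

    private SrcsetPicker() {}

    /**
     * Returns the URL of the widest candidate in a srcset value such as
     * "a.jpg 240w, b.jpg 480w". Candidates without a width descriptor
     * (for example "a.jpg 2x" or a bare URL) are treated as width 0.
     * On a tie, the earliest candidate wins.
     */
    static Optional<String> widest(String srcset) {
        if (srcset == null || srcset.trim().isEmpty()) {
            return Optional.empty();
        }
        String bestUrl = null;
        int bestWidth = -1;
        // Assumption: URLs themselves do not contain a comma followed by whitespace.
        for (String candidate : srcset.trim().split(",\\s+")) {
            String[] parts = candidate.trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue;
            }
            int width = 0;
            if (parts.length > 1 && parts[1].endsWith("w")) {
                try {
                    width = Integer.parseInt(parts[1].substring(0, parts[1].length() - 1));
                } catch (NumberFormatException ignored) {
                    // Malformed descriptor: keep width 0.
                }
            }
            if (width > bestWidth) {
                bestWidth = width;
                bestUrl = parts[0];
            }
        }
        return Optional.ofNullable(bestUrl);
    }
}
```

With Jsoup, the input would typically come from `imgElem.attr("srcset")` when the src looks like a placeholder such as gray.png. The returned URL may be relative, so it would still need to be resolved against the page URL.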

  3. "I am looking for a generic logic
    that is able to get the first top image which covers an article."

    Be aware that generic logic cannot work for all websites. Some websites create or load their data (images and videos) dynamically at runtime. The <img> tag might not exist in the HTML source at all until a JavaScript function adds the image to the page later.

    Possible Solution:

    For webpages where an <img> or <video> tag already exists in the HTML source, generic logic can look like this:

    public String fetchTopImage() throws Exception {
        // One combined query, so matches come back in document order:
        // images with a src or a lazy-load data-src, and videos with a poster
        Elements candidates = document.select("img[src], img[data-src], video[poster]");

        // Iterate over the candidates to find the first valid image or video thumbnail
        for (Element elem : candidates) {
            String imageUrl;
            if (elem.tagName().equals("video")) {
                // A video's thumbnail is its poster attribute, not its src
                imageUrl = elem.absUrl("poster");
            } else {
                imageUrl = elem.absUrl("src");
                if (imageUrl.isEmpty()) {
                    // Fall back to data-src, a common lazy-loading convention
                    imageUrl = elem.absUrl("data-src");
                }
            }

            // absUrl returns "" (never null) when the attribute is missing or unresolvable
            if (!imageUrl.isEmpty()) {
                log.trace("Top image or video thumbnail URL is {}", imageUrl);
                return imageUrl; // Return the first valid URL found
            }
        }

        log.trace("No top image or video thumbnail found in {}", url);
        return null; // No valid image or video thumbnail was found
    }
    