How to find first image or first video thumbnail in a webpage?
My code (below) retrieves an image, but it is not always the first visible image, and it doesn't fetch video thumbnails properly either.
Sample URLs:
The above are sample webpages, but I am looking for a generic logic that is able to get the first top image which covers an article.
Testing code:
    public String fetchTopImage() throws Exception {
        // Select the first <img> element in document order
        Element tImageElem = document.select("img").first();
        String tImage = null;
        if (tImageElem != null) {
            // Resolve the src attribute against the page's base URL
            tImage = tImageElem.absUrl("src");
            log.trace("top image url is {}", tImage);
        } else {
            log.trace("top image not found in {}", url);
        }
        return tImage;
    }
How can I make my code work more accurately?
3 Answers
Try this
In your sample case, the page looks like it was built with Next.js and styled-components, which means it might be inserting the video or image dynamically using JavaScript.
Possible Solution:
Crawling is ALWAYS site-specific, so in your case you should tailor your parser code to find exactly what you want. E.g., the video is not easy to parse because its thumbnail is inside an <iframe> which itself is inside another <iframe>.
I tested in Chrome with JavaScript disabled, and the <video> doesn't show. I think the video is inserted with JavaScript after the page has loaded.
In this case, you cannot parse the video using Jsoup.
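To illustrate why a JavaScript-inserted <video> is invisible to Jsoup, here is a minimal sketch. The HTML string and the class name StaticParseDemo are made up for illustration, not taken from the sample pages; the sketch assumes the jsoup library is on the classpath. Jsoup parses the markup as delivered and does not run scripts, so the element the script would create is never part of the parsed document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {

    // Hypothetical page: the <video> element would only exist after a
    // browser executes the script. In the raw source it is just script text.
    static final String HTML =
          "<html><body><div id=\"player\"></div>"
        + "<script>document.getElementById('player').innerHTML = "
        + "'<video src=\"clip.mp4\"></video>';</script>"
        + "</body></html>";

    // Counts the <video> elements Jsoup can see in the static markup.
    static int countVideos(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("video").size();
    }

    public static void main(String[] args) {
        System.out.println("video elements seen by Jsoup: " + countVideos(HTML));
    }
}
```

Running the same select against a DOM produced by a real browser engine would be expected to find the element, which is the reason for reaching for a browser-driving tool in such cases.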
Check whether the srcset attribute exists; the first image found should then be Camila Cabello's image.
If you want the first video, use HtmlUnit or Selenium.
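The srcset check could be sketched roughly as follows. This is only an illustration in plain Java: the class SrcsetPicker and its methods are hypothetical names, and the srcset parsing is deliberately simplified (it splits on a comma followed by whitespace and ignores the width and density descriptors). With Jsoup, the two input strings would typically come from element.attr("srcset") and element.absUrl("src").

```java
import java.util.ArrayList;
import java.util.List;

public class SrcsetPicker {

    // Splits a srcset value such as "a.jpg 1x, b.jpg 2x" into its
    // candidate URLs, in the order they appear.
    static List<String> candidateUrls(String srcset) {
        List<String> urls = new ArrayList<>();
        if (srcset == null) {
            return urls;
        }
        // Simplification: candidates are assumed to be separated by a
        // comma followed by whitespace.
        for (String candidate : srcset.split(",\\s+")) {
            String trimmed = candidate.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // The URL is the first token; a descriptor (1x, 640w) may follow.
            urls.add(trimmed.split("\\s+")[0]);
        }
        return urls;
    }

    // Prefers the first srcset candidate; falls back to src when the
    // srcset attribute is missing or empty. Returns null if neither exists.
    static String pickImageUrl(String src, String srcset) {
        List<String> urls = candidateUrls(srcset);
        if (!urls.isEmpty()) {
            return urls.get(0);
        }
        return (src == null || src.isEmpty()) ? null : src;
    }

    public static void main(String[] args) {
        // Placeholder src with a real image in srcset
        System.out.println(pickImageUrl("gray.png", "photo.jpg 1x, photo-large.jpg 2x"));
        // No srcset: fall back to src
        System.out.println(pickImageUrl("hero.jpg", ""));
    }
}
```

Preferring srcset over src is a heuristic, not a rule: it helps on pages that put a placeholder in src and the real image in srcset, but other sites may need different handling.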
In the case of the second example, the code will parse gray.png. If you want to skip it, check whether the srcset attribute exists, or else find another way to parse the page properly.

Be aware that generic logic cannot work for all websites. Some websites create or load their data (images and videos) dynamically while the page's code is running. The <img> tag might not exist in the HTML source code until some running JS function adds the image to the page.

Possible Solution:
For webpages where an <img> or <video> tag already exists in the source, generic logic can be done like this: