I'm trying to create a simple web scraper in Python that finds and downloads certain images from a website and builds a PDF from them. So far I have only written the scraping part of the code:
import requests
from bs4 import BeautifulSoup
import numpy as np
url = 'website url'
page = requests.get(url)
print('=== website ===\n', url)
soup = BeautifulSoup(page.content, 'html.parser')
images = soup.find_all('img')
print('=== images found ===')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
This is what I get:
=== website ===
https://ita.net/stop-1/
=== images found ===
https://ita.net/wp-content/uploads/2021/09/021-5.jpg
https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg
https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/005-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/006-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/007-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/008-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/009-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/010-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/011-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/012-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/013-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/014-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/015-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/016-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/017-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/018-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/019-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/020-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/021-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/022-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/023-3-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/024-3-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/025-4-722x1024.jpg
https://ita.net/wp-content/uploads/2022/03/ita-sidebar-5.jpg
https://ita.net/wp-content/uploads/2022/03/telegram-1.jpg
https://ita.net/wp-content/uploads/2021/11/ita-logo-w-1-1024x311.png
https://ita.net/wp-content/uploads/2021/11/premium-1024x407.png
I said "certain" because my code finds and prints every image on the page. However, I only want the images whose URL ends in 722x1024.jpg to be displayed (and therefore selected).
Does anyone have an idea how to do this?
2 Answers
First: you can use {'src': True} to get only the images that have a src attribute.
Because src is a string, you can use any string method on it, e.g. .endswith().
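A minimal sketch of this approach, run against a small inline HTML snippet instead of a live page. The sample markup and the variable names are assumptions made for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<img src="https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg">
<img src="https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg">
<img src="https://ita.net/wp-content/uploads/2021/09/005-5-722x1024.jpg">
<img alt="no src here">
"""

soup = BeautifulSoup(html, 'html.parser')

# {'src': True} keeps only <img> tags that actually have a src attribute.
images = soup.find_all('img', {'src': True})

# src is a plain string, so str.endswith() can do the filtering.
wanted = [img['src'] for img in images if img['src'].endswith('722x1024.jpg')]

for src in wanted:
    print(src)
```

With a real page, html would be replaced by page.content from requests.get(url).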
BeautifulSoup also lets you pass a function to find() and find_all(), either a named function or a lambda.
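The function-based filter might look like the sketch below. The helper name is_wanted and the sample markup are hypothetical. The None check is there because BeautifulSoup passes None to the function for tags that lack the attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<img src="https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg">
<img src="https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg">
<img alt="no src here">
"""

soup = BeautifulSoup(html, 'html.parser')


def is_wanted(src):
    # Called with the value of src, or None when the tag has no src.
    return src is not None and src.endswith('722x1024.jpg')


# Variant 1: a named function as the attribute filter.
by_function = soup.find_all('img', src=is_wanted)

# Variant 2: the same test written inline as a lambda.
by_lambda = soup.find_all(
    'img', src=lambda s: s is not None and s.endswith('722x1024.jpg')
)

for img in by_function:
    print(img['src'])
```

Both variants select the same tags. The named function is easier to reuse, while the lambda keeps a one-off filter next to the call.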
It can also take a compiled regex.
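A short sketch of the regex variant, again on assumed sample markup. BeautifulSoup searches the pattern within the attribute value, so the $ anchor is what restricts the match to the end of the URL:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<img src="https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg">
<img src="https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg">
<img src="https://ita.net/wp-content/uploads/2021/11/premium-1024x407.png">
"""

soup = BeautifulSoup(html, 'html.parser')

# The dot is escaped so it matches a literal '.', and $ anchors the end.
pattern = re.compile(r'722x1024\.jpg$')
images = soup.find_all('img', src=pattern)

matched = [img['src'] for img in images]
for src in matched:
    print(src)
```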
Minimal working example.
I search for images whose src ends in 0.jpg on books.toscrape.com, a site created by the authors of the scrapy module specifically for learning scraping (see also toscrape.com).
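The original example code is not reproduced here, so the following is only a sketch of how such a search could be written. The function name find_image_urls, the sample markup, and the relative image paths are all assumptions. It runs on an inline snippet so it works offline, and it uses urljoin in case the page uses relative src values:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_image_urls(html, base_url, suffix):
    """Return absolute URLs of <img> tags whose src ends with suffix."""
    soup = BeautifulSoup(html, 'html.parser')
    images = soup.find_all(
        'img', src=lambda s: s is not None and s.endswith(suffix)
    )
    # Resolve relative paths against the page URL.
    return [urljoin(base_url, img['src']) for img in images]


# Hypothetical markup with relative paths, standing in for a fetched page.
sample_html = """
<img src="media/cache/aa/10.jpg" alt="first">
<img src="media/cache/bb/21.jpg" alt="second">
<img src="media/cache/cc/30.jpg" alt="third">
"""

urls = find_image_urls(sample_html, 'https://books.toscrape.com/', '0.jpg')
for url in urls:
    print(url)
```

For a live page, the html argument could be requests.get(url).content, and for the question's site the suffix would be '722x1024.jpg'.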
Results: