
I’m trying to create a simple web scraper in Python that finds certain images on a website, downloads them, and builds a PDF from them. So far I have only written the scraping part of the code:

import requests
from bs4 import BeautifulSoup

url = 'website url'
page = requests.get(url)
print('=== website ===\n', url)
soup = BeautifulSoup(page.content, 'html.parser')

images = soup.find_all('img')

print('=== images found ===')

for img in images:
    if img.has_attr('src'):
        print(img['src'])

This is what I get:

=== website ===
 https://ita.net/stop-1/
=== images found ===
https://ita.net/wp-content/uploads/2021/09/021-5.jpg
https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg
https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/005-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/006-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/007-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/008-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/009-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/010-5-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/011-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/012-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/013-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/014-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/015-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/016-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/017-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/018-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/019-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/020-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/021-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/022-4-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/023-3-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/024-3-722x1024.jpg
https://ita.net/wp-content/uploads/2021/09/025-4-722x1024.jpg
https://ita.net/wp-content/uploads/2022/03/ita-sidebar-5.jpg
https://ita.net/wp-content/uploads/2022/03/telegram-1.jpg
https://ita.net/wp-content/uploads/2021/11/ita-logo-w-1-1024x311.png
https://ita.net/wp-content/uploads/2021/11/premium-1024x407.png

I said "certain" because my code finds every image on the site and prints it. However, I only want the images whose URL ends in 722x1024.jpg to be displayed (and later picked for the PDF).
Does anyone have an idea how to do this?

2 Answers


  1. imgs = []
    for img in images:
        if img.has_attr('src'):
            if "722x1024.jpg" in img['src']:
                imgs.append(img['src'])
    

    Or:

    img_list = soup.find_all(
                lambda tag:tag.name == 'img' and
                'src' in tag.attrs and '722x1024.jpg' in tag.attrs['src'])
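
One caveat with the `in` test used in both snippets: it matches the suffix anywhere in the URL, not only at the end. The sketch below illustrates the difference on plain strings. The first three URLs are copied from the question's output; the last one is made up for illustration and does not come from the site.

```python
# First three URLs come from the question's output.
# The last one is hypothetical: it contains the size suffix in the middle.
urls = [
    'https://ita.net/wp-content/uploads/2021/09/004-5-722x1024.jpg',
    'https://ita.net/wp-content/uploads/2021/09/021-5-430x350.jpg',
    'https://ita.net/wp-content/uploads/2021/11/premium-1024x407.png',
    'https://example.com/uploads/005-722x1024.jpg.webp',  # hypothetical
]

suffix = '722x1024.jpg'

# Substring test: keeps any URL that contains the suffix somewhere.
contains = [u for u in urls if suffix in u]

# Suffix test: keeps only URLs that actually end with it.
ends_with = [u for u in urls if u.endswith(suffix)]

print(contains)
print(ends_with)
```

For the URLs in the question both tests pick the same images, so `in` is fine there; `endswith()` is just the stricter reading of "ends in 722x1024.jpg".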
    
  2. First, you can pass {'src': True} to find_all() to get only the images that have a src attribute.

    Because src is a string, you can use any string method on it, e.g. .endswith():

    images = soup.find_all('img', {'src': True})
    
    for img in images:
        if img['src'].endswith('722x1024.jpg'):
            print(img['src'])
    

    BeautifulSoup also accepts a function as the attribute filter in find_all():

    def check(src):
        return (src is not None) and src.endswith('722x1024.jpg')
    
    images = soup.find_all('img', {'src': check})
    
    for img in images:
        print(img['src'])
    

    Or with a lambda:

    images = soup.find_all('img', {'src': lambda x: (x is not None) and x.endswith('722x1024.jpg')})
    
    for img in images:
        print(img['src'])
    

    The filter can also be a compiled regex (the dot is escaped so that it matches a literal "."):

    import re
    
    images = soup.find_all('img', {'src': re.compile(r'722x1024\.jpg$')})
    
    for img in images:
        print(img['src'])    
    

    Minimal working example.

    I search for images ending in 0.jpg on books.toscrape.com, a site created by the authors of the Scrapy module specifically for learning scraping.

    (see also toscrape.com)

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://books.toscrape.com/'
    
    response = requests.get(url)
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    print('--- version 1 ---')
    
    images = soup.find_all('img', {'src': True})
    
    for img in images:
        if img['src'].endswith('0.jpg'):
            print(img['src'])
    
    print('--- version 2 a ---')
    
    def check(src):
        return (src is not None) and src.endswith('0.jpg')
    
    images = soup.find_all('img', {'src': check})
    
    for img in images:
        print(img['src'])
    
    print('--- version 2 b ---')
    
    images = soup.find_all('img', {'src': lambda x: (x is not None) and x.endswith('0.jpg')})
    
    for img in images:
        print(img['src'])
        
    print('--- version 3 ---')
    
    import re
    
    images = soup.find_all('img', {'src': re.compile(r'0\.jpg$')})
    
    for img in images:
        print(img['src'])    
    

    Results:

    --- version 1 ---
    media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
    media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
    --- version 2 a ---
    media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
    media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
    --- version 2 b ---
    media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
    media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
    --- version 3 ---
    media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
    media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
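
The `src` values in these results are relative paths, so they would need to be resolved against the page URL before the images could be downloaded. A minimal sketch of that step, using `urllib.parse.urljoin` from the standard library; the base URL and the two paths are copied from the example above, and nothing is fetched here:

```python
from urllib.parse import urljoin

# Page URL and relative src values copied from the example above.
base_url = 'https://books.toscrape.com/'
srcs = [
    'media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
    'media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg',
]

# Resolve each src against the page URL.
# A src that is already absolute is returned unchanged by urljoin.
absolute = [urljoin(base_url, src) for src in srcs]

for image_url in absolute:
    print(image_url)
```

On the site from the question the `src` values are already absolute, so this step would leave them as they are.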
    