
I’m using the following code to scrape a SERP for some SEO work, but when I try reading the href attribute I get incorrect results showing other weird URLs from the page, not the one intended. What is wrong with my code?

import requests
from bs4 import BeautifulSoup
import html

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"
r = requests.get(URL)
webPage = html.unescape(r.text)

soup = BeautifulSoup(webPage, 'html.parser')
text =''
gresults = soup.findAll('h3') 

for result in gresults:
    print (result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))

The output looks like this:

/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q

2 Answers


  1. What happens?

    • Selecting <h3> alone will give you a result set that also contains unwanted elements (see the quick demonstration right after these bullets).

    • Moving up to the parent’s parent is okay, but the find_all() call (do not use the older findAll() syntax in new code) is not necessary; it will also return <a> elements you may not want.
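
    A quick, self-contained demonstration of the first point, using a made-up, heavily simplified results page (the #search structure here is an assumption modeled on real SERP markup):

    from bs4 import BeautifulSoup

    # hypothetical, stripped-down SERP: not every <h3> belongs to an organic result
    page = '''
    <div id="search">
      <a href="/url?q=https://example.com"><h3>Organic result</h3></a>
    </div>
    <div><h3>People also ask</h3></div>
    '''
    soup = BeautifulSoup(page, 'html.parser')

    print(len(soup.find_all('h3')))          # 2 -> includes the unwanted heading
    print(len(soup.select('#search a h3')))  # 1 -> only the organic result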

    How to fix?

    Select your target element more specifically, and then you can use:

    result.parent.parent.find('a',href=True).get('href')
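
    As a minimal sketch of how that line could fit into the original loop, assuming the request already returned the expected HTML (see the headers in the example below):

    # select only result headings instead of every <h3> on the page
    for result in soup.select('#search h3'):
        link = result.parent.parent.find('a', href=True)
        if link:  # guard in case the surrounding structure differs
            print(result.text, link.get('href'))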
    

    But I would recommend going with the following example.

    Example

    from bs4 import BeautifulSoup
    import requests


    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    url = 'http://www.google.com/search?q=beautiful+soup'

    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')

    data = []

    # loop variable renamed from r so it does not shadow the response
    for result in soup.select('#search a h3'):
        data.append({
            'title': result.text,
            'url': result.parent['href'],
        })
    print(data)
    

    Output

    [{'title': 'Beautiful Soup 4.9.0 documentation - Crummy',
      'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'},
     {'title': 'Beautiful Soup Tutorial: Web Scraping mit Python',
      'url': 'https://lerneprogrammieren.de/beautiful-soup-tutorial/'},
     {'title': 'Beautiful Soup 4 - Web Scraping mit Python | HelloCoding',
      'url': 'https://hellocoding.de/blog/coding-language/python/beautiful-soup-4'},
     {'title': 'Beautiful Soup - Wikipedia',
      'url': 'https://de.wikipedia.org/wiki/Beautiful_Soup'},
     {'title': 'Beautiful Soup (HTML parser) - Wikipedia',
      'url': 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'},
     {'title': 'Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...',
      'url': 'https://beautiful-soup-4.readthedocs.io/'},
     {'title': 'BeautifulSoup4 - PyPI',
      'url': 'https://pypi.org/project/beautifulsoup4/'},
     {'title': 'Web Scraping und Parsen von HTML in Python mit Beautiful ...',
      'url': 'https://www.twilio.com/blog/web-scraping-und-parsen-von-html-python-mit-beautiful-soup'}]
    
  2. 1. It will return all <h3> elements from the HTML, including headings from sections like "Related Searches", "Videos", and "People Also Ask", which in this case is not what you were looking for.

    gresults = soup.findAll('h3')
    

    2. This method of searching is fine in some cases, but it is not preferred here since you’re doing it somewhat blindly; if one of those .parent nodes (elements) disappears, it will throw an error.

    Instead of doing all of this, use the appropriate CSS selector (more on that below) without this method chaining, which can become unreadable (if there are a lot of parent nodes). A sketch of the failure mode follows the snippet below.

    result.parent.parent.find_all()
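
    As a sketch of that failure mode, using a contrived fragment where the <h3> has no grandparent element:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<h3>Title</h3>', 'html.parser')
    h3 = soup.find('h3')

    print(h3.parent.parent)  # None -> .find_all() on it raises AttributeError

    # a CSS selector lookup returns None safely instead, so it is easy to guard:
    link = soup.select_one('div a[href]')
    print(link['href'] if link else 'no link found')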
    

    3. get('href') would work, but you get such URLs because no user-agent is passed in the request headers, which is needed to "act" as a real user visit. When a user-agent is passed in the request headers, you’ll get the proper URL you expected (I don’t know of a proper explanation for this behavior).

    If no user-agent is passed in the request headers while using the requests library, it defaults to python-requests, so Google or other search engines (websites) understand that it’s a bot/script and might block the request, or the HTML received will be different from the one you see in your browser. Check what your user-agent is. List of user-agents.
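
    As a side note (not part of the fix itself): if you ever need to recover the target from such a /url?q=... redirect link, the q query parameter can be parsed out with the standard library, using the redirect URL from the question:

    from urllib.parse import urlparse, parse_qs

    # redirect-style link returned when no user-agent is sent (from the question)
    redirect = '/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q'

    # parse_qs() turns the query string into a dict of lists
    params = parse_qs(urlparse(redirect).query)
    print(params['q'][0])  # https://www.crummy.com/software/BeautifulSoup/bs4/doc/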

    Pass user-agent to request headers:

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    requests.get('URL', headers=headers)
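
    To see the user-agent that requests sends when you don’t set one (the python-requests default mentioned above), you can print the library’s default headers:

    import requests

    # shows e.g. 'python-requests/2.26.0' as the default User-Agent
    print(requests.utils.default_headers()['User-Agent'])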
    

    To make it work you need to:

    1. Find a container with all the needed data (have a look at the SelectorGadget extension) by calling a specific CSS selector. CSS selectors reference.

    Think of the container as a box with stuff inside, from which you grab items by specifying which item you want. In your case, it would be (without using two for loops):

    # .yuRUbf -> container
    for result in soup.select('.yuRUbf'):

        # .DKV0Md -> CSS selector for the title, located inside the container
        title = result.select_one('.DKV0Md').text

        # grab <a> and extract the href attribute.
        # .get('href') is equivalent to ['href'], except .get() returns None
        # instead of raising KeyError when the attribute is missing
        link = result.select_one('a')['href']
    

    Full code and example in the online IDE:

    import requests
    from bs4 import BeautifulSoup
    
    
    headers = {
        'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582'
    }
    
    response = requests.get('https://www.google.com/search?q=beautiful+soup', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    
    # enumerate() -> adds a counter to an iterable and returns it
    # https://www.programiz.com/python-programming/methods/built-in/enumerate
    for index, result in enumerate(soup.select('.yuRUbf')):
        position = index + 1
        title = result.select_one('.DKV0Md').text
        link = result.select_one('a')['href']
    
    print(position, title, link, sep='\n')
    
    
    # part of the output
    '''
    1
    Beautiful Soup 4.9.0 documentation - Crummy
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    2
    Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
    https://beautiful-soup-4.readthedocs.io/
    3
    BeautifulSoup4 - PyPI
    https://pypi.org/project/beautifulsoup4/
    ''' 
    

    Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It’s a paid API with a free plan.

    The difference in your case is that it was created for exactly such tasks. You don’t have to figure out which CSS selector to use, how to bypass blocks from Google or other search engines, or how to maintain the code over time (if something in the HTML changes). Instead, you can focus on the data you want to get. Check out the playground (requires login).

    Code to integrate:

    import os
    from serpapi import GoogleSearch
    
    params = {
        "api_key": os.getenv("API_KEY"),  # YOUR API KEY
        "engine": "google",               # search engine
        "q": "Beautiful Soup",            # query
        "hl": "en"                        # language
        # other parameters
    }
    
    search = GoogleSearch(params)
    results = search.get_dict()
    
    for result in results["organic_results"]:
        position = result["position"]          # website rank position
        title = result["title"]
        link = result["link"]
    
        print(position, title, link, sep="n")
    
    
    # part of the output
    '''
    1
    Beautiful Soup 4.9.0 documentation - Crummy
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    2
    Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
    https://beautiful-soup-4.readthedocs.io/
    3
    BeautifulSoup4 - PyPI
    https://pypi.org/project/beautifulsoup4/
    '''
    

    Disclaimer: I work for SerpApi.


    P.S. I also write a dedicated web scraping blog.
