I’m using the following code to scrape a Google SERP for some SEO work, but when I try reading the href attribute I get incorrect results, showing other weird URLs from the page but not the intended one. What is wrong with my code?
```python
import html
import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"

r = requests.get(URL)
webPage = html.unescape(r.text)
soup = BeautifulSoup(webPage, 'html.parser')

gresults = soup.findAll('h3')
for result in gresults:
    print(result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))
```
The output looks like this:
```
/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q
```
**2 Answers**
**What happens?**

Selecting `<h3>` alone will give you a result set that also contains unwanted elements. Moving up to the parent's parent is okay, but the `find_all()` call there (do not use the older `findAll()` syntax in new code) is not necessary and will also give you `<a>` elements you may not want.

**How to fix?**

Select your target element more specifically, and then you can use:
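For instance, a minimal sketch (it reuses the `soup` from your code and assumes BeautifulSoup 4.7+, where `select()` supports the `:has()` pseudo-class via soupsieve):

```python
# Select only the <a> tags that directly wrap a result <h3>,
# so each heading maps to exactly one link.
for a in soup.select('a:has(> h3)'):
    print(a.get('href'))
```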
But I would recommend going with the following example.
**Example**
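A sketch of that approach (the `a:has(> h3)` selector and the user-agent string are assumptions on my part; Google's markup changes often, so verify them before relying on this):

```python
import requests
from bs4 import BeautifulSoup

# A browser-like user-agent so Google serves the same HTML a real visit gets.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}
params = {"q": "beautiful soup"}

r = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")

# Only <a> tags that directly wrap a result <h3>: one title, one link per result.
for a in soup.select("a:has(> h3)"):
    print(a.h3.get_text())
    print(a.get("href"))
```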
**Output**

With a `user-agent` set, each printed `href` is the direct result URL rather than a `/url?q=…` redirect like the one in your question.
---

1. It will return all `<h3>` elements from the HTML, including headings such as the "Related searches", "Videos", and "People also ask" sections, which in this case is not what you were looking for.

2. This method of searching is fine in some cases, but not preferred here, since you're doing it somewhat blindly: if one of those `.parent` nodes (elements) disappears, the code will throw an error. Instead of method chaining like this, which can become unreadable (if there are a lot of parent nodes), call an appropriate CSS selector (more on that below).

3. `get('href')` would work, but you get such URLs because no `user-agent` is passed in the request `headers`, which is needed to "act" as a real user visit. When a `user-agent` is passed in the request `headers`, you'll get a proper URL as you expected (I don't know a proper explanation for such behavior).

If no `user-agent` is passed in the request `headers` while using the `requests` library, it defaults to `python-requests`, so Google or other search engines (websites) understand that it's a bot/script and might block the request, or the received HTML will be different from the one you see in your browser. Check what's your `user-agent`. List of `user-agents`.

Pass a `user-agent` in the request `headers`:
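For example (the user-agent string below is just an example value; any recent browser string works):

```python
import requests

headers = {
    # Identify the request as coming from a regular browser instead of
    # the default "python-requests/x.y.z".
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}

html = requests.get("https://www.google.com/search",
                    params={"q": "beautiful soup"},
                    headers=headers).text
```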
To make it work you need to:

1. Find a container with all the needed data (have a look at the SelectorGadget extension) by calling a specific CSS selector. CSS selectors reference.

Full code and example in the online IDE:
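A sketch of the full script (the `.tF2Cxc` result-container and `.yuRUbf a` link selectors are assumptions from the time of writing; Google's class names are obfuscated and change over time, so re-check them with SelectorGadget):

```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}
params = {"q": "beautiful soup", "hl": "en"}

html = requests.get("https://www.google.com/search",
                    params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "html.parser")

# Each organic result lives in one container; grab its title and link.
for result in soup.select(".tF2Cxc"):
    title = result.select_one("h3").text
    link = result.select_one(".yuRUbf a")["href"]
    print(title, link, sep="\n")
```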
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that it was created for such tasks. You don't have to figure out which CSS selector to use, how to bypass blocks from Google or other search engines, or how to maintain the code over time (if something in the HTML changes). Instead, focus on the data you want to get. Check out the playground (requires login).

Code to integrate:
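A sketch using SerpApi's Python client, `google-search-results` (`YOUR_API_KEY` is a placeholder for your own key):

```python
from serpapi import GoogleSearch  # pip install google-search-results

params = {
    "api_key": "YOUR_API_KEY",  # your SerpApi key
    "engine": "google",         # which search engine to use
    "q": "beautiful soup",      # search query
}

search = GoogleSearch(params)   # where the request is sent
results = search.get_dict()     # JSON response -> Python dict

for result in results["organic_results"]:
    print(result["title"])
    print(result["link"])
```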
P.S. I have a dedicated web scraping blog.