I’m using the following code to scrape a Google SERP for some SEO work, but when I try reading the href attribute I get incorrect results, showing other weird URLs from the page but not the intended one. What is wrong with my code?
```python
import html
import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"

r = requests.get(URL)
webPage = html.unescape(r.text)
soup = BeautifulSoup(webPage, 'html.parser')

gresults = soup.findAll('h3')
for result in gresults:
    print(result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))
```
The output looks like this:
```
/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q
```
**2 Answers**
**What happens?**

Selecting `<h3>` alone will give you a result set that also contains unwanted elements. Moving up to the parent's parent is okay, but the `find_all()` call there (do not use the older `findAll()` syntax in new code) is not necessary and will also give you `<a>` elements you may not want.

**How to fix?**

Select your target element more specifically, and then you can use:
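For instance, a minimal sketch (it reuses the `soup` from your code and assumes BeautifulSoup 4.7+, where `select()` supports the `:has()` pseudo-class via soupsieve):

```python
# Select only the <a> tags that directly wrap a result <h3>,
# so each heading maps to exactly one link.
for a in soup.select('a:has(> h3)'):
    print(a.get('href'))
```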
But I would recommend going with the following example.
**Example**
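A sketch of that approach (the `a:has(> h3)` selector and the user-agent string are assumptions on my part; Google's markup changes often, so verify them before relying on this):

```python
import requests
from bs4 import BeautifulSoup

# A browser-like user-agent so Google serves the same HTML a real visit gets.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}
params = {"q": "beautiful soup"}

r = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")

# Only <a> tags that directly wrap a result <h3>: one title, one link per result.
for a in soup.select("a:has(> h3)"):
    print(a.h3.get_text())
    print(a.get("href"))
```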
**Output**

With a `user-agent` set, each printed `href` is the direct result URL rather than a `/url?q=…` redirect like the one in your question.
---

1. It will return all `<h3>` elements from the HTML, including headings such as the "Related searches", "Videos", and "People also ask" sections, which in this case is not what you were looking for.

2. This method of searching is fine in some cases, but not preferred here, since you're doing it somewhat blindly: if one of those `.parent` nodes (elements) disappears, the code will throw an error. Instead of method chaining like this, which can become unreadable (if there are a lot of parent nodes), call an appropriate CSS selector (more on that below).

3. `get('href')` would work, but you get such URLs because no `user-agent` is passed in the request `headers`, which is needed to "act" as a real user visit. When a `user-agent` is passed in the request `headers`, you'll get a proper URL as you expected (I don't know a proper explanation for such behavior).

If no `user-agent` is passed in the request `headers` while using the `requests` library, it defaults to `python-requests`, so Google or other search engines (websites) understand that it's a bot/script and might block the request, or the received HTML will be different from the one you see in your browser. Check what's your `user-agent`. List of `user-agents`.

Pass a `user-agent` in the request `headers`:
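For example (the user-agent string below is just an example value; any recent browser string works):

```python
import requests

headers = {
    # Identify the request as coming from a regular browser instead of
    # the default "python-requests/x.y.z".
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}

html = requests.get("https://www.google.com/search",
                    params={"q": "beautiful soup"},
                    headers=headers).text
```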
To make it work you need to:

1. Find a container with all the needed data (have a look at the SelectorGadget extension) by calling a specific CSS selector. CSS selectors reference.

Full code and example in the online IDE:
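A sketch of the full script (the `.tF2Cxc` result-container and `.yuRUbf a` link selectors are assumptions from the time of writing; Google's class names are obfuscated and change over time, so re-check them with SelectorGadget):

```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}
params = {"q": "beautiful soup", "hl": "en"}

html = requests.get("https://www.google.com/search",
                    params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "html.parser")

# Each organic result lives in one container; grab its title and link.
for result in soup.select(".tF2Cxc"):
    title = result.select_one("h3").text
    link = result.select_one(".yuRUbf a")["href"]
    print(title, link, sep="\n")
```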
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that it was created for such tasks. You don't have to figure out which CSS selector to use, how to bypass blocks from Google or other search engines, or how to maintain the code over time (if something in the HTML changes). Instead, focus on the data you want to get. Check out the playground (requires login).

Code to integrate:
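A sketch using SerpApi's Python client, `google-search-results` (`YOUR_API_KEY` is a placeholder for your own key):

```python
from serpapi import GoogleSearch  # pip install google-search-results

params = {
    "api_key": "YOUR_API_KEY",  # your SerpApi key
    "engine": "google",         # which search engine to use
    "q": "beautiful soup",      # search query
}

search = GoogleSearch(params)   # where the request is sent
results = search.get_dict()     # JSON response -> Python dict

for result in results["organic_results"]:
    print(result["title"])
    print(result["link"])
```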
P.S. I have a dedicated web scraping blog.