Hi all,
I am scraping questions on Amazon using the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/ask/questions/asin/B0000CFLYJ/1/ref=ask_ql_psf_ql_hza?isAnswered=true"
r = requests.get("http://localhost:8050/render.html", params={'url': url, 'wait': 3})
soup = BeautifulSoup(r.text, 'html.parser')
questions = soup.find_all('div', {'class': 'a-fixed-left-grid-col a-col-right'})
print(questions)
question_list = []
for item in questions:
    question = item.find('a', {'class': 'a-link-normal'}).text.strip()
    question_list.append(question)
But I keep getting the following error:
AttributeError: 'NoneType' object has no attribute 'text'
Do I need some sort of exception handler? Or should I extract the question text using a different element altogether? I've tried the span element nested below it, but to no avail:
<div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;">
  <a class="a-link-normal" href="/ask/questions/Tx150GKDGF6FGAY/ref=ask_ql_ql_al_hza">
    <span class="a-declarative" data-action="ask-no-op" data-ask-no-op='{"metricName":"top-question-text-click"}' data-csa-c-func-deps="aui-da-ask-no-op" data-csa-c-id="bsypsr-tzr1os-ttv9h7-td9hn6" data-csa-c-type="widget">
      It comes in already made in a spray bottle, but yet says it's concentrated and gives dilution instructions?
      So do I use it as is, or dilute?
    </span>
  </a>
</div>
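Given markup like that, a guarded lookup avoids the AttributeError whenever find() returns None for a div that matches the class but holds no question link. A minimal sketch, using a hypothetical stub of the HTML above in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stub mirroring the markup above
html = '''
<div class="a-fixed-left-grid-col a-col-right">
  <a class="a-link-normal" href="/ask/questions/Tx150GKDGF6FGAY/ref=ask_ql_ql_al_hza">
    <span class="a-declarative">So do I use it as is, or dilute?</span>
  </a>
</div>
<div class="a-fixed-left-grid-col a-col-right"></div>
'''

soup = BeautifulSoup(html, 'html.parser')
questions = []
for item in soup.find_all('div', {'class': 'a-fixed-left-grid-col a-col-right'}):
    link = item.find('a', {'class': 'a-link-normal'})
    if link:  # skip matching divs that contain no question link
        questions.append(link.get_text(strip=True))

print(questions)
```

The second, empty div stands in for whatever non-question element is triggering the error; the `if link:` check simply skips it instead of calling `.text` on None.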
I’m just trying to scrape the first page of questions before looping through the other pages. Any help would be greatly appreciated!
REVISION, based on @Unmitigated's answer below (scrapes multiple pages of questions). Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd

question_list = []

# Helper functions for the scrape
def get_soup(url):
    # Render the page through a local Splash instance
    r = requests.get("http://localhost:8050/render.html", params={'url': url, 'wait': 3})
    return BeautifulSoup(r.text, 'html.parser')

def get_questions(soup):
    for item in soup.select('.askTeaserQuestions > div'):
        question = item.find('a', {'class': 'a-link-normal'}).getText(strip=True)
        question_list.append(question)

# Loop through the pages, ten questions per page
for x in range(1, 6):
    soup = get_soup(f'https://www.amazon.com/ask/questions/asin/B0000CFLYJ/{x}')
    get_questions(soup)
    print(len(question_list))
    # Stop once the disabled "Next" element marks the last page
    if soup.find('li', {'class': 'a-disabled a-last'}):
        break

# Last step: use pandas to export the questions to an Excel file
df = pd.DataFrame(question_list)
df.to_excel('SimpleGreen_Amazon_Questions_22oz_1pk_diff_seller.xlsx', index=False)
print('Web scrape and export of questions completed successfully!')
2 Answers

You can try setting the User-Agent HTTP header so that the server responds correctly.

You can select the first link in all of the direct children of the element with class askTeaserQuestions.
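Putting both suggestions together: send a browser-like User-Agent header with the plain requests call, and take the first question link inside each direct child of the askTeaserQuestions container. A minimal sketch, where the User-Agent string and the stub markup used for the quick check are just examples:

```python
import requests
from bs4 import BeautifulSoup

def fetch_questions(url):
    # Identify as a browser so the server returns the full page (example UA string)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    r = requests.get(url, headers=headers)
    return parse_questions(r.text)

def parse_questions(html):
    soup = BeautifulSoup(html, 'html.parser')
    questions = []
    # First question link inside each direct child of the container
    for div in soup.select('.askTeaserQuestions > div'):
        link = div.select_one('a.a-link-normal')
        if link:
            questions.append(link.get_text(strip=True))
    return questions

# Quick check against a stub of the container markup
sample = '''<div class="askTeaserQuestions">
  <div><a class="a-link-normal" href="/q/1">Does it work?</a></div>
  <div><a class="a-link-normal" href="/q/2">Is it concentrated?</a></div>
</div>'''
print(parse_questions(sample))
```

Keeping the parsing in its own function makes the selector testable against sample HTML without hitting the live site.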