I tried using BeautifulSoup to scrape a Shopify site, but findAll('url')
returns an empty list. How do I retrieve the desired content?
import requests
from bs4 import BeautifulSoup as soupify
import lxml
webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()
pageSource = soupify(pageSource, "xml")
print(pageSource.findAll('url'))
The page that I’m trying to scrape: https://launch.toytokyo.com/sitemap_pages_1.xml
What I’m getting: an empty list
What I should be getting: not an empty list
Thanks, everyone, for helping. I figured out the problem in my code: I was using the older findAll instead of find_all.
4 Answers
You can do:
So basically I get the contents and locate the loc tags that contain the URLs, then I grab their content 😉
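A minimal sketch of that idea, run against a small inline sample rather than the live sitemap. The sample document, the example.com URLs, and the choice of html.parser are assumptions for illustration; for the real page you would pass the body of the requests response instead.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded sitemap.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/pages/about</loc>
    <lastmod>2019-01-01T00:00:00-05:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/pages/contact</loc>
    <lastmod>2019-02-01T00:00:00-05:00</lastmod>
  </url>
</urlset>
"""

soup = BeautifulSoup(SAMPLE_SITEMAP, "html.parser")

# Every <loc> tag holds one URL; collect the text of each.
urls = [loc.get_text(strip=True) for loc in soup.find_all("loc")]
print(urls)
```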
Updated: now uses the required url tag and generates a dictionary
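One way the url-tag version could look, sketched on an inline sample. The name s, the index keys, and the sample data are assumptions; the exact dictionary layout of the original update is not known.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded sitemap.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/pages/about</loc>
    <lastmod>2019-01-01T00:00:00-05:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/pages/contact</loc>
    <lastmod>2019-02-01T00:00:00-05:00</lastmod>
  </url>
</urlset>
"""

soup = BeautifulSoup(SAMPLE_SITEMAP, "html.parser")

# Map each <url> entry to {child tag name: text}, keyed by its position.
s = {}
for index, url_tag in enumerate(soup.find_all("url")):
    children = url_tag.find_all(True, recursive=False)  # direct child tags only
    s[index] = {child.name: child.get_text(strip=True) for child in children}

print(s)
```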
Use from pprint import pprint as print to get a beautiful print of s:
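A short illustration of the difference pprint makes on a nested dictionary. The contents of s here are made up, and the sketch imports pprint under its own name so the built-in print stays available; aliasing it as print, as suggested above, behaves the same way.

```python
from pprint import pformat, pprint

# Hypothetical dictionary shaped like the one built from the <url> tags.
s = {
    0: {"loc": "https://example.com/pages/about", "lastmod": "2019-01-01T00:00:00-05:00"},
    1: {"loc": "https://example.com/pages/contact", "lastmod": "2019-02-01T00:00:00-05:00"},
}

# pprint wraps long structures across several lines instead of one long line.
pprint(s)

# pformat returns the same layout as a string, e.g. for logging.
formatted = pformat(s)
```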
Note: you can use the lxml parser, as it is faster than html.parser.
As an alternative to BeautifulSoup, you can always use xml.etree.ElementTree to parse your XML URLs located at the loc tag:
Which will give the following URLs in a list:
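A minimal sketch of the ElementTree approach, assuming the sitemap uses the standard sitemaps.org namespace. The inline sample and its example.com URLs are hypothetical; with the live page you would feed the response bytes to ET.fromstring.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the downloaded sitemap.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/pages/about</loc></url>
  <url><loc>https://example.com/pages/contact</loc></url>
</urlset>
"""

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

root = ET.fromstring(SAMPLE_SITEMAP)

# Namespaced tags are addressed as "{namespace}tag" in ElementTree.
urls = [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]
print(urls)
```

Forgetting the namespace is a common reason a plain search for "loc" or "url" finds nothing with ElementTree.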
From this, you can group your info into a collections.defaultdict:
Which gives the following defaultdict of dictionaries with indices as keys:
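One way this grouping could look, assuming every URL path has the form /category/page. The sample URLs and the "category" and "page" keys are hypothetical choices for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical URLs, as they might come out of the <loc> tags.
urls = [
    "https://example.com/pages/about",
    "https://example.com/pages/contact",
    "https://example.com/collections/new-arrivals",
]

info = defaultdict(dict)
for index, url in enumerate(urls):
    # "/pages/about" -> "pages", "about"
    category, page = urlparse(url).path.strip("/").split("/", 1)
    info[index]["category"] = category
    info[index]["page"] = page

print(dict(info))
```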
If you wish to instead group by categories, then you can use a defaultdict of lists instead:
Which gives this different structure:
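A sketch of the list-based variant under the same assumption that every URL path has the form /category/page. The sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical URLs, as they might come out of the <loc> tags.
urls = [
    "https://example.com/pages/about",
    "https://example.com/pages/contact",
    "https://example.com/collections/new-arrivals",
]

by_category = defaultdict(list)
for url in urls:
    category, page = urlparse(url).path.strip("/").split("/", 1)
    # Missing keys start as an empty list, so append works right away.
    by_category[category].append(page)

print(dict(by_category))
```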
Another way using xpath
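A small sketch of an XPath-style lookup. It uses the limited XPath subset built into xml.etree.ElementTree, which is an assumption: the original answer may well have used a different library such as lxml. The sample document and the "sm" prefix are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the downloaded sitemap.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/pages/about</loc></url>
  <url><loc>https://example.com/pages/contact</loc></url>
</urlset>
"""

# Prefix-to-namespace map used to resolve "sm:" in the path expression.
NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE_SITEMAP)

# ".//sm:loc" selects every <loc> element at any depth below the root.
xpath_urls = [loc.text for loc in root.findall(".//sm:loc", NAMESPACES)]
print(xpath_urls)
```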
I’ve tried to show exactly the way you have already tried. The only thing you need to rectify is webSite.text. You could get a valid response if you used webSite.content instead.
This is the corrected version of your existing attempt:
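A sketch of what such a correction might look like, keeping the "xml" parser (which needs lxml installed) and passing bytes from .content instead of the decoded .text. The helper names find_url_tags and fetch_url_tags and the timeout value are hypothetical additions.

```python
import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://launch.toytokyo.com/sitemap_pages_1.xml"


def find_url_tags(xml_bytes):
    """Parse raw sitemap bytes and return every <url> tag."""
    soup = BeautifulSoup(xml_bytes, "xml")
    return soup.find_all("url")


def fetch_url_tags(url=SITEMAP_URL):
    """Download the sitemap and return its <url> tags."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # .content is the undecoded body, so the parser reads the XML
    # declaration and picks the encoding itself.
    return find_url_tags(response.content)


# Usage, requires network access:
#     for tag in fetch_url_tags():
#         print(tag.loc.text)
```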