
I tried using BeautifulSoup to scrape a Shopify site, but findAll('url') returns an empty list. How do I retrieve the desired content?

import requests
from bs4 import BeautifulSoup as soupify
import lxml

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()

pageSource = soupify(pageSource, "xml")
print(pageSource.findAll('url'))

The page that I’m trying to scrape: https://launch.toytokyo.com/sitemap_pages_1.xml

What I’m getting: an empty list

What I should be getting: not an empty list

Thanks everyone for helping. I figured out the problem in my code: I was using the older findAll spelling instead of find_all.
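
For reference, here is the original script with only that one change applied:

import requests
from bs4 import BeautifulSoup as soupify

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()

pageSource = soupify(pageSource, "xml")
# find_all is the current bs4 spelling of the method
print(pageSource.find_all('url'))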

4 Answers


  1. You can do:

    import requests
    from bs4 import BeautifulSoup as bs
    
    url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
    
    # html.parser works on this XML because the sitemap tags are already lowercase
    soup = bs(requests.get(url).content, 'html.parser')
    
    # Each <loc> tag holds one page URL
    urls = [i.text for i in soup.find_all('loc')]
    

    So basically I get the page contents, locate the loc tags that contain the URLs, and then grab their text 😉

    Update: if you need the full url tags and want to generate a dictionary for each child:

    urls = soup.find_all('url')
    
    # One {tag_name: text} dict per child element, skipping whitespace-only strings
    s = [[{k.name: k.text} for k in u if not isinstance(k, str)] for u in urls]
    

    Use from pprint import pprint as print to get a pretty-printed view of s:

    print(s)
    

    Note: you can use the lxml parser instead, as it is faster than html.parser.
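
    For example, only the parser argument changes (this assumes the lxml package is installed):

    # Faster HTML parsing via lxml
    soup = bs(requests.get(url).content, 'lxml')
    # Or strict XML parsing, which also preserves tag case
    soup = bs(requests.get(url).content, 'xml')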

  2. As an alternative to BeautifulSoup, you can always use xml.etree.ElementTree to parse the XML and pull the URLs out of the loc tags:

    from requests import get
    from xml.etree.ElementTree import fromstring, ElementTree
    from pprint import pprint
    
    url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
    
    req = get(url)
    tree = ElementTree(fromstring(req.text))
    
    urls = []
    for outer in tree.getroot():  # each <url> element
        for inner in outer:  # its <loc>/<lastmod>/<changefreq> children
            # inner.tag looks like '{namespace}tag', so split off the namespace
            namespace, tag = inner.tag.split("}")
            if tag == 'loc':
                urls.append(inner.text)
    
    pprint(urls)
    

    Which will give the following URLs in a list:

    ['https://launch.toytokyo.com/pages/about',
     'https://launch.toytokyo.com/pages/help',
     'https://launch.toytokyo.com/pages/terms',
     'https://launch.toytokyo.com/pages/visit-us']
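
    As a variant, instead of splitting the tag strings manually, you can hand findall a namespace map; the sm prefix below is an arbitrary local name, and the URI assumes the standard sitemap.org schema:

    from requests import get
    from xml.etree.ElementTree import fromstring
    
    url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
    # 'sm' is any prefix you like, bound to the assumed sitemap namespace
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    
    root = fromstring(get(url).text)
    urls = [loc.text for loc in root.findall('.//sm:loc', ns)]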
    

    From this, you can group your info into a collections.defaultdict:

    from requests import get
    from xml.etree.ElementTree import fromstring, ElementTree
    from collections import defaultdict
    from pprint import pprint
    
    url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
    
    req = get(url)
    tree = ElementTree(fromstring(req.text))
    
    data = defaultdict(dict)
    for i, outer in enumerate(tree.getroot()):
        for inner in outer:
            namespace, tag = inner.tag.split("}")
            data[i][tag] = inner.text
    
    pprint(data)
    

    Which gives the following defaultdict of dictionaries with indices as keys:

    defaultdict(<class 'dict'>,
                {0: {'changefreq': 'weekly',
                     'lastmod': '2018-07-26T14:37:12-07:00',
                     'loc': 'https://launch.toytokyo.com/pages/about'},
                 1: {'changefreq': 'weekly',
                     'lastmod': '2018-11-26T07:58:43-08:00',
                     'loc': 'https://launch.toytokyo.com/pages/help'},
                 2: {'changefreq': 'weekly',
                     'lastmod': '2018-08-02T08:57:58-07:00',
                     'loc': 'https://launch.toytokyo.com/pages/terms'},
                 3: {'changefreq': 'weekly',
                     'lastmod': '2018-05-21T15:02:36-07:00',
                     'loc': 'https://launch.toytokyo.com/pages/visit-us'}})
    

    If you wish to instead group by categories, then you can use a defaultdict of lists instead:

    data = defaultdict(list)
    for outer in tree.getroot():
        for inner in outer:
            namespace, tag = inner.tag.split("}")
            data[tag].append(inner.text)
    
    pprint(data)
    

    Which gives this different structure:

    defaultdict(<class 'list'>,
                {'changefreq': ['weekly', 'weekly', 'weekly', 'weekly'],
                 'lastmod': ['2018-07-26T14:37:12-07:00',
                             '2018-11-26T07:58:43-08:00',
                             '2018-08-02T08:57:58-07:00',
                             '2018-05-21T15:02:36-07:00'],
                 'loc': ['https://launch.toytokyo.com/pages/about',
                         'https://launch.toytokyo.com/pages/help',
                         'https://launch.toytokyo.com/pages/terms',
                         'https://launch.toytokyo.com/pages/visit-us']})
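
    If you later need one record per page back out of that column-oriented structure, zipping the parallel lists works (a small sketch, assuming all columns have the same length, as they do here):

    # Rebuild per-page dicts from the parallel lists
    keys = list(data)
    records = [dict(zip(keys, row)) for row in zip(*data.values())]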
    
  3. Another way, using XPath:

    import requests
    from lxml import html
    url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
    tree = html.fromstring(requests.get(url).content)
    # Grab the text of the <loc> inside each <url> entry
    links = [link.text for link in tree.xpath('//url/loc')]
    print(links)
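
    Note this works because lxml's HTML parser discards the XML namespace. If you parse the same document with lxml.etree instead, the tags stay namespace-qualified and the XPath needs a prefix (a sketch, assuming the standard sitemap.org namespace):

    import requests
    from lxml import etree
    
    url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
    root = etree.fromstring(requests.get(url).content)
    # 'sm' is an arbitrary prefix bound to the assumed sitemap namespace
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    links = root.xpath('//sm:url/sm:loc/text()', namespaces=ns)
    print(links)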
    
  4. I’ve tried to show exactly the approach you already attempted. The only thing you need to rectify is webSite.text: you get a valid response if you use webSite.content instead.

    This is the corrected version of your existing attempt:

    import requests
    from bs4 import BeautifulSoup
    
    webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
    pageSource = BeautifulSoup(webSite.content, "xml")
    for k in pageSource.find_all('url'):
        link = k.loc.text
        date = k.lastmod.text
        frequency = k.changefreq.text
        print(f'{link}\n{date}\n{frequency}\n')
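
    With the \n escapes in place, each page prints as a three-line block; going by the sitemap contents shown in answer 2, the first block would be:

    https://launch.toytokyo.com/pages/about
    2018-07-26T14:37:12-07:00
    weekly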
    