BeautifulSoup not finding XML tag, how do I fix this? - Shopify

ahmadafiquddin
December 31, 2018
261 views
4 votes
4 Answers

I tried using BeautifulSoup to scrape a Shopify site, but findAll('url') returns an empty list. How do I retrieve the desired content?

import requests
from bs4 import BeautifulSoup as soupify
import lxml

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()

pageSource = soupify(pageSource, "xml")
print(pageSource.findAll('url'))

The page that I’m trying to scrape: https://launch.toytokyo.com/sitemap_pages_1.xml

What I’m getting: an empty list

What I should be getting: not an empty list

Thanks, everyone, for helping. I figured out the problem in my code: I was using the older findAll name instead of find_all.

Answers

- PraysonWDaniel
- December 31, 2018 at 11:25 am
- 0 votes
You can do:
```
import requests
from bs4 import BeautifulSoup as bs

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

soup = bs(requests.get(url).content,'html.parser')


urls = [i.text for i in soup.find_all('loc')]
```
Basically, I fetch the content, locate the loc tags that contain the URLs, and then grab their text 😉

Update: to get each url tag and generate dictionaries from its child tags:
```
urls = soup.find_all('url')

# For each <url> tag, build one {tag name: text} dict per child tag,
# skipping the whitespace strings between tags.
s = [[{k.name: k.text} for k in url if not isinstance(k, str)] for url in urls]
```
Use from pprint import pprint as print to get a beautiful print of s:
```
print(s)
```
Note: you can use the lxml parser instead, as it is faster than html.parser.
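
The comprehension above yields, for each url entry, a list of one-key dictionaries. If a single dictionary per entry is more convenient, the pieces can be merged afterwards. A minimal sketch, assuming s has the shape described above: the sample list here is hand-written from sitemap values quoted in this thread rather than produced by a live request, and merged is a name introduced only for illustration.

```python
# Hand-written stand-in for `s`: one list of single-key dicts per <url> entry.
s = [
    [
        {"loc": "https://launch.toytokyo.com/pages/about"},
        {"lastmod": "2018-07-26T14:37:12-07:00"},
        {"changefreq": "weekly"},
    ],
    [
        {"loc": "https://launch.toytokyo.com/pages/help"},
        {"lastmod": "2018-11-26T07:58:43-08:00"},
        {"changefreq": "weekly"},
    ],
]

# Merge each entry's single-key dicts into one dict per entry.
merged = [{key: value for d in entry for key, value in d.items()} for entry in s]

for record in merged:
    print(record["loc"], record["lastmod"], record["changefreq"])
```

Each element of merged can then be indexed by tag name, which tends to be easier to work with than a list of one-key dictionaries.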

As an alternative to BeautifulSoup, you can use xml.etree.ElementTree to parse the XML and collect the URLs stored in the loc tags:

from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from pprint import pprint

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

req = get(url)
tree = ElementTree(fromstring(req.text))

urls = []
for outer in tree.getroot():
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        if tag == 'loc':
            urls.append(inner.text)

pprint(urls)

Which will give the following URLs in a list:

['https://launch.toytokyo.com/pages/about',
 'https://launch.toytokyo.com/pages/help',
 'https://launch.toytokyo.com/pages/terms',
 'https://launch.toytokyo.com/pages/visit-us']
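
The loop above removes the namespace by splitting each tag on "}". ElementTree can also match namespaced tags directly if it is given a prefix map. A minimal sketch, assuming the sitemap declares the standard sitemaps.org namespace; the sm prefix and the inline SAMPLE document are introduced here for illustration so that the snippet runs without a network request.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the one defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample standing in for the downloaded sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://launch.toytokyo.com/pages/about</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://launch.toytokyo.com/pages/help</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# "sm:url/sm:loc" selects every <loc> that is a child of a <url> under the root.
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)
```

With a live response, the same findall call could be applied to the root parsed from the downloaded document, provided the namespace assumption holds.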

From this, you can group your info into a collections.defaultdict:

from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from collections import defaultdict
from pprint import pprint

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

req = get(url)
tree = ElementTree(fromstring(req.text))

data = defaultdict(dict)
for i, outer in enumerate(tree.getroot()):
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        data[i][tag] = inner.text

pprint(data)

Which gives the following defaultdict of dictionaries with indices as keys:

defaultdict(<class 'dict'>,
            {0: {'changefreq': 'weekly',
                 'lastmod': '2018-07-26T14:37:12-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/about'},
             1: {'changefreq': 'weekly',
                 'lastmod': '2018-11-26T07:58:43-08:00',
                 'loc': 'https://launch.toytokyo.com/pages/help'},
             2: {'changefreq': 'weekly',
                 'lastmod': '2018-08-02T08:57:58-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/terms'},
             3: {'changefreq': 'weekly',
                 'lastmod': '2018-05-21T15:02:36-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/visit-us'}})
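
One caveat with inner.tag.split("}") is that the two-name unpacking fails for a tag that carries no namespace. A minimal sketch of a more forgiving variant, which also collects plain dictionaries in a list instead of using indices as keys; local_name, SAMPLE, and records are names introduced here for illustration, and the sample document is hand-written.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix, if it has one."""
    return tag.rpartition("}")[2]


# Hand-written sample standing in for the downloaded sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://launch.toytokyo.com/pages/about</loc>
    <lastmod>2018-07-26T14:37:12-07:00</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# One dict per <url> element, keyed by the child tag's local name.
records = [{local_name(child.tag): child.text for child in url} for url in root]
print(records)
```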

If you wish to instead group by categories, then you can use a defaultdict of lists instead:

data = defaultdict(list)
for outer in tree.getroot():
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        data[tag].append(inner.text)

pprint(data)

Which gives this different structure:

defaultdict(<class 'list'>,
            {'changefreq': ['weekly', 'weekly', 'weekly', 'weekly'],
             'lastmod': ['2018-07-26T14:37:12-07:00',
                         '2018-11-26T07:58:43-08:00',
                         '2018-08-02T08:57:58-07:00',
                         '2018-05-21T15:02:36-07:00'],
             'loc': ['https://launch.toytokyo.com/pages/about',
                     'https://launch.toytokyo.com/pages/help',
                     'https://launch.toytokyo.com/pages/terms',
                     'https://launch.toytokyo.com/pages/visit-us']})

- QHarr
- December 31, 2018 at 12:19 pm
- 0 votes
Another way, using XPath:
```
import requests
from lxml import html
url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
tree = html.fromstring( requests.get(url).content)
links = [link.text for link in tree.xpath('//url/loc')]
print(links)
```

- SIM
- December 31, 2018 at 2:17 pm
- 0 votes
I’ve kept this as close as possible to your own attempt. The only thing you need to change is webSite.text: you should get a valid response if you use webSite.content instead.

This is the corrected version of your existing attempt:
```
import requests
from bs4 import BeautifulSoup

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = BeautifulSoup(webSite.content, "xml")
for k in pageSource.find_all('url'):
    link = k.loc.text
    date = k.lastmod.text
    frequency = k.changefreq.text
    print(f'{link}\n{date}\n{frequency}\n')
```
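
One general reason to hand a parser the raw bytes (webSite.content) rather than the decoded string (webSite.text) is that an XML parser can then read the encoding from the document's own XML declaration. The sketch below illustrates this with the standard library's xml.etree.ElementTree instead of BeautifulSoup, on a made-up one-element document encoded as ISO-8859-1. It is an illustration of the general point, not a claim about what went wrong with this particular sitemap.

```python
import xml.etree.ElementTree as ET

# Made-up document, encoded as ISO-8859-1 and declaring that encoding.
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><loc>café</loc>'.encode("iso-8859-1")

# Parsing the bytes lets the parser honour the declared encoding.
from_bytes = ET.fromstring(raw).text

# Decoding with a wrong guess (UTF-8 here) before parsing damages the text.
wrong_guess = raw.decode("utf-8", errors="replace")

print(from_bytes)
print("\ufffd" in wrong_guess)
```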