
I want to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some suggestions I found elsewhere, but they didn't seem to work with this particular website; I would end up finding no URLs at all.

import urllib.request
import lxml.html

# fetch the raw HTML and extract every href attribute
connection = urllib.request.urlopen('https://yeezysupply.com/pages/all')

dom = lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'):
    print(link)
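The XPath approach in the question is sound for pages whose links are present in the static HTML; the problem lies with this particular site, not the technique. As a minimal offline sketch of the same idea using only the standard library (the HTML snippet below is made up for illustration, not fetched from the site):

```python
from html.parser import HTMLParser

# Tiny parser that collects the href of every <a> tag it sees
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Made-up static HTML standing in for a fetched page
html = '<a href="/pages/clothing">Clothing</a><a href="/cart">Cart</a>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # → ['/pages/clothing', '/cart']
```

If this prints an empty list against the real page source, the links are not in the static HTML at all, which is exactly the situation described in the answers below.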

2 Answers


  1. There are no links in the page source; they are inserted by JavaScript after the page is loaded into the browser.
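To illustrate the point: many JavaScript-driven shops ship their product URLs inside a JSON blob in a `<script>` tag rather than as `<a>` elements, so an HTML parser finds no links even though the URLs are present in the source. A sketch under that assumption (the page snippet, the `PRODUCTS` variable, and the JSON shape are invented for illustration, not taken from yeezysupply.com):

```python
import json
import re

# Invented static source of a JS-rendered page: no <a> tags at all;
# the URLs live in a JSON payload that a script would render later.
page_source = """
<div id="app"></div>
<script>
  var PRODUCTS = {"items": [{"url": "/products/hoodie"}, {"url": "/products/dress"}]};
</script>
"""

# No anchors in the raw HTML...
assert '<a ' not in page_source

# ...but the embedded JSON can be pulled out and parsed directly.
match = re.search(r'var PRODUCTS = (\{.*?\});', page_source)
data = json.loads(match.group(1))
urls = [item['url'] for item in data['items']]
print(urls)  # → ['/products/hoodie', '/products/dress']
```

Whether the real site embeds its data this way would need to be checked in its actual page source; the alternative is to render the page with a browser-automation tool such as Selenium and scrape the resulting DOM.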

  2. Perhaps it would be useful to use modules specifically designed for this. Here's a quick-and-dirty script that gets the relative links on the page:

    #!/usr/bin/python3

    import requests
    import bs4

    # fetch the page and fail loudly on an HTTP error
    res = requests.get('https://yeezysupply.com/pages/all')
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    links = soup.find_all('a')

    for link in links:
        # .get() avoids a KeyError on anchors without an href attribute
        href = link.get('href')
        if href:
            print(href)
    

    It generates output like this:

    /pages/jewelry
    /pages/clothing
    /pages/footwear
    /pages/all
    /cart
    /products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
    /products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
    /products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
    /products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
    etc...
    

    Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
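The hrefs above are relative. If absolute URLs are needed, the standard library's `urllib.parse.urljoin` resolves them against the page URL; a small sketch using a few paths from the output above:

```python
from urllib.parse import urljoin

base = 'https://yeezysupply.com/pages/all'
relative_links = [
    '/pages/jewelry',
    '/cart',
    '/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall',
]

# urljoin handles root-relative paths, query strings, etc.
absolute = [urljoin(base, link) for link in relative_links]
for url in absolute:
    print(url)  # e.g. https://yeezysupply.com/pages/jewelry
```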
