
I am using Scrapy’s SitemapSpider to pull all product links from their respective collections. My list of sites are all Shopify stores, and the code that links to the products looks like this:

<div class="grid__item grid-product medium--one-half large--one-third">
  <div class="grid-product__wrapper">
    <div class="grid-product__image-wrapper">
      <a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
        <img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
          </a>
      
    </div>

    <a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
      <span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
      <span class="grid-product__price-wrap">
        <span class="long-dash">—</span>
        <span class="grid-product__price">
          
            $ 15
          
        </span>
      </span>
      
    </a>
  </div>
</div>

Obviously, both hrefs are exactly the same. The problem is that I scrape both links when using the following code:

product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()

I’m trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag, to avoid duplicate links.

Although each site is a Shopify store, the source code of their collections pages isn’t identical, so the depth of the a tag under the div element is inconsistent and I’m not able to add a predicate like

//div[@class="grid__item grid-product medium--one-half large--one-third"]
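A minimal sketch of one way around this (the spider name, sitemap URL and rule below are placeholders, not taken from the question): extract every matching href and deduplicate afterwards, so the depth of the a tag under the div no longer matters.

from scrapy.spiders import SitemapSpider


class ProductLinkSpider(SitemapSpider):
    name = 'product_links'  # hypothetical spider name
    sitemap_urls = ['https://example-store.myshopify.com/sitemap.xml']  # placeholder
    sitemap_rules = [('/collections/', 'parse_collection')]

    def parse_collection(self, response):
        links = response.xpath(
            '//div//a[contains(@href, "collections") and '
            'contains(@href, "products")]/@href').extract()
        # Each product is linked twice (image link and meta link), so a set
        # keeps exactly one href per product regardless of markup depth.
        for href in set(links):
            yield {'product_url': response.urljoin(href)}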

2 Answers


  1. product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
    print(product_links[0])  # this is the href of your first a tag
    
  2. Just use the extract_first() method to extract only the first matched element. A benefit of using this is that it avoids an IndexError and returns None when it doesn’t find any element matching the selection.

    So, it should be :

    >>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
    u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
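
    As a small illustration (not part of the original answer) of why extract_first() is safer than indexing into extract():

    >>> no_match = response.xpath('//a[contains(@href, "does-not-exist")]/@href')
    >>> no_match.extract_first()        # returns None instead of raising
    >>> no_match.extract_first('N/A')   # or a default value you supply
    u'N/A'
    >>> no_match.extract()[0]           # would raise IndexError: list index out of range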
    