
I am using Scrapy’s SitemapSpider to pull all product links from their respective collections. My list of sites are all Shopify stores, and the code that links to the products looks like this:

<div class="grid__item grid-product medium--one-half large--one-third">
  <div class="grid-product__wrapper">
    <div class="grid-product__image-wrapper">
      <a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
        <img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
          </a>
      
    </div>

    <a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
      <span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
      <span class="grid-product__price-wrap">
        <span class="long-dash">—</span>
        <span class="grid-product__price">
          
            $ 15
          
        </span>
      </span>
      
    </a>
  </div>
</div>

Obviously, both hrefs are exactly the same. The problem is that I scrape both links when using the following code:

product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()

I’m trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag, to avoid duplicate links.

Although each site is a Shopify store, the source code of their collections pages isn’t identical, so the depth of the a tag under the div element is inconsistent and I’m not able to add a predicate like

//div[@class="grid__item grid-product medium--one-half large--one-third"]
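A minimal sketch of one way around this (the spider name, sitemap URL and rule below are placeholders, not taken from the question): extract every matching href and deduplicate afterwards, so the depth of the a tag under the div no longer matters.

from scrapy.spiders import SitemapSpider


class ProductLinkSpider(SitemapSpider):
    name = 'product_links'  # hypothetical spider name
    sitemap_urls = ['https://example-store.myshopify.com/sitemap.xml']  # placeholder
    sitemap_rules = [('/collections/', 'parse_collection')]

    def parse_collection(self, response):
        links = response.xpath(
            '//div//a[contains(@href, "collections") and '
            'contains(@href, "products")]/@href').extract()
        # Each product is linked twice (image link and meta link), so a set
        # keeps exactly one href per product regardless of markup depth.
        for href in set(links):
            yield {'product_url': response.urljoin(href)}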

2 Answers


  1. product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
    print(product_links[0])  # this is the href of your first a tag
    
  2. Just use the extract_first() method to extract only the first matched element. A benefit of using this is that it avoids an IndexError and returns None when it doesn’t find any element matching the selection.

    So, it should be :

    >>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
    u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
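
    As a small illustration (not part of the original answer) of why extract_first() is safer than indexing into extract():

    >>> no_match = response.xpath('//a[contains(@href, "does-not-exist")]/@href')
    >>> no_match.extract_first()        # returns None instead of raising
    >>> no_match.extract_first('N/A')   # or a default value you supply
    u'N/A'
    >>> no_match.extract()[0]           # would raise IndexError: list index out of range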
    