I am using Scrapy’s SitemapSpider to pull all product links from their respective collections. The sites on my list are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem I’m having is that both links get scraped when I use the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I’m trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site is a Shopify store, the source code of their collections pages isn’t exactly the same. So the depth of the a tag under the div element is inconsistent, and I’m not able to add a predicate like
//div[@class="grid__item grid-product medium--one-half large--one-third"]
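A likely reason both links come back: a positional predicate such as [1] written directly after //a is evaluated per parent element, not over the whole result. Here each a tag is the first matching a child of its own parent div, so both match. The sketch below illustrates this with the standard library's xml.etree.ElementTree instead of Scrapy, so it runs without extra dependencies. Its XPath subset has no contains(), so the href filter is done in Python. The markup is a simplified, well-formed version of the snippet above, and unique_in_order is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the product markup (an assumption:
# the <img> and price spans are dropped because they do not affect the result).
SNIPPET = """
<div class="grid__item">
  <div class="grid-product__wrapper">
    <div class="grid-product__image-wrapper">
      <a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet"/>
    </div>
    <a class="grid-product__meta" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet"/>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# [1] is applied per parent: each <a> is the first <a> child of its own
# parent <div>, so both elements are selected.
first_per_parent = [a.get("href") for a in root.findall(".//a[1]")]


def unique_in_order(hrefs):
    """Drop repeated hrefs while keeping first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(hrefs))


# Keep only product links inside a collection, then de-duplicate.
product_links = unique_in_order(
    h for h in first_per_parent if "collections" in h and "products" in h
)

print(len(first_per_parent))
print(product_links)
```

Because the de-duplication happens on the extracted strings, it does not depend on how deeply the a tags are nested. In a spider, the same helper could presumably wrap the list returned by .extract().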
2 Answers
Just use the extract_first() method to extract only the first matched element. The benefit of using it is that it avoids an IndexError and returns None when it doesn’t find any element matching the selection. So, it should be: