I am creating a web scraper with Scrapy (Python). Here is my code:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = [
        'https://perfumehut.com.pk/shop/',
    ]

    def parse(self, response):
        yield {
            'product_link': response.css('a.product-image-link::attr("href")').get(),
            'product_title': response.css('h3.product-title>a::text').get(),
            'product_price': response.css('span.price > span > bdi::text').get(),
        }
        next_page = response.css('ul.page-numbers>li>a.next.page-numbers::attr("href")').get()
        if next_page is not None:
            print()
            print(next_page)
            print()
            yield scrapy.Request(next_page)

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'batt': response.css('td.woocommerce-product-attributes-item__value p::text')[3].get(),
            'brand': response.css('div.woodmart-product-brand img::attr(alt)').get(),
            'brandimg': response.css('div.woodmart-product-brand img::attr(src)').get(),
            'price': response.css('p.price').xpath('./span/bdi/text()').get(),
            'r-price': response.css('p.price').xpath('./del/span/bdi/text()').get(),
            's-sale': response.css('p.price').xpath('./ins/span/bdi/text()').get(),
            'breadcrumbs': response.css('nav.woocommerce-breadcrumb a::text').getall(),
            'tags': response.css('span.tagged_as a::text').getall(),
            'attributes': response.css('td.woocommerce-product-attributes-item__value p::text').getall(),
            'img': response.css('figure.woocommerce-product-gallery__image a::attr("href")').getall(),
            'description': response.css('div.woocommerce-product-details__short-description p::text').get(),
            'description1': response.css('#tab-description > div > div > p::text').getall(),
            'description2': response.css('#tab-description > div > div > div > div > div > div > div > div > p::text').getall()
        }
It's a WooCommerce website. There are 57 pages in total, with 12 products per page, so an estimated 684 products. But my code returns nothing. What did I do wrong while scraping the URLs?
2 Answers
To extract information from all the pages, you need to extract the next-page URL and then parse it. Here is a simple example that I think will help you sort out the issue.
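A minimal sketch, reusing the listing selectors from the question; the li.product container is an assumption (the usual WooCommerce product markup), so adjust it to your theme if needed. The loop yields one item per product instead of only the first match on the page, and response.follow keeps requesting the next page until the link disappears:

import scrapy

class ShopSpider(scrapy.Spider):
    name = 'shopspider'
    start_urls = ['https://perfumehut.com.pk/shop/']

    def parse(self, response):
        # One item per product card, not just the first match on the page.
        # 'li.product' is the default WooCommerce container (assumption).
        for product in response.css('li.product'):
            yield {
                'product_link': product.css('a.product-image-link::attr(href)').get(),
                'product_title': product.css('h3.product-title > a::text').get(),
                'product_price': product.css('span.price > span > bdi::text').get(),
            }

        # Keep following the "next" link until the last page
        next_page = response.css('a.next.page-numbers::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)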
Okay, this should do it. The root problem is that you define parse() twice in the same class, so Python keeps only the second definition: the product-detail selectors run against the shop listing page, where they match little or nothing, and your pagination code never executes at all. Rename the detail parser and chain the requests:
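A sketch of the reworked spider, keeping the selectors from the question: parse() handles the listing and pagination, and each product link is followed into a parse_product() callback (the name is just illustrative). The only behavioral change on the detail page is guarding the [3] index so a short attribute table doesn't raise IndexError:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://perfumehut.com.pk/shop/']

    def parse(self, response):
        # Shop listing: send every product link to the detail callback
        for href in response.css('a.product-image-link::attr(href)').getall():
            yield response.follow(href, callback=self.parse_product)

        # Then move on to the next listing page, if any
        next_page = response.css('ul.page-numbers > li > a.next.page-numbers::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        # Product detail page: same selectors as in the question
        attributes = response.css('td.woocommerce-product-attributes-item__value p::text').getall()
        yield {
            'title': response.css('h1::text').get(),
            # Guard the index so a short attribute table yields None instead of crashing
            'batt': attributes[3] if len(attributes) > 3 else None,
            'brand': response.css('div.woodmart-product-brand img::attr(alt)').get(),
            'brandimg': response.css('div.woodmart-product-brand img::attr(src)').get(),
            'price': response.css('p.price').xpath('./span/bdi/text()').get(),
            'r-price': response.css('p.price').xpath('./del/span/bdi/text()').get(),
            's-sale': response.css('p.price').xpath('./ins/span/bdi/text()').get(),
            'breadcrumbs': response.css('nav.woocommerce-breadcrumb a::text').getall(),
            'tags': response.css('span.tagged_as a::text').getall(),
            'attributes': attributes,
            'img': response.css('figure.woocommerce-product-gallery__image a::attr(href)').getall(),
            'description': response.css('div.woocommerce-product-details__short-description p::text').get(),
            'description1': response.css('#tab-description > div > div > p::text').getall(),
        }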