
The Zyte tutorial "Create your first spider" crawls this page, which has a pager with a "normal" next link. But what if the next link contains only href="#" and runs JavaScript instead, as many websites do nowadays? In that case, there is no URL for next_page_links, so response.follow_all cannot be used, right?

The "Handle JavaScript" chapter of the Zyte tutorial suggests using browser automation, and its example demonstrates this with the scrollBottom action on http://quotes.toscrape.com/scroll.
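For context, the request metadata behind such a scrollBottom example looks roughly like the sketch below. It is written from memory of the tutorial, so the exact keys should be checked against the current Zyte API documentation; the variable name scroll_meta is only illustrative.

```python
# Metadata that would be passed as Request(url, meta=scroll_meta) when using
# scrapy-zyte-api. "scroll_meta" is a hypothetical name for illustration.
scroll_meta = {
    "zyte_api_automap": {
        # Ask Zyte API to render the page in a browser.
        "browserHtml": True,
        # Browser actions run after the page has loaded.
        "actions": [
            {"action": "scrollBottom"},
        ],
    }
}
```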

Unfortunately, there is no example of how to handle a click action on a next link so that the next results load via JavaScript. As a proof of concept, clicking the link should work just as well with a normal link like the one on http://books.toscrape.com.

This is what I tried:

import scrapy

from scrapy import Request


class BooksToScrapeSpider(scrapy.Spider):
    name = "books_toscrape"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        # Extract book data
        for book in response.css("article.product_pod"):
            yield {
                "name": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
            }

        # Find the "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            self.logger.info(f"Found next page: {next_page}")
            yield Request(
                # response.urljoin(next_page),
                "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html#",
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                        "actions": [
                            {
                                "action": "click",
                                "selector": {"type": "css", "value": "li.next a"},
                            },
                            {
                                "action": "waitForSelector",
                                "selector": {
                                    "type": "css",
                                    "value": "li.previous a",
                                },
                            },
                        ],
                    }
                },
                callback=self.parse,
            )
        else:
            self.logger.info("No next page found")

To perform Zyte browser automation, I first need a request, right? So it doesn’t work without a URL. In my fictitious case, the URL would be http://books.toscrape.com/catalogue/category/books/mystery_3/index.html#. But I do not want to fire a request and then perform an action. What I want is to perform an action without a request (as an ‘onclick’ event does), where the action itself triggers something, for example a request.

I’ve been racking my brains for days on how to do this – to no avail. Does anyone have any ideas for me?

2 Answers


  1. Chosen as BEST ANSWER

    I contacted Zyte's support, and they informed me that it's currently not possible to perform a click action without an additional request. I'm not the only one requesting this feature, so they might implement it in the future.


  2. A partial workaround could be to use networkCapture. It can capture the responses triggered by up to 10 clicks on the "Next" button, with all of the clicks executed within a single Zyte API request.
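A minimal sketch of what the parameters for such a request could look like. The helper build_click_capture_params and its default arguments are hypothetical, the networkCapture field names (filterType, matchType, value, httpResponseBody) follow my understanding of the Zyte API documentation and should be verified there, and the limit of 10 is taken from the answer above.

```python
def build_click_capture_params(clicks, next_selector="li.next a", url_filter="/catalogue/"):
    """Build Zyte API parameters that click "Next" several times in one request.

    Hypothetical helper: the selector and URL filter defaults are assumptions
    matching the books.toscrape.com example from the question.
    """
    # The limit of 10 clicks comes from the answer, not from a verified spec.
    if not 1 <= clicks <= 10:
        raise ValueError("clicks must be between 1 and 10")

    # One click action per page turn. Waits between clicks may be needed on
    # real sites; they are left out here to keep the sketch small.
    actions = [
        {"action": "click", "selector": {"type": "css", "value": next_selector}}
        for _ in range(clicks)
    ]

    return {
        "browserHtml": True,
        "actions": actions,
        # Capture the bodies of responses whose URL contains url_filter.
        "networkCapture": [
            {
                "filterType": "url",
                "matchType": "contains",
                "value": url_filter,
                "httpResponseBody": True,
            }
        ],
    }
```

In a spider this would be passed as meta={"zyte_api_automap": build_click_capture_params(3)}, and the captured responses would then have to be read from the raw Zyte API response in the callback rather than from the rendered HTML alone.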
