
Hi, I'm trying to web scrape (with Scrapy) this website, https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/, using the script below.

script.py

import scrapy
from scrapy.crawler import CrawlerProcess


class CourtSpider(scrapy.Spider):
    name = 'full_page'
    allowed_domains = ['vaniercollege.qc.ca']
    start_urls = ['https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/']

    def parse(self, response):
        # Extract the entire HTML of the page
        page_html = response.text

        # You can either process the HTML right here, or yield it to be processed later
        yield {'html': page_html}

        # Optionally, save the HTML to a file
        with open('page_content.html', 'w', encoding='utf-8') as f:
            f.write(page_html)

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(CourtSpider)
process.start()  # Blocks until crawling is finished

When I run this script, instead of the schedule I get back this page:

<html>
    <title>You are being redirected...</title>
    <noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
    <script>
        var s = {}, u, c, U, r, i, l = 0, a, e = eval, w = String.fromCharCode, sucuri_cloudproxy_js = '',
        S = 'cD1TdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAiOXNlYyIuc3Vic3RyKDAsMSkgKyAiNyIgKyAnVnJDYScuc3Vic3RyKDMsIDEpICsnODMnLnNsaWNlKDEsMikrICcnICsnJysnSmhQYScuc3Vic3RyKDMsIDEpICsgJycgKyc4JyArICAnPzknLnNsaWNlKDEsMikrIjZzdWN1ciIuY2hhckF0KDApKyc8djRhJy5zdWJzdHIoMywgMSkgKyIiICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMikgKyAnMScgKyAgImMiICsgICcnICsnJysiNCIgKyAiIiArIjRyIi5jaGFyQXQoMCkgKyAndUA5Jy5jaGFyQXQoMikrICcnICsnJysiZiIgKyAgJycgKyJkc3VjdXIiLmNoYXJBdCgwKSsiNHNlYyIuc3Vic3RyKDAsMSkgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4NjEpICsgIjciICsgIiIgKyJiIiArICJkIiArICIiICsiZHN1Ii5zbGljZSgwLDEpICsgIiIgK1N0cmluZy5mcm9tQ2hhckNvZGUoNTYpICsgImQiICsgJzQnICsgICIxIiArICI0c2VjIi5zdWJzdHIoMCwxKSArICI5aSIuY2hhckF0KDApICsgU3RyaW5nLmZyb21DaGFyQ29kZSgweDY2KSArICc6MCcuc2xpY2UoMSwyKSsnJztkb2N1bWVudC5jb29raWU9J3MnKycnKyd1JysnY3N1Y3VyJy5jaGFyQXQoMCkrICd1JysncicrJ2lzdWN1cmknLmNoYXJBdCgwKSArICdfJysnYycrJ2wnKycnKydvcycuY2hhckF0KDApKyd1JysnJysnZCcrJ3AnKydyJy5jaGFyQXQoMCkrJ29zdScuY2hhckF0KDApICsneCcrJ3knKycnKydfc3VjdScuY2hhckF0KDApICArJ3UnKyd1c3VjdXJpJy5jaGFyQXQoMCkgKyAnaXN1Jy5jaGFyQXQoMCkgKydkJysnX3N1Y3VyaScuY2hhckF0KDApICsgJzlzdWN1cmknLmNoYXJBdCgwKSArICdzOScuY2hhckF0KDEpKydzYicuY2hhckF0KDEpKyc4JysnYicrJ2ZzdScuY2hhckF0KDApICsnZicrJ2VzdWN1cicuY2hhckF0KDApKyAnMScrIj0iICsgcCArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';
        L = S.length;
        U = 0;
        r = '';
        var A = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
        for (

I was trying to fetch the page content, but it always gives me this redirect page instead. How can I get Scrapy to handle the JavaScript, ideally with only one script file?
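For reference, that `S` variable is plain Base64. The `sucuri_cloudproxy_js` name suggests Sucuri's anti-bot challenge: the decoded script computes a cookie and reloads the page, which Scrapy alone cannot do because it never executes JavaScript. Decoding just the first characters of the payload with the standard library shows what it contains:

```python
import base64

# First characters of the `S` payload from the redirect page above
s = "cD1TdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkg"
decoded = base64.b64decode(s).decode()
print(decoded)  # p=String.fromCharCode(100)
```

The full payload goes on to assemble a `sucuri_cloudproxy_uuid_...` cookie and call `location.reload()`.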

Thanks

2 Answers


  1. Well, if the website requires JavaScript to render its content, Scrapy alone won't be sufficient to scrape it. To handle websites that rely heavily on JavaScript, you'll need a browser-automation tool such as Selenium, or a JavaScript rendering service such as Splash.

    For this to work, install selenium with pip. With Selenium 4.6+ the bundled Selenium Manager downloads a matching chromedriver automatically; on older versions you need to install chromedriver on your machine yourself.

    Some example code:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    import time

    # Set up Chrome WebDriver options
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')               # Run Chrome without a GUI
    chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid /dev/shm usage errors
    chrome_options.add_argument('--no-sandbox')

    # Instantiate the Chrome WebDriver
    service = Service()
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Load the webpage
    url = 'https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/'
    driver.get(url)

    # Give the JavaScript time to render the page
    time.sleep(5)  # Adjust according to how long the page takes to load

    # Extract the rendered HTML
    page_content = driver.page_source

    # Save the HTML content to a file
    with open('page_content.html', 'w', encoding='utf-8') as f:
        f.write(page_content)

    # Quit the browser
    driver.quit()
    

    You can adjust the sleep time according to the page loading time to ensure the page is fully loaded before extracting the content. For something more robust than a fixed sleep, Selenium's WebDriverWait with expected_conditions lets you wait for a specific element to appear instead.
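    If you'd rather stay inside Scrapy with a single script file, another option is the scrapy-playwright plugin, which renders each request in a real browser. This is a sketch, assuming you've run `pip install scrapy-playwright` and `playwright install chromium`; note that Sucuri may still detect and block some headless browsers.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class CourtSpider(scrapy.Spider):
    name = "full_page"
    allowed_domains = ["vaniercollege.qc.ca"]

    def start_requests(self):
        # meta={"playwright": True} routes the request through a real browser,
        # so the Sucuri JavaScript challenge gets executed
        yield scrapy.Request(
            "https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/",
            meta={"playwright": True},
        )

    def parse(self, response):
        yield {"html": response.text}


process = CrawlerProcess(settings={
    # Hand all downloads to Playwright instead of Scrapy's default downloader
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio Twisted reactor
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "FEEDS": {"items.json": {"format": "json"}},
})
process.crawl(CourtSpider)
process.start()  # Blocks until crawling is finished
```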

  2. In Scrapy, when extracting data from the response object, you typically use the get() or getall() methods on selectors to pull out specific fields rather than yielding the raw HTML. Note that the CSS classes below (.schedule-item, .time, .activity, .location) are placeholders; inspect the rendered page and substitute the real ones. Keep in mind this alone won't get past the JavaScript check, so combine it with one of the rendering approaches above.
    Try this kind of code.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class CourtSpider(scrapy.Spider):
        name = 'full_page'
        allowed_domains = ['vaniercollege.qc.ca']
        start_urls = ['https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/']
    
        def parse(self, response):
            # Extract specific data from the page using Scrapy selectors
            schedule_items = response.css('.schedule-item')
    
            for item in schedule_items:
                time = item.css('.time::text').get()
                activity = item.css('.activity::text').get()
                location = item.css('.location::text').get()
    
                yield {
                    'time': time.strip() if time else None,
                    'activity': activity.strip() if activity else None,
                    'location': location.strip() if location else None
                }
    
    # Run the spider
    process = CrawlerProcess(settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    })
    process.crawl(CourtSpider)
    process.start()  # Blocks until crawling is finished
    
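    As a dependency-free alternative for post-processing a saved page_content.html, the same kind of extraction can be sketched with nothing but the standard library. The schedule-item / time class names here are the same placeholder assumptions as above.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text content of every element carrying a given class."""

    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0   # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                      # a tag nested inside a match
        elif self.cls in dict(attrs).get("class", "").split():
            self.depth = 1
            self.texts.append("")                # start a new text bucket

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Placeholder markup mimicking what a rendered schedule might contain
html_doc = """
<div class="schedule-item"><span class="time">9:00</span></div>
<div class="schedule-item"><span class="time">10:30</span></div>
"""

extractor = ClassTextExtractor("time")
extractor.feed(html_doc)
print(extractor.texts)  # ['9:00', '10:30']
```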