
Hi, I'm trying to web scrape (with Scrapy) this website, https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/, using the script below.

script.py

import scrapy
from scrapy.crawler import CrawlerProcess


class CourtSpider(scrapy.Spider):
    name = 'full_page'
    allowed_domains = ['vaniercollege.qc.ca']
    start_urls = ['https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/']

    def parse(self, response):
        # Extract the entire HTML of the page
        page_html = response.text

        # You can either process the HTML right here, or yield it to be processed later
        yield {'html': page_html}

        # Optionally, save the HTML to a file
        with open('page_content.html', 'w', encoding='utf-8') as f:
            f.write(page_html)

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(CourtSpider)
process.start()  # Blocks until crawling is finished

When I run this script, instead of the schedule I get back this page:

<html>
    <title>You are being redirected...</title>
    <noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
    <script>
        var s = {}, u, c, U, r, i, l = 0, a, e = eval, w = String.fromCharCode, sucuri_cloudproxy_js = '',
        S = 'cD1TdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAiOXNlYyIuc3Vic3RyKDAsMSkgKyAiNyIgKyAnVnJDYScuc3Vic3RyKDMsIDEpICsnODMnLnNsaWNlKDEsMikrICcnICsnJysnSmhQYScuc3Vic3RyKDMsIDEpICsgJycgKyc4JyArICAnPzknLnNsaWNlKDEsMikrIjZzdWN1ciIuY2hhckF0KDApKyc8djRhJy5zdWJzdHIoMywgMSkgKyIiICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMikgKyAnMScgKyAgImMiICsgICcnICsnJysiNCIgKyAiIiArIjRyIi5jaGFyQXQoMCkgKyAndUA5Jy5jaGFyQXQoMikrICcnICsnJysiZiIgKyAgJycgKyJkc3VjdXIiLmNoYXJBdCgwKSsiNHNlYyIuc3Vic3RyKDAsMSkgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4NjEpICsgIjciICsgIiIgKyJiIiArICJkIiArICIiICsiZHN1Ii5zbGljZSgwLDEpICsgIiIgK1N0cmluZy5mcm9tQ2hhckNvZGUoNTYpICsgImQiICsgJzQnICsgICIxIiArICI0c2VjIi5zdWJzdHIoMCwxKSArICI5aSIuY2hhckF0KDApICsgU3RyaW5nLmZyb21DaGFyQ29kZSgweDY2KSArICc6MCcuc2xpY2UoMSwyKSsnJztkb2N1bWVudC5jb29raWU9J3MnKycnKyd1JysnY3N1Y3VyJy5jaGFyQXQoMCkrICd1JysncicrJ2lzdWN1cmknLmNoYXJBdCgwKSArICdfJysnYycrJ2wnKycnKydvcycuY2hhckF0KDApKyd1JysnJysnZCcrJ3AnKydyJy5jaGFyQXQoMCkrJ29zdScuY2hhckF0KDApICsneCcrJ3knKycnKydfc3VjdScuY2hhckF0KDApICArJ3UnKyd1c3VjdXJpJy5jaGFyQXQoMCkgKyAnaXN1Jy5jaGFyQXQoMCkgKydkJysnX3N1Y3VyaScuY2hhckF0KDApICsgJzlzdWN1cmknLmNoYXJBdCgwKSArICdzOScuY2hhckF0KDEpKydzYicuY2hhckF0KDEpKyc4JysnYicrJ2ZzdScuY2hhckF0KDApICsnZicrJ2VzdWN1cicuY2hhckF0KDApKyAnMScrIj0iICsgcCArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';
        L = S.length;
        U = 0;
        r = '';
        var A = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
        for (

I was trying to fetch the page content, but it always gives me this redirect page instead. How can I get Scrapy to handle the JavaScript, ideally with only one script file?
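For reference, that `S` variable is plain Base64. The `sucuri_cloudproxy_js` name suggests Sucuri's anti-bot challenge: the decoded script computes a cookie and reloads the page, which Scrapy alone cannot do because it never executes JavaScript. Decoding just the first characters of the payload with the standard library shows what it contains:

```python
import base64

# First characters of the `S` payload from the redirect page above
s = "cD1TdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkg"
decoded = base64.b64decode(s).decode()
print(decoded)  # p=String.fromCharCode(100)
```

The full payload goes on to assemble a `sucuri_cloudproxy_uuid_...` cookie and call `location.reload()`.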

Thanks

2 Answers


  1. Well, if the website requires JavaScript to render its content, Scrapy alone won't be sufficient to scrape it. To handle websites that rely heavily on JavaScript, you'll need a browser-automation tool such as Selenium, or a JavaScript rendering service such as Splash.

    For this to work, install selenium with pip. With Selenium 4.6+ the bundled Selenium Manager downloads a matching chromedriver automatically; on older versions you need to install chromedriver on your machine yourself.

    Some example code:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    import time

    # Set up Chrome WebDriver options
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')               # Run Chrome without a GUI
    chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid /dev/shm usage errors
    chrome_options.add_argument('--no-sandbox')

    # Instantiate the Chrome WebDriver
    service = Service()
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Load the webpage
    url = 'https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/'
    driver.get(url)

    # Give the JavaScript time to render the page
    time.sleep(5)  # Adjust according to how long the page takes to load

    # Extract the rendered HTML
    page_content = driver.page_source

    # Save the HTML content to a file
    with open('page_content.html', 'w', encoding='utf-8') as f:
        f.write(page_content)

    # Quit the browser
    driver.quit()
    

    You can adjust the sleep time according to the page loading time to ensure the page is fully loaded before extracting the content. For something more robust than a fixed sleep, Selenium's WebDriverWait with expected_conditions lets you wait for a specific element to appear instead.
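    If you'd rather stay inside Scrapy with a single script file, another option is the scrapy-playwright plugin, which renders each request in a real browser. This is a sketch, assuming you've run `pip install scrapy-playwright` and `playwright install chromium`; note that Sucuri may still detect and block some headless browsers.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class CourtSpider(scrapy.Spider):
    name = "full_page"
    allowed_domains = ["vaniercollege.qc.ca"]

    def start_requests(self):
        # meta={"playwright": True} routes the request through a real browser,
        # so the Sucuri JavaScript challenge gets executed
        yield scrapy.Request(
            "https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/",
            meta={"playwright": True},
        )

    def parse(self, response):
        yield {"html": response.text}


process = CrawlerProcess(settings={
    # Hand all downloads to Playwright instead of Scrapy's default downloader
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio Twisted reactor
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "FEEDS": {"items.json": {"format": "json"}},
})
process.crawl(CourtSpider)
process.start()  # Blocks until crawling is finished
```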

  2. In Scrapy, when extracting data from the response object, you typically use the get() or getall() methods on selectors to pull out specific fields rather than yielding the raw HTML. Note that the CSS classes below (.schedule-item, .time, .activity, .location) are placeholders; inspect the rendered page and substitute the real ones. Keep in mind this alone won't get past the JavaScript check, so combine it with one of the rendering approaches above.
    Try this kind of code.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class CourtSpider(scrapy.Spider):
        name = 'full_page'
        allowed_domains = ['vaniercollege.qc.ca']
        start_urls = ['https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/']
    
        def parse(self, response):
            # Extract specific data from the page using Scrapy selectors
            schedule_items = response.css('.schedule-item')
    
            for item in schedule_items:
                time = item.css('.time::text').get()
                activity = item.css('.activity::text').get()
                location = item.css('.location::text').get()
    
                yield {
                    'time': time.strip() if time else None,
                    'activity': activity.strip() if activity else None,
                    'location': location.strip() if location else None
                }
    
    # Run the spider
    process = CrawlerProcess(settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    })
    process.crawl(CourtSpider)
    process.start()  # Blocks until crawling is finished
    
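    As a dependency-free alternative for post-processing a saved page_content.html, the same kind of extraction can be sketched with nothing but the standard library. The schedule-item / time class names here are the same placeholder assumptions as above.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text content of every element carrying a given class."""

    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0   # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                      # a tag nested inside a match
        elif self.cls in dict(attrs).get("class", "").split():
            self.depth = 1
            self.texts.append("")                # start a new text bucket

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Placeholder markup mimicking what a rendered schedule might contain
html_doc = """
<div class="schedule-item"><span class="time">9:00</span></div>
<div class="schedule-item"><span class="time">10:30</span></div>
"""

extractor = ClassTextExtractor("time")
extractor.feed(html_doc)
print(extractor.texts)  # ['9:00', '10:30']
```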