Hi, I'm trying to scrape this website with Scrapy: https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/ using the script below.
script.py
import scrapy
from scrapy.crawler import CrawlerProcess
from threading import Thread


class CourtSpider(scrapy.Spider):
    name = 'full_page'
    allowed_domains = ['vaniercollege.qc.ca']
    start_urls = ['https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/']

    def parse(self, response):
        # Extract the entire HTML of the page
        page_html = response.text
        # You can either process the HTML right here, or yield it to be processed later
        yield {'html': page_html}
        # Optionally, save the HTML to a file
        with open('page_content.html', 'w', encoding='utf-8') as f:
            f.write(page_html)


# def run_spider_in_thread():
process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(CourtSpider)
process.start()  # Blocks until crawling is finished
When I run this script, the response I get back is this page:
<html>
<title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>
var s = {}, u, c, U, r, i, l = 0, a, e = eval, w = String.fromCharCode, sucuri_cloudproxy_js = '',
S = 'cD1TdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAiOXNlYyIuc3Vic3RyKDAsMSkgKyAiNyIgKyAnVnJDYScuc3Vic3RyKDMsIDEpICsnODMnLnNsaWNlKDEsMikrICcnICsnJysnSmhQYScuc3Vic3RyKDMsIDEpICsgJycgKyc4JyArICAnPzknLnNsaWNlKDEsMikrIjZzdWN1ciIuY2hhckF0KDApKyc8djRhJy5zdWJzdHIoMywgMSkgKyIiICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMikgKyAnMScgKyAgImMiICsgICcnICsnJysiNCIgKyAiIiArIjRyIi5jaGFyQXQoMCkgKyAndUA5Jy5jaGFyQXQoMikrICcnICsnJysiZiIgKyAgJycgKyJkc3VjdXIiLmNoYXJBdCgwKSsiNHNlYyIuc3Vic3RyKDAsMSkgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4NjEpICsgIjciICsgIiIgKyJiIiArICJkIiArICIiICsiZHN1Ii5zbGljZSgwLDEpICsgIiIgK1N0cmluZy5mcm9tQ2hhckNvZGUoNTYpICsgImQiICsgJzQnICsgICIxIiArICI0c2VjIi5zdWJzdHIoMCwxKSArICI5aSIuY2hhckF0KDApICsgU3RyaW5nLmZyb21DaGFyQ29kZSgweDY2KSArICc6MCcuc2xpY2UoMSwyKSsnJztkb2N1bWVudC5jb29raWU9J3MnKycnKyd1JysnY3N1Y3VyJy5jaGFyQXQoMCkrICd1JysncicrJ2lzdWN1cmknLmNoYXJBdCgwKSArICdfJysnYycrJ2wnKycnKydvcycuY2hhckF0KDApKyd1JysnJysnZCcrJ3AnKydyJy5jaGFyQXQoMCkrJ29zdScuY2hhckF0KDApICsneCcrJ3knKycnKydfc3VjdScuY2hhckF0KDApICArJ3UnKyd1c3VjdXJpJy5jaGFyQXQoMCkgKyAnaXN1Jy5jaGFyQXQoMCkgKydkJysnX3N1Y3VyaScuY2hhckF0KDApICsgJzlzdWN1cmknLmNoYXJBdCgwKSArICdzOScuY2hhckF0KDEpKydzYicuY2hhckF0KDEpKyc4JysnYicrJ2ZzdScuY2hhckF0KDApICsnZicrJ2VzdWN1cicuY2hhckF0KDApKyAnMScrIj0iICsgcCArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';
L = S.length;
U = 0;
r = '';
var A = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
for (
I was trying to fetch the content, but it always gives me this page. How can I enable JavaScript with Scrapy, using only one script file?
Thanks
2 Answers
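Well, if the website requires JavaScript before it serves its content (here the response is a JavaScript redirect page that sets a cookie and reloads), Scrapy alone won't be sufficient to scrape it, because Scrapy does not execute JavaScript. To handle sites like this, you'll need something that runs a real browser engine, such as Selenium driving a headless browser, or Splash.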
For this to work, you'll need to install selenium with pip (pip install selenium) and have Chrome and a matching chromedriver available on your machine.
Some example code:
You can adjust the sleep time according to the page loading time to ensure that the page is fully loaded before extracting the content.
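A minimal sketch of the Selenium approach described above, assuming Selenium 4 with Chrome installed. The helper names build_driver and fetch_rendered_html are illustrative and not from the original answer, and the 5-second wait is an arbitrary starting value. Whether the redirect page actually lets a headless browser through is not guaranteed.

```python
import time


def build_driver():
    """Create a headless Chrome driver (assumes selenium and Chrome are installed)."""
    # Imported lazily so fetch_rendered_html can be used with any driver-like object.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url, wait_seconds=5.0):
    """Load the URL, give its JavaScript time to run, and return the resulting HTML."""
    driver.get(url)
    time.sleep(wait_seconds)  # crude fixed wait; adjust to the page's loading time
    return driver.page_source


if __name__ == "__main__":
    url = "https://www.vaniercollege.qc.ca/sports-recreation/weekly-schedule/"
    driver = build_driver()
    try:
        html = fetch_rendered_html(driver, url)
    finally:
        driver.quit()  # always release the browser, even if loading fails
    with open("page_content.html", "w", encoding="utf-8") as f:
        f.write(html)
```

Keeping everything in one file matches the single-script requirement in the question. The returned HTML can then be parsed with Scrapy's selectors instead of being written to disk.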
In Scrapy, when extracting data from the response object, you typically use the get() or getall() methods on selectors to extract specific data and avoid including HTML elements. Try this kind of code.