
So I have a list of URLs that I pull from a database, and I need to crawl each one and parse its JSON response. Some URLs return null, while others return information that is written to a CSV file. I'm currently using Scrapy, but it takes about 4 hours to scrape these 12,000 URLs. I've looked into things like scrapy-redis, Scrapy Cluster, and Frontera, but I'm not sure those fit my use case, since they seem to revolve around scraping URLs found on websites.

Is 4 hours a "normal" time for this number of URLs on a single machine? Or are there any packages that might fit my use case better, where no links are "followed" when pages are scraped?

3 Answers


  1. Are the URLs and output independent of each other? You could set up Python multiprocessing and process them in parallel, then combine the output at the end. The number of processes is up to you, but it would let you use more than one core of your machine.

    https://docs.python.org/2/library/multiprocessing.html

    Also, do you need to load the content, or can you just use the response codes to tell you whether the server is responding at that URL?

    If you are going to be doing a lot of this kind of work, and want fast processing, Golang has excellent support for web services and parallelisation.
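
    A minimal sketch of the multiprocessing approach, assuming the requests library is installed; load_urls_from_database() and the CSV columns are placeholders for your own setup:

    import csv
    from multiprocessing import Pool

    import requests


    def fetch(url):
        # Fetch one URL and return its parsed JSON, or None on a miss/failure.
        try:
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                return None
            return response.json() or None  # treat null/empty bodies as misses
        except (requests.RequestException, ValueError):
            return None


    if __name__ == '__main__':
        urls = load_urls_from_database()  # hypothetical helper: your own DB query
        with Pool(processes=8) as pool:   # tune the worker count to your machine
            results = pool.map(fetch, urls)

        # Combine the non-null results into a single CSV at the end
        # (assumes each response is a flat JSON object).
        rows = [r for r in results if r is not None]
        if rows:
            with open('output.csv', 'w', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
                writer.writeheader()
                writer.writerows(rows)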

  2. I doubt you will find a much faster way than Scrapy. It has great tools for crawling a site, but it can also be used just for scraping a list of known URLs, and it is useful for scraping JSON too. Just make sure you use concurrent requests so that several pages are processed at the same time; the relevant settings are sketched below. If you risk getting blocked because of many requests in a short period of time, you can use rotating proxies like https://github.com/TeamHG-Memex/scrapy-rotating-proxies or a proxy service such as Crawlera. 4 hours for only 12k URLs sounds like a lot.
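
    The relevant knobs live in Scrapy's settings; roughly something like this (the values are guesses to tune against your target server):

    # settings.py (or custom_settings on the spider) -- example values only
    CONCURRENT_REQUESTS = 64             # total parallel requests (default is 16)
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # most of the 12k URLs likely share one domain
    DOWNLOAD_DELAY = 0                   # no artificial pause between requests
    AUTOTHROTTLE_ENABLED = False         # leave off if the server tolerates the load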

    How are you scraping the JSON responses with Scrapy?

    This code would scrape a JSON file from the Swedish innovation agency Vinnova containing all projects that have been granted financing from the agency, and output the titles of all the projects:

    import json

    import scrapy


    class TestscraperSpider(scrapy.Spider):
        name = 'testScraper'
        allowed_domains = ['vinnova.se']
        start_urls = [
            'https://www.vinnova.se/sok-finansiering/hitta-finansiering/search/']

        def parse(self, response):
            # The endpoint returns JSON, so decode the response body directly.
            jsonresponse = json.loads(response.text)
            # Pull out the title of every project in the result list.
            titles = [project['Heading']
                      for project in jsonresponse['FindHitList']]
            yield {"titles": titles}
    
    

    If you have more than one JSON file to scrape, you can just add more URLs to the list. You can do this in three different ways.

    1. Manual adding

    You can just copy and paste more URLs into the list. Probably not the best way to do it if you have 12k URLs.

    start_urls = [
            'domain.com/link1', 'domain.com/link2', 'domain.com/link3', 'domain.com/link4',]
    

    2. Get start URLs from an external source

    You can just override start_urls by writing a custom __init__ like this:

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Get the list of URLs from an external source (database, file, ...)
        self.start_urls = data_external
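
    As a runnable sketch (the spider name and the urls.txt file name are just placeholders), the override could read one URL per line from a file:

    import json

    import scrapy


    class JsonListSpider(scrapy.Spider):
        name = 'jsonList'

        def __init__(self, url_file='urls.txt', *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Load one URL per line from an external file; this could just as
            # easily be a database query.
            with open(url_file) as f:
                self.start_urls = [line.strip() for line in f if line.strip()]

        def parse(self, response):
            # Parse each JSON response as in the example above.
            yield {'url': response.url, 'data': json.loads(response.text)}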
    
    

    3. Use a custom method for requesting URLs

    Here you bypass start_urls entirely and call Scrapy's Request method manually for each link.

    from scrapy.http import Request

    def start_requests(self):
        # Get the URLs from an external source.
        for url in urls:
            yield Request(url)
    

    In your case you can probably use either 2 or 3; it shouldn't matter much with that few URLs.
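
    Putting option 3 together with the JSON handling for your case might look roughly like this; load_urls_from_database(), the spider name and the yielded fields are placeholders for your own data:

    import json

    import scrapy
    from scrapy.http import Request


    class UrlListSpider(scrapy.Spider):
        name = 'urlList'

        def start_requests(self):
            # Hypothetical helper -- replace with your own database query.
            for url in load_urls_from_database():
                yield Request(url, callback=self.parse)

        def parse(self, response):
            data = json.loads(response.text)
            if not data:  # some URLs return null -- skip those
                return
            # Yield the fields you need (assumes each response is a JSON object);
            # running `scrapy crawl urlList -o results.csv` writes them to CSV.
            yield data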

  3. You can use FireScraper, https://firescraper.com/.
    It's a good tool for scraping text from multiple URLs. The best part is that it doesn't run on your machine, and it's a little faster than the other tools I tried.
