Goal: I want to build a web scraper inside a Rails app that runs indefinitely and can be scaled.
Current stack the app is running on:
Ruby on Rails / Heroku / Redis / Postgres
Idea:
I was thinking of running a Sidekiq job every n minutes that checks whether any proxies are available to scrape with (proxies will be stored in a table with a status of sleeping/scraping).
If a proxy is available, the job will then check (using the Sidekiq API) whether there are any free workers and, if so, start another job to scrape with that proxy.
This means I could scale the scraper by increasing the number of workers and the number of available proxies. If for any reason a scraping job fails, the job that looks for available proxies will just start it again.
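Roughly what I have in mind for the dispatcher, as a sketch (Proxy and ScraperWorker are placeholder names, and I'm assuming Sidekiq's ProcessSet reports per-process busy/concurrency counts):

```ruby
# Sketch of the dispatcher idea. Proxy and ScraperWorker are placeholder names.
require 'sidekiq/api'

class ProxyDispatcherWorker
  include Sidekiq::Worker

  def perform
    processes = Sidekiq::ProcessSet.new
    total     = processes.sum { |p| p['concurrency'].to_i } # threads the cluster can run
    busy      = processes.sum { |p| p['busy'].to_i }        # threads currently working
    free      = total - busy
    return if free <= 0

    # Hand one sleeping proxy to each free worker slot.
    Proxy.where(status: 'sleeping').limit(free).each do |proxy|
      proxy.update!(status: 'scraping')
      ScraperWorker.perform_async(proxy.id)
    end
  end
end
```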
Questions: Is this the best solution for my goal? Is relying on long-running Sidekiq jobs a good idea, or could this blow up?
2 Answers
If you want a job to run every n minutes, you could schedule it.
And since you’re using Heroku, there is an add-on for that: Heroku Scheduler (https://devcenter.heroku.com/articles/scheduler).
Another solution would be to set up cron jobs and schedule them with the whenever gem.
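For example, with the whenever gem the schedule could look something like this (a sketch; the worker name is just a placeholder carried over from the question):

```ruby
# config/schedule.rb (whenever gem)
every 10.minutes do
  # Enqueue the dispatcher; replace with whatever class actually kicks off scraping.
  runner "ProxyDispatcherWorker.perform_async"
end
```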
Sidekiq is designed to run individual jobs, each of which is a “unit of work” to your organization.
You can build your own loop and, inside that loop, create jobs for each page to scrape, but the loop itself should not be a Sidekiq job.
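A minimal sketch of that shape, with placeholder names (Proxy is assumed to be your model, and fetch_with_proxy stands in for whatever HTTP client you use): each page becomes its own small job that Sidekiq can retry independently, while the enqueueing loop lives outside Sidekiq (rake task, console, clock process).

```ruby
# One small, retryable unit of work per page. Names are illustrative.
class ScrapePageWorker
  include Sidekiq::Worker

  def perform(url, proxy_id)
    proxy = Proxy.find(proxy_id)          # placeholder model
    html  = fetch_with_proxy(url, proxy)  # hypothetical helper wrapping your HTTP client
    # ... parse html and persist the results ...
  end
end

# The loop itself is plain Ruby (rake task, console, scheduled dispatcher), not a job:
urls.each { |url| ScrapePageWorker.perform_async(url, proxy.id) }
```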