
I am running a couple of crawlers that produce millions of datasets per day. The bottleneck is the latency between the spiders and the remote database. If the spider server is located too far from the database, the latency slows the crawler down to the point where it can no longer complete the datasets needed for a day.

While searching for a solution I came upon Redis, with the idea of installing Redis on the spider server, where it would temporarily store the collected data with low latency. The data would then somehow be moved from Redis to MySQL.
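To make the idea concrete, here is a minimal sketch of the spider side, assuming the redis-py client and a Redis instance on the same host as the spider. The key name `spider:rows`, the function `buffer_row`, and the row fields are hypothetical names chosen for illustration, not part of my actual setup.

```python
import json


def buffer_row(redis_client, row, key="spider:rows"):
    """Append one scraped row to a local Redis list.

    `redis_client` is assumed to be a redis-py client, for example
    redis.Redis(host="localhost", port=6379). The key name is hypothetical.
    Returns the list length reported by RPUSH.
    """
    # RPUSH appends at the tail, so rows keep their arrival order.
    return redis_client.rpush(key, json.dumps(row))
```

Because the write goes to localhost, the spider would no longer wait on the remote round trip for every row.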

The current setup looks like this:

  • About 40 spiders running on multiple instances feed one central remote MySQL 8 server on a dedicated machine over TCP/IP.
  • Each spider writes different datasets. One kind of spider collects positions and prices of search results; a page has 100 results and produces around 200-300 inserts. The delay between requests/pages is about 2-10 s.

The latter is the problem, as the spider yields every position within that page and issues a remote insert within a transaction for each one, maybe even a new connection (not sure at the moment).

This currently only works because the spiders and the remote MySQL server are close (same data center), with ping times of 0.0x ms. It does not work with ping times of 50 ms, as the spiders cannot write fast enough.
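A rough back-of-the-envelope calculation shows why. It assumes each insert costs exactly one network round trip and ignores query execution time. The function name `seconds_per_page` is made up for this sketch, and 0.05 ms is only a stand-in for the "0.0x ms" case.

```python
def seconds_per_page(inserts_per_page, round_trip_ms, round_trips_per_insert=1):
    """Time spent waiting on the network for one page, in seconds.

    Assumes the inserts are sent one after another, each waiting for its reply.
    """
    return inserts_per_page * round_trips_per_insert * round_trip_ms / 1000.0


# 0.05 ms stands in for the same-data-center case; 50 ms is the remote case.
for rtt_ms in (0.05, 50.0):
    low = seconds_per_page(200, rtt_ms)
    high = seconds_per_page(300, rtt_ms)
    print(f"ping {rtt_ms} ms: {low:.2f}-{high:.2f} s of network wait per page")
```

Under these assumptions, 200-300 inserts at 50 ms each mean 10-15 s of waiting per page, which is more than the 2-10 s delay between pages, so the writes fall behind the crawl.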

Is Redis or maybe DataMQ a valid approach to solve the problem, or are there other recommended ways of doing this?

2 Answers


  1. I recommend using MongoDB instead of MySQL to store data

  2. Do you mean you have installed a Redis server on each spider?

    Actually, that is not a good solution for your case. But if you have already done this and still want to use MySQL to persist your data, a cron job on each server is an option.

    You can create a cron job on each spider server (depending on your dataset and your needs, you can choose a daily or hourly sync job) and write a data transfer script that scans your Redis and transfers the data to the MySQL tables.
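The transfer script described in the second answer might look roughly like the sketch below. It assumes each spider has pushed rows as JSON strings onto a Redis list, that `redis_client` is a redis-py client, and that `mysql_conn` is a DB-API connection from a MySQL driver that uses `%s` placeholders. The table `search_results`, its columns, the key `spider:rows`, and the function name `drain_to_mysql` are all hypothetical.

```python
import json

# Hypothetical table and columns, used only for illustration.
INSERT_SQL = (
    "INSERT INTO search_results (keyword, position, price) "
    "VALUES (%s, %s, %s)"
)


def drain_to_mysql(redis_client, mysql_conn, key="spider:rows", batch_size=500):
    """Move buffered rows from a Redis list into MySQL in batches.

    Returns the number of rows transferred.
    """
    moved = 0
    while True:
        # Pop up to batch_size rows from the head of the list.
        batch = []
        while len(batch) < batch_size:
            raw = redis_client.lpop(key)
            if raw is None:  # list is empty
                break
            batch.append(json.loads(raw))
        if not batch:
            return moved

        cursor = mysql_conn.cursor()
        try:
            # One executemany call and one commit per batch, not per row.
            cursor.executemany(
                INSERT_SQL,
                [(r["keyword"], r["position"], r["price"]) for r in batch],
            )
            mysql_conn.commit()
        except Exception:
            # On failure, undo the insert and put the rows back at the head
            # of the list in their original order, then re-raise.
            mysql_conn.rollback()
            redis_client.lpush(key, *[json.dumps(r) for r in reversed(batch)])
            raise
        finally:
            cursor.close()
        moved += len(batch)
```

A cron entry would then run a small wrapper around `drain_to_mysql` hourly or daily. Batching matters here because the slow link is crossed once per batch instead of once per row.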
