I want to build a product that can perform Internet scans (in Python) to collect various kinds of data.
I want to design it around tasks that perform these collection jobs.
Multiple scans can run in parallel on different inputs, so the same task can exist several times, each instance operating on its own input.
I wonder which architecture would fit and which technologies would be best.
I thought of using RabbitMQ to store the tasks and Redis to store inputs.
The initial inputs trigger the scan; each task then emits its output, which may become the input of other tasks.
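Roughly, I imagine something like the following minimal sketch. The queue name ("scan_tasks"), the Redis key layout ("input:*"), and the run_scan function are placeholders, not an existing implementation:

```python
# Sketch of the proposed RabbitMQ + Redis pipeline (names are hypothetical).
import json
import uuid

import pika   # RabbitMQ client
import redis  # Redis client

r = redis.Redis(host="localhost", port=6379)
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scan_tasks", durable=True)

def enqueue_task(task_type: str, payload: dict) -> None:
    """Store the input in Redis and publish a small reference message."""
    input_key = f"input:{uuid.uuid4()}"
    r.set(input_key, json.dumps(payload))
    message = {"task_type": task_type, "input_key": input_key}
    channel.basic_publish(
        exchange="",
        routing_key="scan_tasks",
        body=json.dumps(message),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

def handle_task(ch, method, properties, body) -> None:
    """Worker callback: run the scan, then feed results back as new tasks."""
    message = json.loads(body)
    payload = json.loads(r.get(message["input_key"]))
    results = run_scan(message["task_type"], payload)  # run_scan is a placeholder
    for result in results:
        enqueue_task(result["next_task_type"], result["data"])
    ch.basic_ack(delivery_tag=method.delivery_tag)

# channel.basic_consume(queue="scan_tasks", on_message_callback=handle_task)
# channel.start_consuming()
```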
What do you think of this possible design? Can it be improved? Other technologies?
2 Answers
It depends on the size of the inputs. If they are relatively small, I would go with just a message broker and send everything in the message (i.e. the task type and its inputs); otherwise an outside store is better. Depending on the durability requirements, a persistent store (such as a database) may be worth considering.
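For example, a self-contained message could look roughly like this (the field names and values are only illustrative):

```python
# Hypothetical self-contained task message: the task type and its inputs
# travel together, so small payloads need no external store.
import json

message = json.dumps({
    "task_type": "port_scan",
    "inputs": {"target": "198.51.100.7", "ports": [22, 80, 443]},
})
# channel.basic_publish(exchange="", routing_key="scan_tasks", body=message)
```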
One option is to use an existing orchestrator that hides most of the complexity instead of crafting a custom solution from queues and storage. Look at the temporal.io open-source project, which lets you orchestrate tasks in a high-level programming language.
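A rough sketch with the Temporal Python SDK (the temporalio package) might look like this; the workflow and activity names are made up for illustration:

```python
# Sketch using the Temporal Python SDK; scan_target / ScanWorkflow are hypothetical.
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def scan_target(target: str) -> list[str]:
    """Perform one scan and return discovered items (stub)."""
    return []


@workflow.defn
class ScanWorkflow:
    @workflow.run
    async def run(self, target: str) -> list[str]:
        # Temporal handles queuing, retries, and state, so no hand-rolled
        # RabbitMQ/Redis plumbing is needed here.
        findings = await workflow.execute_activity(
            scan_target,
            target,
            start_to_close_timeout=timedelta(minutes=5),
        )
        # Follow-up scans could be launched as further activities or child workflows.
        return findings
```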