Hi everyone, I need to create a service which could receive messages from a FIFO data stream and send the messages in order to each client server.
Let say the the data stream contains of A1, A2, A3, A4, B1, B2, B3, C1, C2, etc., then I need to send message A1, A2, A3, A4 sequentially to server A, message B1, B2, B3, B4 sequentially to server B, and so on. Each client server must receive the messages in order. The input data stream is guaranteed to be in order.
Requirements:
- Number of message pushed into the data stream is up to 10,000 messages per seconds.
- Number of different client servers is up to 1,000,000 and increasing. (A, B, C, …)
- Number of message sequence per client servers is up to 10 sequential message. (A1, A2, …, A10)
- Client servers need to receive messages in near real-time (less than 2 minutes after messages pushed into the data stream).
Here’s the problem I encountered:
There’re high chances of client servers are not responding. In that case, the service needs to wait and stop sending messages to that client server until the client server is ready to receive messages. (e.g. server A is down, then the service should stop sending messages to server A, but the service keeps continue sending messages to another non-down servers).
Here’s my current solution:
I was thinking about storing all messages from the input data stream into DB. Simultaneously, there’s a cron job that will select first message from the DB where the message is not sent order by their timestamp. The selected messages will be sent asynchronously to their respective client server.
However, I read many blogs online and they do not suggesting to store messages on DB (Never ever use a database as a message queue), thus I’m looking for another architecture suggestions, but couldn’t find one.
Do you anyone have any architecture suggestions for this?
Any third party services (AWS, GCP, Kafka, Ably, etc.) is allowed.
2
Answers
Given the constraints you mentioned around the server’s availability, I would think of a solution that uses one queue per server.
Basically your app reads from the FIFO queue, accumulates in-memory 10 messages per server (ex. for server A) , and pushes those messages in a dedicated queue that is consumed only by server A.
This way you switch the data transfer model from push-based to pull-based and you don’t have to worry about server A availability, because if it’s down, the messages will remain in its queue and when it gets back up, it will process the latest messages in the same order.
Now of course, you’d need 1M+ queues. If you use AWS SQS, which guarantees ordering for FIFO delivery logic, there are no limits in terms of the number of queues you can create per account, so at least technically it should work.
For those types of high-throughput, low-latency use-cases, I wouldn’t use a DB (especially a relational one), because it can quickly become a bottleneck.
Instead, I would check Redis Streams to see if it fits this use-case. Some people say it handles 200K streams pretty well, I think it can go up to 1M+ without any issues.
The use case you face is pretty clear a typical one for messaging solutions. Since you have one consumer per, let’s say, group of messages, then you can use either Message Queues (more suitable) or pub/sub. Especially the fact that the client might be irresponsive for a while, leads to this kind of communication for better performance.
Now, regarding technologies, you can choose from many available (especially since you dont have any message size/type requirements). For example kafka can handle the numbers you mention, ordering can be achieved fairly easily, it also stores messages if you like, there s a lot of support online, it is scalable, it is free.
Similar with rabbitmq and activemq. I ve not used any in high volumes, but in theory they can both handle your requirements, they re scalable, they re free.
If you re using cloud services, there are ready options that are fully managed, like AWS SQS as already mentioned, GoogleCloud pub/sub etc. So those are easier to use but are not free!
If you still dont want to try some new technology, even though it suits your case best, I would propose to get rid of the cronJob and have the clients make requests to DB. Because if 1M clients are unresponsive for a minute, the cronJob still does requests to DB and to 1M clients and this sounds redundant. You can use some Redis to map server to records from the DB and not even bother the main DB if there are no messages for this client. Of course, you should consider using queries performance best practises which apply well in your case, like indexing and partitioning.