As I understand it from the book, a Kafka Streams state store is an in-memory key/value store used to hold data on its way to Kafka or after filtering.
I am confused by some theoretical questions.
- How does Kafka Streams state differ from other in-memory stores such as Redis?
- What is a real use case for state storage in Kafka Streams?
- Why is a topic not an alternative to state storage?
2 Answers
A topic contains messages in a sequential order that typically represents a log.
Sometimes we want to aggregate these messages: group them, perform an operation such as a sum, and store the result somewhere we can later retrieve it by key. In that case a key-value store is a better fit than a topic, which is a log structure.
A simple use case is word count, where we keep each word together with a counter of how many times it has occurred. You can see more examples in kafka-streams-examples on GitHub.
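To make the log versus key-value distinction concrete, here is a minimal plain-Java sketch. It does not use the Kafka Streams API: the `List` stands in for a topic, the `HashMap` stands in for a state store, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; nothing here is part of the Kafka Streams API.
class WordCountSketch {
    // Folds an ordered log of words (the "topic") into a key-value
    // "store" that maps each word to the number of times it was seen.
    static Map<String, Long> countWords(List<String> log) {
        Map<String, Long> store = new HashMap<>();
        for (String word : log) {
            // Read the previous count for this key and increment it.
            store.merge(word, 1L, Long::sum);
        }
        return store;
    }

    public static void main(String[] args) {
        List<String> log = List.of("kafka", "streams", "kafka");
        Map<String, Long> store = countWords(log);
        // Lookup by key, which a sequential log cannot offer directly.
        System.out.println(store.get("kafka"));
    }
}
```

The log keeps every occurrence in order, while the store keeps only the latest count per word, which is what the next increment needs.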
State can be considered a savepoint from which you can resume your data processing, or it may hold information needed for further processing (like the previous word count that we need to increment). It can therefore be stored using Redis, RocksDB, Postgres, etc.
Redis can be plugged in as state storage for Kafka Streams; however, the default persistent state store for Kafka Streams is RocksDB.
Therefore, Redis is not an alternative to Kafka Streams state but an alternative to Kafka Streams' default RocksDB.
- Why is a topic not an alternative to state storage?
A topic is the final storage for a state store under the hood (everything is a topic in Kafka).
If you create a microservice named "myStream" (its application.id) with a state store named "MyState", you will see a topic called myStream-MyState-changelog appear, which holds the history of all changes to the state store.
RocksDB is only a local cache that improves performance, with a first layer of backup on the local disk; in the end, the real high availability and the exactly-once processing guarantee come from the underlying changelog topic.
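The relationship between the local store and its changelog can be sketched in plain Java. This is a simulation under stated assumptions, not the Kafka Streams API: the `List` plays the role of the changelog topic, the `HashMap` plays the role of the local RocksDB store, and `ChangelogSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation; none of these classes belong to the Kafka Streams API.
class ChangelogSketch {
    // Stand-in for the changelog topic: an append-only list of (key, value) updates.
    private final List<Map.Entry<String, Long>> changelog;
    // Stand-in for the local store (RocksDB by default in a real application).
    private final Map<String, Long> localStore = new HashMap<>();

    // A new instance rebuilds its local store by replaying the changelog in order.
    ChangelogSketch(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Long> update : changelog) {
            localStore.put(update.getKey(), update.getValue());
        }
    }

    // Every write goes to the local store and is also appended to the changelog.
    void put(String key, long value) {
        localStore.put(key, value);
        changelog.add(Map.entry(key, value));
    }

    Long get(String key) {
        return localStore.get(key);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> topic = new ArrayList<>();
        ChangelogSketch first = new ChangelogSketch(topic);
        first.put("kafka", 1L);
        first.put("kafka", 2L);
        // The first instance "crashes"; a replacement restores from the changelog.
        ChangelogSketch replacement = new ChangelogSketch(topic);
        System.out.println(replacement.get("kafka"));
    }
}
```

The local map can be thrown away at any time because the list of updates is enough to rebuild it, which mirrors the point that durability lives in the topic rather than in the local copy.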
A state store is not a storage system in its own right; it is just a local, efficient, guaranteed in-memory state for handling some business cases in a fully streamed way.
As an example:
For each incoming order (Topic1), I want to find any previous order (Topic2) to the same location in the last 6 hours.
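A rough plain-Java sketch of that lookup follows. It is not Kafka Streams code: the `Order` class, its fields, and the method names are assumptions made for illustration, `remember` stands for handling a record from Topic2, and `findRecent` stands for handling a record from Topic1. A real state store would also need a retention policy to drop old entries, which is left out here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration; the names are hypothetical and not from any Kafka API.
class RecentOrdersSketch {
    static final Duration WINDOW = Duration.ofHours(6);

    static final class Order {
        final String id;
        final String location;
        final Instant placedAt;

        Order(String id, String location, Instant placedAt) {
            this.id = id;
            this.location = location;
            this.placedAt = placedAt;
        }
    }

    // The "state": previous orders, keyed by location.
    private final Map<String, List<Order>> ordersByLocation = new HashMap<>();

    // Called for each previous order (a record from Topic2 in the example).
    void remember(Order previous) {
        ordersByLocation
                .computeIfAbsent(previous.location, k -> new ArrayList<>())
                .add(previous);
    }

    // Called for each incoming order (a record from Topic1 in the example).
    // Returns previous orders to the same location placed in the 6 hours
    // up to and including the incoming order's timestamp.
    List<Order> findRecent(Order incoming) {
        Instant cutoff = incoming.placedAt.minus(WINDOW);
        List<Order> matches = new ArrayList<>();
        for (Order o : ordersByLocation.getOrDefault(incoming.location, List.of())) {
            if (!o.placedAt.isBefore(cutoff) && !o.placedAt.isAfter(incoming.placedAt)) {
                matches.add(o);
            }
        }
        return matches;
    }
}
```

Answering the question for each incoming order is a keyed lookup into remembered state; doing the same against a raw topic would mean rescanning the log for every order.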