Say there are 32 partitions in my Azure Event Hub. I have one consumer reading all 32 partitions and doing checkpointing. But sometimes the volume of incoming messages to the Event Hub scales up. To manage that, I was thinking of scaling out the consumers, so there would be 2 consumers in the same consumer group, each reading all 32 partitions.
In the Azure Event Hubs documentation, I read that it is advisable to have only one active receiver on a partition within a consumer group; otherwise, duplicate events may be read.
My questions are:
- Even with checkpointing, is it possible for two different consumers to read the same event from the same partition?
- If so, what is the best way to handle this when the volume of incoming messages scales up? Should I create one consumer to read only partitions 1-16 and another to read partitions 17-32 to handle the load?
2 Answers
What is your consumer? Are you using a VM application, Azure Functions, or something similar?
Checkpointing is at the partition level, per consumer group. So it is only when you read the same partition from different consumer groups that you will definitely get duplicate events, which you will need to handle. If you have 2 applications using the same consumer group and the same partition, you will not get duplicate events, but you may get contention issues where both applications attempt to read the same partition while it is locked by one of them. This is handled automatically in Azure Functions.
How sure are you that 2 consumers will be enough for future load? If you are using Azure Functions and set up autoscaling, the contention issues are managed automatically and the servers will scale out either to your configured maximum or to the number of partitions, depending on the load. If you are sure that you will need only 2 VMs/applications, then yes, divide and conquer: split the partitions between the 2. This assumes that the event sender is sending events without a partition key or a partition ID, in which case the events are load balanced across all partitions.
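The "divide and conquer" split described above can be sketched as a small helper. This is only an illustration: `split_partitions` is a hypothetical name, not part of any SDK, and the sketch assumes the partition IDs are the zero-based strings "0" to "31" that Event Hubs uses (so "partitions 1-16" in the question corresponds to IDs "0" to "15").

```python
def split_partitions(partition_ids, consumer_count):
    """Split partition IDs into contiguous, near-equal chunks, one per consumer.

    Hypothetical helper for illustration only; not part of the Azure SDK.
    """
    if consumer_count < 1:
        raise ValueError("consumer_count must be at least 1")
    base, extra = divmod(len(partition_ids), consumer_count)
    assignments = []
    start = 0
    for index in range(consumer_count):
        # The first `extra` consumers take one additional partition.
        size = base + (1 if index < extra else 0)
        assignments.append(partition_ids[start:start + size])
        start += size
    return assignments


# Event Hubs partition IDs are zero-based strings.
partition_ids = [str(i) for i in range(32)]
assignments = split_partitions(partition_ids, 2)
print(assignments[0])  # IDs "0" through "15"
print(assignments[1])  # IDs "16" through "31"
```

Each application would then open a reader only for the partition IDs in its own chunk, for example by passing `partition_id` to `receive` in the Python client. The drawback of a static split like this is that nothing takes over a chunk if its application goes down.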
Yes, it is possible to have multiple readers from the same consumer group reading the same partition/events. It all depends on the client that is reading and how it opened the AMQP link.
The Python library is a bit different from most languages in that there is only one consumer type: `EventHubConsumerClient`. Its behavior changes depending on how you configure it. If you pass it a `checkpoint_store` and set callbacks when invoking `receive`, it will function like an event processor. In this mode, it will request exclusive access to partitions and coordinate with other consumers using the same consumer group and storage location to share work between them. In this mode, each partition will have only one active reader within the consumer group.
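A minimal sketch of that event-processor mode is below. It assumes the `azure-eventhub` and `azure-eventhub-checkpointstoreblob-aio` packages are installed; the environment variable names are placeholders chosen for this example, not anything the SDK defines.

```python
import asyncio
import os


async def on_event(partition_context, event):
    # Called once per event, for each partition this client currently owns.
    print(
        f"partition {partition_context.partition_id}: "
        f"seq {event.sequence_number}: {event.body_as_str()}"
    )
    # Record progress so another reader can resume from here.
    await partition_context.update_checkpoint(event)


async def main():
    # Imported here so the callback above can be exercised without the SDK.
    from azure.eventhub.aio import EventHubConsumerClient
    from azure.eventhub.extensions.checkpointstoreblobaio import BlobCheckpointStore

    # Placeholder environment variable names (assumptions for this sketch).
    checkpoint_store = BlobCheckpointStore.from_connection_string(
        os.environ["STORAGE_CONNECTION_STR"],
        os.environ["BLOB_CONTAINER_NAME"],
    )
    client = EventHubConsumerClient.from_connection_string(
        os.environ["EVENT_HUB_CONNECTION_STR"],
        consumer_group="$Default",
        eventhub_name=os.environ["EVENT_HUB_NAME"],
        checkpoint_store=checkpoint_store,
    )
    async with client:
        # No partition_id is given, so the client shares partitions with
        # other clients using the same consumer group and storage container.
        await client.receive(on_event=on_event, starting_position="-1")


# To run against a real Event Hub: asyncio.run(main())
```

Running the same script on a second machine with the same consumer group and blob container should let the two instances divide the 32 partitions between them, rather than both reading all 32.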
Additional context
From the service perspective, a consumer group is just a label that gets associated with a group of readers for an Event Hub. It is possible to ask the service to enforce exclusive access to a specific partition for readers in that group.
Checkpointing is not a service-side concept. The Azure SDK packages use this term to describe data that tracks where in a partition readers should resume reading from. The SDK includes the consumer group as part of that checkpoint data to differentiate from other logical groups of readers – but that’s a convention and not something enforced.
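To make the first question concrete, here is a toy model of client-side checkpointing. It is not SDK code: `CheckpointKey`, `InMemoryCheckpointStore` and `read_partition` are hypothetical names, and the simulated crash is an assumption made for the demonstration. It shows why a checkpoint alone does not rule out duplicates: anything processed after the last checkpoint is read again by whichever reader resumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointKey:
    # The consumer group is part of the key purely by convention.
    eventhub_name: str
    consumer_group: str
    partition_id: str


class InMemoryCheckpointStore:
    """Toy stand-in for a blob-backed checkpoint store."""

    def __init__(self):
        self._last_sequence = {}

    def update(self, key, sequence_number):
        self._last_sequence[key] = sequence_number

    def resume_from(self, key):
        # Next sequence number to read; 0 when no checkpoint exists yet.
        return self._last_sequence.get(key, -1) + 1


def read_partition(events, store, key, checkpoint_every, crash_after=None):
    """Process events from the checkpointed position; return what was processed."""
    processed = []
    for seq in range(store.resume_from(key), len(events)):
        if crash_after is not None and len(processed) == crash_after:
            break  # simulated crash before the next checkpoint is written
        processed.append(seq)
        if (seq + 1) % checkpoint_every == 0:
            store.update(key, seq)
    return processed


store = InMemoryCheckpointStore()
key = CheckpointKey("orders", "$Default", "0")
events = list(range(10))  # sequence numbers 0..9 in one partition

first_run = read_partition(events, store, key, checkpoint_every=4, crash_after=6)
second_run = read_partition(events, store, key, checkpoint_every=4)
duplicates = sorted(set(first_run) & set(second_run))
print(duplicates)  # [4, 5]
```

The first reader processed sequence numbers 0 to 5 but only checkpointed up to 3, so the second reader starts at 4 and sees 4 and 5 again. Event handlers therefore need to tolerate repeated events regardless of how partitions are assigned.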
Each Azure SDK package offers an event processor type, or in Python and JavaScript, a mode of the consumer that acts like one. An event processor will ensure only a single reader of each partition for the consumer group + storage location combination. It will also coordinate with other processors using that consumer group and storage location to share work and equally distribute partition ownership.
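The way processors share partitions can be pictured with a toy simulation. This is not the SDK's actual algorithm; `rebalance` is a hypothetical function that only illustrates the idea of claiming unowned partitions and taking over partitions from the most loaded processor until each has at least its minimum share.

```python
from collections import Counter


def rebalance(ownership, processors):
    """Toy ownership balancing; mutates and returns `ownership`.

    `ownership` maps partition ID -> owning processor name, or None.
    """
    partitions = list(ownership)
    # Claims held by processors that are no longer running are released.
    for pid in partitions:
        if ownership[pid] not in processors:
            ownership[pid] = None
    min_share = len(partitions) // len(processors)
    changed = True
    while changed:
        changed = False
        for proc in processors:
            unowned = [p for p in partitions if ownership[p] is None]
            if unowned:
                # Each processor claims one free partition per turn.
                ownership[unowned[0]] = proc
                changed = True
                continue
            owned_count = sum(1 for p in partitions if ownership[p] == proc)
            if owned_count < min_share:
                # Below the minimum share: take one from the busiest processor.
                counts = Counter(ownership.values())
                donor = max(processors, key=lambda name: counts[name])
                victim = next(p for p in partitions if ownership[p] == donor)
                ownership[victim] = proc
                changed = True
    return ownership


ownership = {str(i): None for i in range(32)}
rebalance(ownership, ["A", "B"])
two_way = Counter(ownership.values())    # 16 partitions each

rebalance(ownership, ["A", "B", "C"])    # a third processor joins
three_way = Counter(ownership.values())  # 11, 11 and 10 partitions
print(two_way, three_way)
```

In this model, scaling out is just a matter of starting another processor with the same consumer group and storage location. Going beyond 32 processors would leave the extra ones idle, since each partition has only one owner.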