We are using Azure Event Hubs (EH). We were testing a retry policy to see what happens when EH doesn't have enough throughput units (TUs) to handle the load (both receiving and sending). We sent 3M events (in batches of 1K) to EH with 1 TU and 32 partitions. Each event is ~1 KB in size. We have a single event hub within a namespace, with a single consumer client consuming events (in batches of 100). Our retry policy for consuming events is currently to retry 120 times with a 1 s delay (we are aware this isn't appropriate, but we're testing the current setup); a sketch of how the clients are configured follows the observations below. What we observed was:
- 3.216M events received by EH (EH incoming messages metric)
- 3.009M events sent from EH (EH outgoing messages metric)
- 3M events received by the consumer (sum of the per-partition differences between sequence numbers before and after the run)
- 216k failed requests (EH namespace failed requests metric)
- 39k successful requests (EH successful requests metric)
- 1.178k throttled requests (EH throttled requests metric)
- The whole processing took 50 minutes.
- No application restarts happened. Also, checkpointing happens as soon as a message batch is processed by the consumer.
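For reference, this is roughly how the clients are wired up (a minimal sketch using the azure-eventhub Python SDK; the connection string, hub name, payload, and processing step are placeholders, and the retry kwargs reflect my understanding of the SDK's options):

```python
import os
from azure.eventhub import EventData, EventHubConsumerClient, EventHubProducerClient

CONN_STR = os.environ["EVENTHUB_CONN_STR"]  # placeholder
EVENTHUB_NAME = "my-hub"                    # placeholder

# --- Producer: 3M events, sent in batches of 1K (each real event is ~1 KB) ---
producer = EventHubProducerClient.from_connection_string(
    conn_str=CONN_STR, eventhub_name=EVENTHUB_NAME
)
payload = b'{"placeholder": true}'  # stand-in; real events are ~1 KB
with producer:
    for _ in range(3_000):              # 3_000 batches x 1_000 events = 3M events
        batch = producer.create_batch()
        for _ in range(1_000):
            batch.add(EventData(payload))
        producer.send_batch(batch)      # the SDK retries transient failures internally

# --- Consumer: batches of 100, retried up to 120 times with a fixed ~1 s delay ---
consumer = EventHubConsumerClient.from_connection_string(
    conn_str=CONN_STR,
    consumer_group="$Default",
    eventhub_name=EVENTHUB_NAME,
    retry_total=120,         # the retry policy under test
    retry_mode="fixed",
    retry_backoff_factor=1,  # ~1 s between attempts
    # checkpoint_store omitted for brevity
)

def on_event_batch(partition_context, events):
    # ... application processing of `events` goes here ...
    partition_context.update_checkpoint()  # checkpoint as soon as the batch is processed

with consumer:
    consumer.receive_batch(on_event_batch=on_event_batch, max_batch_size=100)
```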
My questions are:
- Does the EH incoming messages metric include events that failed to be received? The math would add up, but it seems counterintuitive, because those events never physically ended up in EH.
- How do we explain the 9k excess in events sent from EH? How does it correlate with the failed requests?
- One request should include a batch of events (because we send in batches), not a single event. Why does a failed request appear to map 1:1 to extra ingress?
Thank you for the help!
2 Answers
I reached out to MS support. For my particular case, this was the explanation:
Azure Event Hubs typically counts all events that are sent to the hub, including those that may have failed to be received due to various issues. This means that while the metric reflects the total number of events sent, it does not differentiate between successfully received events and those that failed to reach the Event Hub due to network issues or other failures. If the entire batch fails due to a transient issue (like network problems or throttling), the system may attempt to resend that same batch. Each retry counts as a new ingress event (regardless of the fact that it may be a batch with multiple events), leading to an increase in the total number of events sent.
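As I understand that explanation, the pattern looks roughly like this (a hedged sketch with the azure-eventhub Python SDK; the SDK's built-in retries do the equivalent internally, and the exception handling here is simplified):

```python
import time
from azure.eventhub import EventData, EventHubProducerClient
from azure.eventhub.exceptions import EventHubError

def send_with_resend(producer: EventHubProducerClient, payloads, attempts: int = 3):
    """Resend the same batch on transient failures (throttling, network blips)."""
    batch = producer.create_batch()
    for payload in payloads:
        batch.add(EventData(payload))
    for attempt in range(attempts):
        try:
            # Per the support explanation, each attempt is counted toward the
            # incoming messages metric, even if the batch never lands in the hub.
            producer.send_batch(batch)
            return
        except EventHubError:
            # Transient failure (e.g. throttling on 1 TU): wait, then resend
            # the identical batch; the metric counts it again.
            time.sleep(1)
    raise RuntimeError("batch could not be sent after retries")
```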
The short version is that the metrics will never align 1:1. You should always expect to see a higher number for outgoing messages than for incoming messages once all events in the partition have been processed.
There are a number of reasons for this, including:
- Clients invoking service operations have intermittent failures and retry. Clients will always prioritize data safety and resend events for processing if they cannot authoritatively prove those events were seen by your application.
- Clients generally prefetch events and hold them in a memory queue to enable consistent throughput. Any time the client is closed, or an intermittent network issue causes a connection/link to be recreated, the prefetch cache is refreshed.
- Ownership of a partition moves from one Event Processor to another, which causes the stream to rewind to the most recently recorded checkpoints. The new owner will re-read data that the old owner was already holding in memory (see the sketch after this list for one way to tolerate that).
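For the third point, one mitigation is to treat the last processed sequence number per partition as a watermark and skip anything at or below it after a rewind. A rough sketch with the Python SDK; `handle` and the in-memory tracking dict are hypothetical stand-ins for real application logic and durable state:

```python
# Last sequence number fully processed, per partition (hypothetical in-memory
# tracking; real deployments would persist this alongside their checkpoints).
last_processed: dict[str, int] = {}

def on_event_batch(partition_context, events):
    pid = partition_context.partition_id
    watermark = last_processed.get(pid, -1)

    # Events at or below the watermark were already handled before an ownership
    # move or reconnect; they inflate the outgoing-messages metric but should
    # not be processed twice.
    fresh = [e for e in events if e.sequence_number > watermark]
    if fresh:
        handle(fresh)  # hypothetical application processing
        last_processed[pid] = fresh[-1].sequence_number

    partition_context.update_checkpoint()
```

The handler plugs into `EventHubConsumerClient.receive_batch(on_event_batch=...)` as usual.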
The disparity is even larger when you're looking at the incoming and outgoing bytes metrics. Because the service owns certain data on events (such as the offset, sequence number, and enqueued time), it adds that data to each event delivered to receivers. You should always expect a higher number of outgoing bytes than incoming bytes.
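You can see that broker-stamped metadata directly on received events; in the Python SDK it is exposed as read-only properties on `EventData` (a minimal illustration):

```python
def log_broker_metadata(event):
    # Stamped by the Event Hubs service on delivery; none of this was part of
    # the payload the producer published, which is one reason outgoing bytes
    # run higher than incoming bytes.
    print("offset:", event.offset)
    print("sequence number:", event.sequence_number)
    print("enqueued time (UTC):", event.enqueued_time)
```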