I created an event rule in CloudWatch for the SageMaker training job state change so I can monitor my training jobs. I use these events to trigger a Lambda function that sends messages to a Telegram group as a bot, so I receive a message every time one of the training jobs changes its status. It works, but there is a problem with the events: they are fired multiple times with the exact same payload, so I receive tons of duplicate messages.
Since all the payloads are identical (except for the LastModifiedTime field), I cannot filter them in the Lambda. Unfortunately I don't have the AWS Developer plan, so I cannot get support from Amazon. Any ideas?
EDIT
There are no duplicate rules/events. I also noticed that enabling the SageMaker profiler (which is now on by default) makes the number of identical rule invocations explode. All of them have the same payload except for LastModifiedTime, so I suspect there is a bug in AWS. One solution could be to keep some state in the Lambda and check whether an invocation has already been processed, but I don't want to complicate something that should be very simple. I just launched a new training job and got this sequence (I only report the fields I parse):
Status: InProgress
Secondary Status: Starting
Status Message: Launching requested ML instances
Status: InProgress
Secondary Status: Starting
Status Message: Starting the training job
Status: InProgress
Secondary Status: Starting
Status Message: Starting the training job
Status: InProgress
Secondary Status: Starting
Status Message: Starting the training job
Status: InProgress
Secondary Status: Starting
Status Message: Preparing the instances for training
Status: InProgress
Secondary Status: Downloading
Status Message: Downloading input data
Status: InProgress
Secondary Status: Training
Status Message: Downloading the training image
Status: InProgress
Secondary Status: Training
Status Message: Training in-progres
Status: InProgress
Secondary Status: Training
Status Message: Training image download completed. Training in progress
2 Answers
After a lot of experiments I can answer my own question: SageMaker generates multiple events with the same payload, except for the LastModifiedTime field. I don't know if this is a bug, but in my opinion it should not happen. These are rules defined by AWS itself, so there is nothing I can customize. The situation is even worse if you enable the profiler. There is nothing more I can do, since I have already posted on the official AWS forum multiple times without any luck.

Duplicate messages can happen but should be very rare. You should check whether there are any duplicate rules / schedules. You can use metrics to identify what's being invoked / matched: https://docs.aws.amazon.com/eventbridge/latest/userguide/eventbridge-monitoring-cloudwatch-metrics.html
Another possible reason is that your rules are too broad and match multiple events from the same source. You can add CloudWatch Logs as another target on the same rule to see which events get matched and whether any filtering is needed.
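As a rough illustration of narrowing a rule, here is a sketch of a tighter event pattern built as a Python dict. The source and detail-type values are the ones SageMaker uses for training job state changes; filtering on detail.TrainingJobStatus and the particular statuses listed are assumptions for the example, so compare them against the events you actually see in CloudWatch Logs before relying on them.

```python
import json

# Sketch of a narrower EventBridge event pattern.
# Assumption: the event detail carries a TrainingJobStatus field and you
# only care about terminal states; adjust the list to your needs.
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
    "detail": {
        "TrainingJobStatus": ["Completed", "Failed", "Stopped"],
    },
}

# The JSON string is what you would paste into the rule's event pattern.
pattern_json = json.dumps(event_pattern, indent=2)
print(pattern_json)
```

Note the trade-off: a pattern like this drops the intermediate InProgress updates entirely, so it only helps if end-of-job notifications are enough.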
It's also possible that SageMaker simply sends duplicate events to EventBridge, in which case your best option would be to use ElastiCache to temporarily store the ids and check against them in your Lambda.
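A minimal sketch of that check, under a few assumptions: since the question reports that the payloads differ only in LastModifiedTime, the fingerprint here is a hash of the event detail with that field removed, rather than the event id. The in-memory dict is a stand-in used only to keep the example self-contained; it survives only while a Lambda container stays warm, so in practice you would replace it with ElastiCache or another shared store. The names event_fingerprint, is_duplicate and TTL_SECONDS are hypothetical.

```python
import hashlib
import json
import time

# Stand-in for a shared store such as ElastiCache (assumption: swap this out
# in a real deployment; module-level state is lost on cold starts).
_seen = {}
TTL_SECONDS = 3600  # hypothetical retention window for fingerprints


def event_fingerprint(event):
    """Hash the event detail, ignoring the field reported to differ."""
    detail = dict(event.get("detail", {}))
    detail.pop("LastModifiedTime", None)
    canonical = json.dumps(detail, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_duplicate(event, now=None):
    """Return True if an equivalent event was seen within TTL_SECONDS."""
    now = time.time() if now is None else now
    # Drop expired fingerprints so the store does not grow without bound.
    for key in [k for k, expires in _seen.items() if expires <= now]:
        del _seen[key]
    fingerprint = event_fingerprint(event)
    if fingerprint in _seen:
        return True
    _seen[fingerprint] = now + TTL_SECONDS
    return False


def lambda_handler(event, context):
    if is_duplicate(event):
        return {"sent": False}
    # Send the Telegram message here (omitted in this sketch).
    return {"sent": True}
```

With a shared cache the same idea is usually expressed as a single atomic "set if not exists, with expiry" operation on the fingerprint, which also avoids races between concurrent Lambda invocations.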