
I have an ECS service that I want to scale up and down depending on how many items are in an SQS queue.

resource "aws_cloudwatch_metric_alarm" "sqs_scale_up" {
  alarm_name = "scale-up"

  comparison_operator       = "GreaterThanOrEqualToThreshold"
  evaluation_periods        = "1"
  metric_name               = "ApproximateNumberOfMessagesVisible"
  namespace                 = "AWS/SQS"
  period                    = "60"
  threshold                 = "1"
  statistic                 = "Sum"
  alarm_description         = "Increase task count"
  insufficient_data_actions = []
  alarm_actions             = [aws_appautoscaling_policy.scale_up.arn]

  dimensions = {
    QueueName = aws_sqs_queue.this.name
  }
}

resource "aws_cloudwatch_metric_alarm" "sqs_scale_down" {
  alarm_name          = "scale-down"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = "60"
  threshold           = "1"
  statistic           = "Sum"
  alarm_description   = "Decrease task count"
  alarm_actions       = [aws_appautoscaling_policy.scale_down.arn]

  dimensions = {
    QueueName = aws_sqs_queue.this.name
  }
}

The fact that I have one alarm for count > 0 and one alarm for count < 1 means that one of these alarms will always be in the ALARM state, right?

Is this normal?
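For context, the alarm_actions in both alarms reference App Auto Scaling step scaling policies that are not shown in the question. A minimal sketch of what they might look like, assuming placeholder aws_ecs_cluster.this and aws_ecs_service.this resources and example capacities:

resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 5
  min_capacity       = 0
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.this.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "scale_up" {
  name               = "scale-up"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 60
    metric_aggregation_type = "Maximum"

    # Add one task whenever the scale-up alarm fires.
    step_adjustment {
      metric_interval_lower_bound = 0
      scaling_adjustment          = 1
    }
  }
}

resource "aws_appautoscaling_policy" "scale_down" {
  name               = "scale-down"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 300
    metric_aggregation_type = "Maximum"

    # Remove one task whenever the scale-down alarm fires.
    step_adjustment {
      metric_interval_upper_bound = 0
      scaling_adjustment          = -1
    }
  }
}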

2 Answers


  1. Don’t panic over the word ‘ALARM’. Instead, think of it as saying that the condition is TRUE.

    If there are any messages in the queue, you presumably want to scale out from a "nothing is running" state. Therefore, you want the scale-out alarm to be TRUE. However, you need to set a limit so that it doesn’t continually scale out; it might just need one task.

    When the queue is empty, you want to scale in. However, you don’t want to flip-flop between the two states. The general rule is "scale out quickly, but scale in slowly". Therefore, the scale-in alarm should use a longer evaluation period before deciding to scale in (e.g. 10 minutes), as sketched at the end of this answer.

    Thus, there might not always be an alarm in the TRUE (ALARM) state. If there are no messages in the queue, then the scale-out alarm will be FALSE. And if the sum of ApproximateNumberOfMessagesVisible over the previous 10 minutes is not zero, then the scale-in alarm won’t be TRUE either. Instead, both alarms will be FALSE, so nothing will be changing at that time. This is good.

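    Applied to the Terraform in the question, the scale-in alarm under that rule might look like the sketch below (the 10-minute window, i.e. 10 consecutive 60-second periods, is only an example value):

    resource "aws_cloudwatch_metric_alarm" "sqs_scale_down" {
      alarm_name          = "scale-down"
      comparison_operator = "LessThanThreshold"
      evaluation_periods  = 10   # queue must be empty for ~10 minutes before scaling in
      metric_name         = "ApproximateNumberOfMessagesVisible"
      namespace           = "AWS/SQS"
      period              = 60
      threshold           = 1
      statistic           = "Sum"
      alarm_description   = "Decrease task count"
      alarm_actions       = [aws_appautoscaling_policy.scale_down.arn]

      dimensions = {
        QueueName = aws_sqs_queue.this.name
      }
    }
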
  2. Settings like count >= 1 and count < 1 look a little odd, but it depends on what you want to achieve.
    It can be fine if you want to scale the ECS service down to 0 when there are no messages to process and provision some capacity otherwise.

    There are caveats to this approach: in the worst case, when messages arrive, processing will be delayed by up to 60 seconds (one alarm period) plus the ECS provisioning time.

    On the other hand, if you want ongoing message processing, with some bare minimum of workers always on duty, and only want to scale up under load, set the threshold to a value greater than the average number of messages in the queue plus some headroom (see the sketch below).

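    A sketch of that variant: the bare minimum of workers comes from min_capacity on the App Auto Scaling target (e.g. min_capacity = 1), so the scale-in policy can never take the service to zero, and the scale-out alarm only fires once the backlog is clearly above normal. The threshold of 100 messages below is purely illustrative and should be tuned to your own queue depth:

    resource "aws_cloudwatch_metric_alarm" "sqs_scale_up" {
      alarm_name          = "scale-up"
      comparison_operator = "GreaterThanOrEqualToThreshold"
      evaluation_periods  = 1
      metric_name         = "ApproximateNumberOfMessagesVisible"
      namespace           = "AWS/SQS"
      period              = 60
      threshold           = 100   # illustrative: pick a value above your normal backlog
      statistic           = "Sum"
      alarm_description   = "Increase task count"
      alarm_actions       = [aws_appautoscaling_policy.scale_up.arn]

      dimensions = {
        QueueName = aws_sqs_queue.this.name
      }
    }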