
I have an ECS service with 1 Fargate task. There is an ALB that routes traffic to it. This task is receiving a continuous stream of traffic to one endpoint from a load tester. I noticed that whenever I redeploy the same task definition, there is either a jump or a drop in average service CPU, and then it seems to reach steady state at the new lower or higher CPU. I’ve been checking different metrics and logs and can’t seem to find a pattern, as the number of incoming requests stays relatively stable, and no warnings or errors are being thrown in the logs.

Would anyone have any ideas of what to explore?

[Graph: Average CPU of the service over time]

Edit

Here is, roughly, the task definition used for the task above (stripped of any null or empty values, and of otherwise descriptive ones like 'family'). Note that it is a two-container task: (1) an application, (2) an nginx sidecar.

{
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": <options>
      },
      "portMappings": <ports>
      "image": <image>,
      "name": "app"
    },
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": <options>
      },
      "portMappings": <ports>,
      "image": <http_image>,
      "dependsOn": [
        {
          "containerName": "app",
          "condition": "START"
        }
      ],
      "essential": true,
      "name": "http"
    }
  ],
  "requiresAttributes": [
    {
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.17"
    },
    {
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "name": "ecs.capability.container-ordering"
    },
    {
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "name": "ecs.capability.task-eni"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ]
}

2 Answers


  1. This may be a misconfiguration of your task definition. Are you able to post it here? Here is a doc on ECS resource tuning; a sketch of what per-container tuning can look like follows below.
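    If resource tuning turns out to be the issue, one knob is explicit per-container cpu/memory reservations, so the app and its sidecar get predictable shares of the task's allocation. Below is an illustrative sketch using boto3; the family name, the 768/256 CPU split, and the memory values are assumptions, not recommendations, and log configuration is omitted for brevity.

    import boto3

    ecs = boto3.client("ecs")

    # Illustrative only: reserve explicit CPU/memory for each container so the
    # sidecar cannot starve the app. The split shown is an assumption, not a
    # recommendation; log configuration is omitted for brevity.
    ecs.register_task_definition(
        family="my-service",  # placeholder
        networkMode="awsvpc",
        requiresCompatibilities=["FARGATE"],
        cpu="1024",
        memory="2048",
        containerDefinitions=[
            {
                "name": "app",
                "image": "<image>",
                "cpu": 768,
                "memory": 1536,
                "essential": True,
            },
            {
                "name": "http",
                "image": "<http_image>",
                "cpu": 256,
                "memory": 512,
                "essential": True,
                "dependsOn": [{"containerName": "app", "condition": "START"}],
            },
        ],
    )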

  2. You are probably misreading the graph, judging from the fact that the average CPU drops or jumps by a factor of 2.

    When you redeploy while your task is at 70% CPU, ECS launches a new task (using a new version of the image). Once the second task is running, two tasks share the load (as far as the graph is concerned, at least, since the new task might not yet be receiving actual traffic), so the average CPU becomes 35%: the 50% drop.

    When the new task is stable (health checks, smoke tests, etc. pass), ECS drains and shuts down the old task. When this happens you will notice a 100% jump in average CPU, only because the average is now calculated over one task instead of two.

    You can verify (or rule out) this explanation by correlating the jumps and drops shown on the graph with the times tasks were added or removed during a redeploy; the sketch below shows one way to do that with the CloudWatch API.

    The same applies when ECS auto scaling is used: as tasks are added, the average CPU usage drops.
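    For example, here is a minimal sketch of that check using boto3 (the cluster and service names are placeholders). It pulls SampleCount alongside Average for the service's CPUUtilization metric; with a 60-second period, SampleCount should roughly track the number of tasks reporting, so a halving or doubling of Average that coincides with a step in SampleCount points to a task-count change rather than a load change.

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")

    # Fetch Average and SampleCount together for the service-level CPU metric.
    # "my-cluster" / "my-service" are placeholder names.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": "my-cluster"},
            {"Name": "ServiceName", "Value": "my-service"},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=["Average", "SampleCount"],
    )

    # A step in SampleCount at the same timestamp as a step in Average means
    # the number of tasks reporting changed, not the per-task load.
    for p in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        print(p["Timestamp"], f"avg={p['Average']:.1f}%",
              f"samples={p['SampleCount']:.0f}")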
