
Problem statement: The target group shows the health status Initial, and the website intermittently returns 502 Bad Gateway and 504 Gateway Timeout errors. If I re-register the same target manually in the target group, the website works fine.

How to reproduce?

I have a deployment file (e.g. aws.yml) that contains a job to register a container IP with the ELB for Fargate.

Here is the sample configuration:

name: Register Container IP in ELB Fargate
id: register-ip
env:
  ECS_CLUSTER: ${{ env.ECS_CLUSTER }}
  AWS_TARGET_GROUP_ARN: ${{ inputs.AWS_TARGET_GROUP_ARN }}
run: |
  echo "ECS_CLUSTER: $ECS_CLUSTER"
  echo "AWS_TARGET_GROUP_ARN: $AWS_TARGET_GROUP_ARN"
      
  aws ecs list-tasks --cluster $ECS_CLUSTER
  TASK_ARN=$(aws ecs list-tasks --cluster $ECS_CLUSTER --query "taskArns[0]" --output text)
  echo "TASK_ARN: $TASK_ARN"
      
  aws ecs describe-tasks --tasks "$TASK_ARN" --cluster $ECS_CLUSTER
  TASK_IP=$(aws ecs describe-tasks --tasks "$TASK_ARN" --cluster $ECS_CLUSTER --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" --output text)
  echo "TASK_IP: $TASK_IP"
      
  aws elbv2 register-targets --target-group-arn "$AWS_TARGET_GROUP_ARN" --targets Id=$TASK_IP,Port=3000  

This job runs successfully and registers the newly generated TASK_IP in the load balancer, under Registered targets in the target group. But, as I mentioned earlier, the health status shows "Initial" and the site does not load. If I re-register the same target manually, the website works fine.
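
When a target stays in the initial state, the reason code reported by the load balancer usually narrows down the cause. A minimal sketch, assuming the AWS CLI is installed and configured; the function name `target_health` is hypothetical, and the default port of 3000 is taken from the registration step above:

```shell
# Print the state, reason code, and description for one registered target.
# Arguments: target group ARN, target IP, target port (3000 by default).
target_health() {
  local tg_arn="$1" ip="$2" port="${3:-3000}"
  aws elbv2 describe-target-health \
    --target-group-arn "$tg_arn" \
    --targets "Id=${ip},Port=${port}" \
    --query 'TargetHealthDescriptions[0].TargetHealth.[State,Reason,Description]' \
    --output text
}
```

Called as `target_health "$AWS_TARGET_GROUP_ARN" "$TASK_IP"` right after the registration step, this prints the state followed by a reason such as `Elb.RegistrationInProgress` or `Elb.InitialHealthChecking` while the target is still initial.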

Here are the health check settings under EC2 -> Target Groups -> <Target Group Name>:

[screenshot: health check settings]

This is the registered target screenshot:

[screenshot: registered targets]

And here is the load balancer mapping screenshot:

[screenshot: load balancer mapping]

Can someone please help me resolve this issue?

Update: I was able to fix the health status issue by updating the health check settings to this:

[screenshot: updated health check settings]

Now the health status shows as Healthy for the newly registered IP 🙂

Final update / latest problem statement:

Whenever a new task IP is registered by the new deployment in the ELB Fargate target groups, the old running task IP immediately goes into an unhealthy state. A new task IP is then assigned, which takes a few minutes to become healthy.

The issue is that during the time it takes for the new IP to become healthy, the website experiences downtime and shows 502 and 504 errors.
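
For comparison, when targets are registered by hand, the ordering that avoids such a gap is to register the new IP, wait until it is in service, and only then deregister the old IP. A minimal sketch under stated assumptions: the old task keeps running until the swap completes (which the question does not confirm), the function name `swap_target` is hypothetical, and port 3000 comes from the task definition:

```shell
# Register the new target, wait for it to pass health checks, then drain the old one.
# Arguments: target group ARN, new task IP, old task IP, port (3000 by default).
swap_target() {
  local tg_arn="$1" new_ip="$2" old_ip="$3" port="${4:-3000}"
  aws elbv2 register-targets --target-group-arn "$tg_arn" \
    --targets "Id=${new_ip},Port=${port}" || return 1
  # Blocks until the new target reports healthy; non-zero if the waiter gives up.
  aws elbv2 wait target-in-service --target-group-arn "$tg_arn" \
    --targets "Id=${new_ip},Port=${port}" || return 1
  # Only now remove the old target, so there is never a moment without a healthy one.
  aws elbv2 deregister-targets --target-group-arn "$tg_arn" \
    --targets "Id=${old_ip},Port=${port}" || return 1
  # Blocks until the load balancer has finished draining the old target.
  aws elbv2 wait target-deregistered --target-group-arn "$tg_arn" \
    --targets "Id=${old_ip},Port=${port}"
}
```

If the old task is stopped by the deployment before the new target is healthy, this ordering alone does not help, because nothing is left to serve traffic during the wait.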

These are the portMappings and networkMode settings in my task definition:

{
  ...
  "networkMode": "awsvpc",
  ...
  "containerDefinitions": [
    {
      "name": "XXXXX",
      ...
      "portMappings": [
        {
          "containerPort": 3000,
          "hostPort": 3000,
          "protocol": "tcp"
        }
      ]
      ... 
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ]
  ...
}

Is there a way to achieve zero downtime during this transition, ensuring that the website does not show these errors while the new IP is becoming healthy?

2 Answers


  1. If you have a single task per registered target, it makes sense that you get some 502/504 errors while updating your task, since it cannot run and be updated at the same time. To confirm that this is your issue, send some requests while the update is in progress; you should not get any 200 responses.

    To overcome this, you probably want to deploy via an ECS service, as stated by Mark B, and use a rolling deployment. That way you always have a running instance, even during a deployment.
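
The suggestion above can be sketched with the AWS CLI. This is a rough outline, not a definitive setup: the function names are hypothetical, and the desired count of 2, the 120-second health check grace period, and `assignPublicIp=DISABLED` are assumptions to adjust for your own network and application. The service registers and deregisters task IPs with the target group itself, so the manual `register-targets` step is no longer needed. `probe_site` covers the request test mentioned in the first paragraph.

```shell
# Create a Fargate service that registers its tasks with the target group itself.
# Arguments: cluster, service name, task definition, target group ARN,
# container name, comma-separated subnet IDs, comma-separated security group IDs.
create_rolling_service() {
  local cluster="$1" service="$2" task_def="$3" tg_arn="$4"
  local container="$5" subnets="$6" sgs="$7"
  # minimumHealthyPercent=100 keeps the old tasks until replacements are healthy;
  # maximumPercent=200 allows the replacements to start alongside them.
  aws ecs create-service \
    --cluster "$cluster" \
    --service-name "$service" \
    --task-definition "$task_def" \
    --launch-type FARGATE \
    --desired-count 2 \
    --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200" \
    --health-check-grace-period-seconds 120 \
    --load-balancers "targetGroupArn=${tg_arn},containerName=${container},containerPort=3000" \
    --network-configuration "awsvpcConfiguration={subnets=[${subnets}],securityGroups=[${sgs}],assignPublicIp=DISABLED}"
}

# Roll out a new task definition revision and block until the service is stable.
deploy_revision() {
  local cluster="$1" service="$2" task_def="$3"
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --task-definition "$task_def" >/dev/null &&
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}

# Probe a URL once per second and print a timestamp with the HTTP status code.
# Arguments: URL, number of probes (60 by default).
probe_site() {
  local url="$1" count="${2:-60}" i=0 code
  while [ "$i" -lt "$count" ]; do
    # -s hides progress, -o /dev/null drops the body, -w prints only the status.
    code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")"
    printf '%s %s\n' "$(date +%H:%M:%S)" "$code"
    i=$((i + 1))
    sleep 1
  done
}
```

Starting `probe_site` against the site URL in the background before calling `deploy_revision` shows whether any 502 or 504 responses still appear during the rollout.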

  2. Things to check:

    1. Select the target group in the load balancer and verify its health check settings.
    2. Path: verify that the path matches an endpoint that responds with a 200 OK status.
    3. Timeout and interval: adjust them to the container start time. A heavy application or a large container image can take 90 seconds or more to come up, and health checks that run during that window will definitely return 500 or 504 errors.
    4. Change the threshold values based on your application's behaviour.
    5. Also review the load balancer listener configuration and the routing rules and conditions to make sure nothing conflicts.
    6. We faced similar issues with ECS Fargate and set up an alert. We have to restart the task or delete the currently running container to resolve the transient issues.
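
The health check settings in the list above can also be changed from the command line. A minimal sketch: the function name `tune_health_check` is hypothetical, and every number is an illustrative assumption rather than a recommended value. With a 15-second interval and a healthy threshold of 2, a new target needs roughly 15 × 2 = 30 seconds of passing checks before it is marked healthy.

```shell
# Adjust the health check of a target group.
# Arguments: target group ARN, health check path ("/" by default).
tune_health_check() {
  local tg_arn="$1" path="${2:-/}"
  # The timeout is kept below the interval; the matcher accepts only HTTP 200.
  aws elbv2 modify-target-group \
    --target-group-arn "$tg_arn" \
    --health-check-path "$path" \
    --health-check-interval-seconds 15 \
    --health-check-timeout-seconds 5 \
    --healthy-threshold-count 2 \
    --unhealthy-threshold-count 3 \
    --matcher HttpCode=200
}
```

Lower values shorten the time a freshly registered target spends in the initial state, at the cost of marking a slow but working container unhealthy sooner.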