Problem statement: The target group shows the health status as Initial, and the website intermittently returns 502 Bad Gateway and 504 Gateway Timeout errors. If I re-register the same target manually, the website works fine.
How to reproduce?
I have a deployment file (e.g. aws.yml) that contains a job to register a container IP in ELB Fargate.
Here is the sample configuration:
name: Register Container IP in ELB Fargate
id: register-ip
env:
  ECS_CLUSTER: ${{ env.ECS_CLUSTER }}
  AWS_TARGET_GROUP_ARN: ${{ inputs.AWS_TARGET_GROUP_ARN }}
run: |
  echo "ECS_CLUSTER: $ECS_CLUSTER"
  echo "AWS_TARGET_GROUP_ARN: $AWS_TARGET_GROUP_ARN"
  aws ecs list-tasks --cluster "$ECS_CLUSTER"
  TASK_ARN=$(aws ecs list-tasks --cluster "$ECS_CLUSTER" --query "taskArns[0]" --output text)
  echo "TASK_ARN: $TASK_ARN"
  aws ecs describe-tasks --tasks "$TASK_ARN" --cluster "$ECS_CLUSTER"
  TASK_IP=$(aws ecs describe-tasks --tasks "$TASK_ARN" --cluster "$ECS_CLUSTER" --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" --output text)
  echo "TASK_IP: $TASK_IP"
  aws elbv2 register-targets --target-group-arn "$AWS_TARGET_GROUP_ARN" --targets "Id=$TASK_IP,Port=3000"
This job runs successfully and registers the newly generated TASK_IP in the load balancer, under the registered targets of the target group. But as I mentioned earlier, the health status shows “Initial” and the site does not load. If I re-register the same target manually, the website works fine.
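One possible contributor, offered as an assumption rather than a diagnosis: `taskArns[0]` may point at a task that is still starting, and the step exits before the load balancer has finished its first health checks, so the target is still "initial" when traffic arrives. A minimal sketch of a more defensive registration step is shown here. It relies on the AWS CLI waiters `aws ecs wait tasks-running` and `aws elbv2 wait target-in-service`; the function name `register_task_ip` and the `--desired-status RUNNING` filter are illustrative choices that are not part of the original workflow.

```shell
#!/usr/bin/env bash
# Sketch only: register_task_ip is a hypothetical helper name.
# Arguments: cluster name, target group ARN, optional port (default 3000).
register_task_ip() {
  local cluster="$1" tg_arn="$2" port="${3:-3000}"
  local task_arn task_ip

  # Pick the first task whose desired status is RUNNING.
  task_arn=$(aws ecs list-tasks --cluster "$cluster" \
    --desired-status RUNNING --query "taskArns[0]" --output text) || return 1
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "No running task found in $cluster" >&2
    return 1
  fi

  # Block until the task has actually reached the RUNNING state.
  aws ecs wait tasks-running --cluster "$cluster" --tasks "$task_arn" || return 1

  task_ip=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" \
    --output text) || return 1

  aws elbv2 register-targets --target-group-arn "$tg_arn" \
    --targets "Id=$task_ip,Port=$port" || return 1

  # Block until the target passes its health checks (leaves "initial").
  aws elbv2 wait target-in-service --target-group-arn "$tg_arn" \
    --targets "Id=$task_ip,Port=$port" || return 1

  echo "$task_ip"
}
```

In the workflow step this could be called as `register_task_ip "$ECS_CLUSTER" "$AWS_TARGET_GROUP_ARN"`. If the waiter times out, the step fails visibly instead of reporting success while the target is still unhealthy.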
Here are the health check settings under EC2 -> Target Groups -> <Target Group Name>:
This is the registered target screenshot :
And here is the load balancer mapping screenshot :
Can someone please help me resolve this issue?
Update: I was able to fix the health status issue by updating the health check settings to this.
Now the health status shows as Healthy for the newly registered IP 🙂
Final Update / Latest problem statement :
Whenever a new task IP is registered by the new deployment in the ELB Fargate target groups, the old running task IP immediately goes into an unhealthy state. A new task IP is then assigned, which takes a few minutes to become healthy.
The issue is that during the time it takes for the new IP to become healthy, the website experiences downtime and shows 502 and 504 errors.
These are the portMappings and networkMode settings in my task definition:
{
  ...
  "networkMode": "awsvpc",
  ...
  "containerDefinitions": [
    {
      "name": "XXXXX",
      ...
      "portMappings": [
        {
          "containerPort": 3000,
          "hostPort": 3000,
          "protocol": "tcp"
        }
      ]
      ...
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ]
  ...
}
Is there a way to achieve zero downtime during this transition, ensuring that the website does not show these errors while the new IP is becoming healthy?
2 Answers
If you have a single task per registered target, it makes sense that you get some 502/504 errors while updating your task, since it cannot run and update at the same time. To confirm that this is your issue, send some requests while the update is in progress; none of them should return 200.
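The check described above can be sketched as a small polling loop. This is only an illustration: `probe_site` and `SITE_URL` are hypothetical names, and the one-second interval and five-second timeout are arbitrary choices. It prints one HTTP status code per request so that 502/504 responses during a deployment are easy to spot.

```shell
#!/usr/bin/env bash
# Sketch only: probe_site is a hypothetical helper name.
# Arguments: URL to probe, optional number of requests (default 60).
probe_site() {
  local url="$1" count="${2:-60}" i code
  for ((i = 0; i < count; i++)); do
    # -s: silent, -o /dev/null: discard the body,
    # -w '%{http_code}': print only the status code (000 if no response).
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
    echo "$(date +%T) $code"
    sleep 1
  done
}

# Usage (SITE_URL is a placeholder for the load balancer address):
#   probe_site "$SITE_URL" 120 | tee probe.log
```

Start the loop just before triggering the deployment. Running `awk '{print $2}' probe.log | sort | uniq -c` afterwards gives a tally of the status codes seen.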
To overcome this, you probably want to deploy via an ECS service, as stated by Mark B, and use a rolling deployment. This lets you always have a running instance, even during a deployment.
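A rough sketch of that approach with the AWS CLI follows. Every concrete value here is a placeholder or an assumption: the function names, the desired count of 2, the 60-second health check grace period, and `assignPublicIp=DISABLED` (which assumes the subnets can reach the image registry without a public IP). When a service is attached to a target group through `--load-balancers`, ECS registers and deregisters task IPs itself, so the manual `register-targets` step is no longer needed.

```shell
#!/usr/bin/env bash
# Sketch only: function names and all argument values are placeholders.

# Create a Fargate service that ECS keeps registered in the target group.
create_rolling_service() {
  local cluster="$1" service="$2" task_def="$3" tg_arn="$4" container="$5"
  local subnets="$6" security_groups="$7"

  # minimumHealthyPercent=100 keeps the old tasks serving until the
  # replacement tasks are healthy; maximumPercent=200 allows the overlap.
  aws ecs create-service \
    --cluster "$cluster" \
    --service-name "$service" \
    --task-definition "$task_def" \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[$subnets],securityGroups=[$security_groups],assignPublicIp=DISABLED}" \
    --load-balancers "targetGroupArn=$tg_arn,containerName=$container,containerPort=3000" \
    --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200" \
    --health-check-grace-period-seconds 60
}

# Roll out a new task definition revision and wait for the service to settle.
deploy_new_revision() {
  local cluster="$1" service="$2" task_def="$3"
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --task-definition "$task_def" || return 1
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}
```

The deployment job would then call only `deploy_new_revision` with the new task definition revision. The container name passed to `create_rolling_service` has to match the `name` in the task definition's `containerDefinitions`.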