I built a service that processes data in Kubernetes Pods (Docker containers). Processing time varies from as little as 15 minutes to as much as 1 hour.
My application catches SIGTERM to ensure a graceful shutdown takes place when demand drops and Pods and Nodes are decommissioned.
In each Docker image I placed code that reports back whether the container shut down because it completed its work or because a SIGTERM arrived, in which case it wraps up its processing and terminates.
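A minimal, self-contained sketch of what that handler looks like in Python (the real service reports the reason to a backend rather than printing it, and the loop below is only a stand-in for the actual work):
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; flag it so the work loop can exit cleanly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    # Stand-in for the 15-60 minute processing loop.
    for _ in range(3600):
        if shutting_down:
            break
        time.sleep(1)  # placeholder for a unit of real work
    # Report back why we exited: normal completion vs. SIGTERM-driven shutdown.
    reason = "sigterm" if shutting_down else "completed"
    print(f"shutdown reason: {reason}")
    sys.exit(0)

if __name__ == "__main__":
    main()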
My system is deployed in AWS using EKS. I use EKS to manage node deployment, scaling nodes up when demand rises and spinning them down when demand drops. I use KEDA to manage Pod deployment, which in turn drives whether additional nodes are needed. In KEDA I have the cooldownPeriod set to 2 hours as a generous upper bound, even though the longest a pod should ever take is 1 hour.
In the Pod spec I have also set terminationGracePeriodSeconds to 2 hours (7200 seconds).
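For reference, those two settings look roughly like this; the resource names, trigger type, and queue are placeholders rather than my actual config:
# Deployment pod template fragment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 7200   # 2 hours
---
# KEDA ScaledObject fragment
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-mypod-scaler
spec:
  scaleTargetRef:
    name: test-mypod                        # the Deployment KEDA scales
  cooldownPeriod: 7200                      # 2 hours
  minReplicaCount: 0                        # scale to zero when there is no work
  triggers:
    - type: aws-sqs-queue                   # example trigger only
      metadata:
        queueURL: <MY QUEUE URL>
        queueLength: "5"
        awsRegion: <MY REGION>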
I isolated the issue to node scale-down: when nodes are being terminated, terminationGracePeriodSeconds is not honored and my Pods are shut down within ~30 minutes. Because the Pods are removed abruptly, I am unable to look at their logs to see what happened.
I tried to simulate this issue by issuing a Kubernetes node drain while keeping my pod running:
kubectl drain <MY NODE>
I saw the SIGTERM come through, and I also noticed that the pod was only terminated after 2 hours and not before.
So for a brief minute I thought maybe I had not configured terminationGracePeriodSeconds properly, so I checked:
kubectl get deployment test-mypod -o yaml|grep terminationGracePeriodSeconds
terminationGracePeriodSeconds: 7200
I even redeployed the config but that made no difference.
However, I was able to reproduce the issue by modifying the desiredSize of the Node group. I can reproduce it programmatically in Python by doing this:
# Change the node group's desired size via the EKS API (boto3)
resp = self.eks_client.update_nodegroup_config(
    clusterName=EKS_CLUSTER_NAME,
    nodegroupName=EKS_NODE_GROUP_NAME,
    scalingConfig={'desiredSize': configured_desired_size})
or simply by going to the AWS console and modifying the desiredSize there.
I then see EKS choose a node to remove, and if that node happens to host a pod processing a job that takes about an hour, the pod is sometimes terminated prematurely.
I have logged on to the node being scaled down and found no evidence of the prematurely terminated Pod in its logs.
I was able to capture this information once:
kubectl get events | grep test-mypod-b8dfc4665-zp87t
54m Normal Pulling pod/test-mypod-b8dfc4665-zp87t Pulling image ...
54m Normal Pulled pod/test-mypod-b8dfc4665-zp87t Successfully pulled image ...
54m Normal Created pod/test-mypod-b8dfc4665-zp87t Created container mypod
54m Normal Started pod/test-mypod-b8dfc4665-zp87t Started container mypod
23m Normal ScaleDown pod/test-mypod-b8dfc4665-zp87t deleting pod for node scale down
23m Normal Killing pod/test-mypod-b8dfc4665-zp87t Stopping container mypod
13m Warning FailedKillPod pod/test-mypod-b8dfc4665-zp87t error killing pod: failed to "KillContainer" for "mypod" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
I once saw a pod removed for no apparent reason: scale-down was disabled, yet something still decided to remove my pod:
kubectl get events | grep test-mypod-b8dfc4665-vxqhv
45m Normal Pulling pod/test-mypod-b8dfc4665-vxqhv Pulling image ...
45m Normal Pulled pod/test-mypod-b8dfc4665-vxqhv Successfully pulled image ...
45m Normal Created pod/test-mypod-b8dfc4665-vxqhv Created container mypod
45m Normal Started pod/test-mypod-b8dfc4665-vxqhv Started container mypod
40m Normal Killing pod/test-mypod-b8dfc4665-vxqhv Stopping container mypod
This is the Kubernetes version I have:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c49e2", GitCommit:"8c49e2efc3cfbb7788a58025e679787daed22018", GitTreeState:"clean", BuildDate:"2021-10-17T05:13:46Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
To minimize this issue, I deploy a Pod Disruption Budget during peak hours to block scale-down, and in the evening during low demand I remove the PDB, which allows scale-down to proceed. However, that is not the right solution, and even during off-peak hours there are still pods that get stopped prematurely.
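A minimal sketch of such a PDB, assuming the pods carry an app: test-mypod label (on this 1.18 cluster the policy/v1beta1 API still applies):
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: test-mypod-pdb
spec:
  maxUnavailable: 0            # blocks voluntary evictions such as Cluster Autoscaler scale-down
  selector:
    matchLabels:
      app: test-mypod          # placeholder label, adjust to the real Pod labels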
3 Answers
When using Amazon EKS, the Cluster Autoscaler does not honor terminationGracePeriodSeconds beyond 10 minutes. Per
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#does-ca-respect-gracefultermination-in-scale-down
the Cluster Autoscaler only provides a 10 minute grace period during scale-down. I extracted the related FAQ text here:
How fast is Cluster Autoscaler?
By default, scale-up is considered up to 10 seconds after pod is marked as unschedulable, and scale-down 10 minutes after a node becomes unneeded. There are multiple flags which can be used to configure these thresholds. For example, in some environments, you may wish to give the k8s scheduler a bit more time to schedule a pod than the CA's scan-interval. One way to do this is by setting --new-pod-scale-up-delay, which causes the CA to ignore unschedulable pods until they are a certain "age", regardless of the scan-interval. If k8s has not scheduled them by the end of that delay, then they may be considered by the CA for a possible scale-up.
Another relevant link: https://github.com/kubernetes/autoscaler/issues/147
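That 10 minute cap corresponds to the Cluster Autoscaler's --max-graceful-termination-sec flag (600 seconds by default); raising it in the cluster-autoscaler container args would look roughly like the fragment below, although I have not verified that this alone fixes the EKS behavior:
# cluster-autoscaler Deployment container fragment
containers:
  - name: cluster-autoscaler
    image: <CLUSTER AUTOSCALER IMAGE>
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER NAME>
      - --max-graceful-termination-sec=7200   # allow the full 2 hour grace period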
I implemented a script to be invoked as a preStop hook, hoping it would block the next step, the one that issues the SIGTERM and starts the 10 minute countdown, and so give me a chance to shut my service down gracefully. However, the preStop hook does not delay the 10 minute timer (a sketch of the wiring is shown after the references below).
Some references to that setup:
https://www.ithands-on.com/2021/07/kubernetes-101-pods-lifecycle-hooks_30.html
https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
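A sketch of how such a preStop hook is wired into the container spec; the script path is a placeholder, and note that the hook plus the subsequent SIGTERM handling both count against terminationGracePeriodSeconds:
# Pod template fragment: preStop runs before the container receives SIGTERM
containers:
  - name: mypod
    image: <MY IMAGE>
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/scripts/wait-for-job-completion.sh"]   # placeholder script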
Instead, I added the Cluster Autoscaler safe-to-evict annotation to my pod deployment config, per the following reference:
https://aws.github.io/aws-eks-best-practices/cluster-autoscaling/#prevent-scale-down-eviction
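The annotation goes on the pod template metadata, so it ends up on every Pod the Deployment creates:
# Deployment fragment
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"   # tells the Cluster Autoscaler not to evict this Pod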
Then I ensured that my pods are strictly on demand, i.e. no pods sit idle (idle pods block EKS scale-down); they are spawned only when needed and shut down when their task is done. This slows my response time for jobs, but that is a small price to pay relative to shutting down a Pod in the middle of an expensive computation.
In case anyone is curious on how to deploy an AWS Cluster Autoscaler: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html#cluster-autoscaler
It also includes a reference on disabling eviction of Pods.
Under load we are still seeing that the safe-to-evict annotation is not being honored, and we reported this back to AWS. With additional debugging I discovered that the nodes hosting these pods are disappearing even though the Cluster Autoscaler is supposed to leave nodes with safe-to-evict pods alone. There might be an interoperability issue between EKS and EC2. Until this is resolved I am looking into Fargate as an alternative.
We faced the same issue with AWS EKS and cluster-autoscaler: nodes were unexpectedly shut down, no preventive actions were working, and even the node annotation
cluster-autoscaler.kubernetes.io/scale-down-disabled=true
did not make any difference. After two days of troubleshooting, we found the reason: we use multiple Availability Zones in the ASG configuration, which enables an automatic "AZRebalance" process. AZRebalance tries to keep the number of nodes approximately equal across Availability Zones, so sometimes when a scale-up event occurs it rebalances by terminating a node and creating another one in a different Availability Zone; the corresponding message shows up in the ASG activity history.
Cluster-autoscaler does not control this process, so there are two systems (cluster-autoscaler and AWS ASG) that manage the number of nodes simultaneously, which leads to unexpected behavior.
As a workaround, we suspended the "AZRebalance" process in the ASG.
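A minimal boto3 sketch of that workaround (the ASG name is a placeholder):
import boto3

autoscaling = boto3.client('autoscaling')
# Stop the ASG from terminating instances just to rebalance them across AZs.
autoscaling.suspend_processes(
    AutoScalingGroupName='<NODE GROUP ASG NAME>',
    ScalingProcesses=['AZRebalance'],
)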
Another solution would be to use a separate ASG for each Availability Zone and enable the --balance-similar-node-groups feature in the cluster-autoscaler; see the cluster-autoscaler documentation for details.
We worked with Amazon support to solve this issue. The final resolution was not far from @lub0v's answer, but there was still a missing component.
Our EKS cluster had only one node group spanning multiple Availability Zones. Instead, I deployed one node group per Availability Zone. Once we did that, terminationGracePeriodSeconds was honored.
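For illustration, with boto3 a per-AZ managed node group amounts to passing only subnets from a single AZ to each group; the names, subnet, and role below are placeholders:
import boto3

eks = boto3.client('eks')
# One managed node group pinned to a single AZ by giving it only that AZ's subnet(s)
eks.create_nodegroup(
    clusterName='<MY CLUSTER>',
    nodegroupName='workers-us-east-1a',                  # hypothetical per-AZ name
    subnets=['<SUBNET ID IN us-east-1a>'],
    nodeRole='<NODE INSTANCE ROLE ARN>',
    scalingConfig={'minSize': 1, 'desiredSize': 1, 'maxSize': 10},
)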
Also, don't forget the prior answers I added earlier: ensure your pod annotation has safe-to-evict set to false.
Finally, use --balance-similar-node-groups as a cluster-autoscaler command-line parameter if you prefer to have the same number of nodes deployed across the node groups during upscaling. Currently this parameter is not honored during downscaling.
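For reference, it is passed alongside the other cluster-autoscaler flags, for example:
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --balance-similar-node-groups      # keep the per-AZ node groups at similar sizes on scale-up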
Reference on autoscaling: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md