I’m running an EKS cluster with several managed node groups of Spot instances, and I’m trying to achieve graceful shutdown for the workloads on those nodes. Incoming traffic is balanced by an ALB. My deployments already have graceful-shutdown attributes such as terminationGracePeriodSeconds, preStop, and readinessProbe:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-{{ .Release.Namespace }}
  namespace: {{ .Release.Namespace }}
  labels:
    app: {{ .Release.Name }}-{{ .Release.Namespace }}
    type: instance
spec:
  selector:
    matchLabels:
      app: {{ .Release.Name }}-{{ .Release.Namespace }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-{{ .Release.Namespace }}
    spec:
      serviceAccountName: {{ .Release.Name }}-sa-{{ .Release.Namespace }}
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: type
                    operator: In
                    values:
                      - instance
              topologyKey: node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "eks.amazonaws.com/nodegroup"
                    operator: In
                    values:
                      - {{ .Values.nodegroup }}
      containers:
        - name: ai-server
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 20 && echo 1" ]
          image: {{ .Values.registry }}:{{ .Values.image }}
          command: [ "java" ]
          args:
            - -jar
            - app.jar
          readinessProbe:
            httpGet:
              path: /api/health
              port: 8080
            successThreshold: 1
            periodSeconds: 10
            initialDelaySeconds: 60
            failureThreshold: 2
            timeoutSeconds: 10
          env:
            - name: REDIS_HOST
              value: redis-redis-cluster.{{ .Release.Namespace }}
            - name: REDIS_PORT
              value: "6379"
            - name: REDIS_USER
              value: default
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-redis-cluster
                  key: redis-password
            - name: REDIS_TTL
              value: {{ .Values.redis.ttl }}
          resources:
            requests:
              memory: {{ .Values.resources.requests.memory }}
              cpu: {{ .Values.resources.requests.cpu }}
            limits:
              memory: {{ .Values.resources.limits.memory }}
              cpu: {{ .Values.resources.limits.cpu }}
          ports:
            - name: http
              containerPort: 8080
          imagePullPolicy: Always
      terminationGracePeriodSeconds: 120
This approach gives me zero-downtime updates and scaling (up and down) without any problems and without any errors on the client side.
Unfortunately, when a Spot node serving pods of the deployment goes down for any reason (such as a rebalance recommendation), clients get the error below:
502 Bad Gateway
This happens because, even when the node is already in the NotReady state and the cluster has received the corresponding event,
Warning NodeNotReady pod/workload-f554999c9-7xkbk Node is not ready
the pod remains in the READY state for some period of time,
workload-f554999c9-7xkbk 1/1 Running 0 64m
and the ALB keeps forwarding requests to that pod, which no longer exists, until the pod finally disappears.
I would appreciate any ideas that could help!
Answers:
(Personal consideration) Spot instances are unpredictable; you cannot rely on them, for various cloud-provider reasons. Tell me what you think.
Your issue is likely related to the fact that you have not defined a liveness probe. According to the Kubernetes documentation, the kubelet uses liveness probes to know when to restart a container. Your pod is definitely dead, but it is not being checked by Kubernetes for liveness, so the pod's removal is not being processed in a timely manner.
You can define liveness checks using a snippet along the lines of the one in the documentation linked above:
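As a minimal sketch (the cat /tmp/healthy command and the timings are illustrative placeholders following the docs example, not values taken from your manifest), placed under the ai-server container next to the readinessProbe:

# Illustrative exec-based liveness probe; assumes the app maintains a /tmp/healthy marker file.
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  initialDelaySeconds: 60   # give the JVM the same warm-up time as the readinessProbe
  periodSeconds: 10
  failureThreshold: 2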
Liveness probes are the best way to handle your specific situation (a pod is dead because its node has been removed from service). You can check for liveness using several probe types, including exec (shown above), HTTP, TCP, and gRPC queries.
It is worth noting that your readinessProbe would also detect the condition and stop routing traffic to your pods, but only after the detection period has elapsed. In your case that period is periodSeconds * failureThreshold, and since you have defined 10 seconds with a threshold of 2, your system will detect the pod's unavailability after 20 seconds at the earliest. If you expect high traffic and need pods to be removed from service quickly, consider shortening your periodSeconds (at least).
Note that depending on how many pods you have, this may put additional load on your nodes, as the kubelet does more work to check their state. For small numbers of pods this should be fine.
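As an illustration only (these numbers are not from the answer above, just an example of a tighter detection window), a faster readinessProbe for the same /api/health endpoint could look like this:

readinessProbe:
  httpGet:
    path: /api/health
    port: 8080
  initialDelaySeconds: 60
  successThreshold: 1
  periodSeconds: 3        # probe more often than the original 10s
  failureThreshold: 2     # detection window ~ periodSeconds * failureThreshold = 6s
  timeoutSeconds: 2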