
I’m running an EKS cluster with several different managed node groups of Spot instances, and I’m trying to achieve a graceful shutdown for the workloads on those nodes. An ALB balances incoming traffic, and my deployments already have graceful-shutdown settings such as terminationGracePeriodSeconds, a preStop hook, and a readinessProbe:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-{{ .Release.Namespace }}
  namespace: {{ .Release.Namespace }}
  labels:
    app: {{ .Release.Name }}-{{ .Release.Namespace }}
    type: instance
spec:
  selector:
    matchLabels:
      app: {{ .Release.Name }}-{{ .Release.Namespace }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-{{ .Release.Namespace }}
    spec:
      serviceAccountName: {{ .Release.Name }}-sa-{{ .Release.Namespace }}
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: type
                  operator: In
                  values:
                    - instance
            topologyKey: node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "eks.amazonaws.com/nodegroup"
                    operator: In
                    values:
                      - {{ .Values.nodegroup }}
      containers:
        - name: ai-server
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 20 && echo 1"]
          image: {{ .Values.registry }}:{{ .Values.image }}
          command: [ "java" ]
          args:
            - -jar
            - app.jar
          readinessProbe:
            httpGet:
              path: /api/health
              port: 8080
            successThreshold: 1
            periodSeconds: 10
            initialDelaySeconds: 60
            failureThreshold: 2
            timeoutSeconds: 10
          env:
            - name: REDIS_HOST
              value: redis-redis-cluster.{{ .Release.Namespace }}
            - name: REDIS_PORT
              value: "6379"
            - name: REDIS_USER
              value: default
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-redis-cluster
                  key: redis-password
            - name: REDIS_TTL
              value: {{ .Values.redis.ttl | quote }}
          resources:
            requests:
              memory: {{ .Values.resources.requests.memory }}
              cpu: {{ .Values.resources.requests.cpu }}
            limits:
              memory: {{ .Values.resources.limits.memory }}
              cpu: {{ .Values.resources.limits.cpu }}
          ports:
            - name: http
              containerPort: 8080
          imagePullPolicy: Always
      terminationGracePeriodSeconds: 120

This approach gives me zero-downtime rolling updates and scaling up and down, with no problems and no errors on the client side.

Unfortunately, when a Spot node serving pods of the deployment goes down for any reason, such as a rebalance recommendation, clients get the error below:

502 Bad Gateway

It happens because, even when the node is already in the NotReady state and the cluster has received the event about it,

Warning   NodeNotReady   pod/workload-f554999c9-7xkbk  Node is not ready

the pod still stays in the Ready state for some period of time,

workload-f554999c9-7xkbk         1/1     Running   0             64m

and the ALB keeps forwarding requests to that pod, which no longer exists, until the pod finally disappears.

I would appreciate any ideas that help!

2 Answers


  1. (Personal consideration) Spot instances are unpredictable; you cannot fully rely on them, for various cloud-provider reasons.

    1. Your deployment is missing a liveness probe for the pod.
    2. You can dig deeper into node taints. Taints and tolerations work together to ensure that Pods aren’t scheduled onto inappropriate nodes (see the sketch just after this list).
    3. You can dig deeper into Auto Scaling groups. This is probably the way I would go if you haven’t already designed an Auto Scaling group associated with your nodes and target groups.
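
    A minimal sketch of how the two sides fit together (the spot=true key/value here is only an illustrative label, not something EKS sets by itself): taint the Spot nodes, then add a matching toleration to the pod template so that only pods which tolerate the taint are scheduled there.

    # taint a node so that only tolerating pods can be scheduled on it
    kubectl taint nodes <spot-node-name> spot=true:NoSchedule

    # matching toleration in the Deployment's pod template spec
    tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"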

    Tell me what you think

  2. Your issue is likely related to the fact you have not defined a liveness probe. According to the Kubernetes documentation:

    Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.

    Your pod is definitely dead, but is not being checked by Kubernetes for liveness and so the pod removal is not being processed in a timely manner.

    You can define liveness checks using the following snippet (once again from the document linked above):

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: liveness
      name: liveness-exec
    spec:
      containers:
      - name: liveness
        image: registry.k8s.io/busybox
        args:
        - /bin/sh
        - -c
        - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
        livenessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 5
          periodSeconds: 5
    

    Liveness probes are the best way to handle your specific situation (a pod is dead because its node has been removed from service). You can check for liveness using several probe types, including exec (shown above), HTTP, TCP, and gRPC.
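
    For example, a minimal sketch of an HTTP liveness probe for the ai-server container in your Deployment, assuming the /api/health endpoint used by your readinessProbe is also suitable as a liveness check:

    livenessProbe:
      httpGet:
        path: /api/health
        port: 8080
      # give the JVM time to start before the first check, then restart the
      # container only after several consecutive failures
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 3
      timeoutSeconds: 10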

    It is worth noting that your readinessProbe would detect the condition and stop routing traffic to your pods, but only after the detection period has elapsed.

    In your case, this will be

    (periodSeconds * failureThreshold)

    Since you have defined periodSeconds as 10 seconds with a failureThreshold of 2, your system will only detect the pod's unavailability after 20 seconds at the earliest. If you expect high traffic and need pods to be removed from service quickly, consider shortening periodSeconds (at least).
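
    For instance, a hedged sketch of a tighter readinessProbe, based on the one already in your Deployment, that would cut the worst-case detection window to roughly 10 seconds at the cost of more frequent checks:

    readinessProbe:
      httpGet:
        path: /api/health
        port: 8080
      initialDelaySeconds: 60
      # 5s * 2 failures = ~10s to mark the pod NotReady and stop routing to it
      periodSeconds: 5
      failureThreshold: 2
      successThreshold: 1
      timeoutSeconds: 3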

    Note that depending on how many pods you have, this may put additional load on your Nodes as the Kubelet is doing more work to check the state of your pods. For small numbers of pods this should be fine.
