
I have a multi-container pod running on AWS EKS: one web app container listening on port 80 and a Redis container listening on port 6379.

Once the deployment goes through, manual curl requests against the pod's IP:port from within the cluster always get good responses.
Traffic through the ingress and the service is fine as well.

However, the kubelet's probes are failing, which leads to restarts, and I haven't been able to reproduce the probe failure or fix it yet.
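
As a rough sketch, replicating the kubelet's probe by hand would look something like the following, ideally run from the node hosting the pod (the pod IP is taken from the events below, and the 1-second deadline matches timeoutSeconds: 1 in the deployment):

# Mimic the kubelet's HTTP probe: hard 1-second deadline on the whole request
curl -v --max-time 1 http://10.10.14.199:80/

# Plain manual check with no deadline, which is what usually succeeds
curl -v http://10.10.14.199:80/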

Thanks for reading!

Here are the events:

0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Normal    Killing                  pod/app-7cddfb865b-gsvbg                                   Container app failed liveness probe, will be restarted
0s          Normal    Pulling                  pod/app-7cddfb865b-gsvbg                                   Pulling image "registry/app:latest"
0s          Normal    Pulled                   pod/app-7cddfb865b-gsvbg                                   Successfully pulled image "registry/app:latest"
0s          Normal    Created                  pod/app-7cddfb865b-gsvbg                                   Created container app

With names made generic, this is my deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "16"
  creationTimestamp: "2021-05-26T22:01:19Z"
  generation: 19
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "234691173"
  selfLink: /apis/apps/v1/namespaces/default/deployments/app
  uid: 3149acc2-031e-4719-89e6-abafb0bcdc3c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: app
      release: app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2021-09-17T09:04:49-07:00"
      creationTimestamp: null
      labels:
        app: app
        environment: production
        owner: acme
        release: app
    spec:
      containers:
        - image: redis:5.0.6-alpine
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - containerPort: 6379
              hostPort: 6379
              name: redis
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 500Mi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - env:
            - name: SYSTEM_ENVIRONMENT
              value: production
          envFrom:
            - configMapRef:
                name: app-production
            - secretRef:
                name: app-production
          image: registry/app:latest
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 20
            successThreshold: 1
            timeoutSeconds: 1
          name: app
          ports:
            - containerPort: 80
              hostPort: 80
              name: app
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: "1"
              memory: 500Mi
            requests:
              cpu: "1"
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      priorityClassName: critical-app
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: "2021-08-10T17:34:18Z"
      lastUpdateTime: "2021-08-10T17:34:18Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-05-26T22:01:19Z"
      lastUpdateTime: "2021-09-17T16:48:54Z"
      message: ReplicaSet "app-7f7cb8fd4" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
  observedGeneration: 19
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

This is my service yaml:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-05T20:11:33Z"
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "163989104"
  selfLink: /api/v1/namespaces/default/services/app
  uid: 1f54cd2f-b978-485e-a1af-984ffeeb7db0
spec:
  clusterIP: 172.20.184.161
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 32648
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: app
    release: app
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

Update 10/20/2021:

So I took the advice to tinker with the readiness probe and used these generous settings:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10

These are the events:

5m21s       Normal    Scheduled                pod/app-686494b58b-6cjsq                                   Successfully assigned default/app-686494b58b-6cjsq to ip-10-10-14-127.compute.internal
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container redis
5m20s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container redis
5m20s       Normal    Pulling                  pod/app-686494b58b-6cjsq                                   Pulling image "registry/app:latest"
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Successfully pulled image "registry/app:latest"
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container app
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Container image "redis:5.0.6-alpine" already present on machine
5m19s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container app
0s          Warning   Unhealthy                pod/app-686494b58b-6cjsq                                   Readiness probe failed: Get http://10.10.14.117:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Oddly, the readiness probe does seem to kick in and pass once I actually request the health check page (the root page) manually. Be that as it may, the containers themselves are running fine, so the probe failure is being caused by something else.
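
To narrow this down, a quick sketch like the one below, run from another pod in the cluster, could show whether the first request after an idle period takes longer than the probe timeout (the IP is the pod IP from the event above):

# Print the status code and total response time for the health check page;
# run it several times, including right after the pod has been idle for a while.
curl -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' http://10.10.14.117:80/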

2 Answers


  1. Let's go over your probes so you can understand what is going on and find a way to fix it:

    
    ### The readiness probe decides whether the container is ready
    ### to receive traffic; the liveness probe decides whether the
    ### container should be restarted. Since both use the same endpoint
    ### and both are failing, start with the readinessProbe first.

    livenessProbe:
      ### How many consecutive failures before the container is restarted.
      ### Try increasing this number, and once the pod stops being
      ### restarted, reduce it back to a lower value.
      failureThreshold: 3
      httpGet:
        path: /
        port: 80
        scheme: HTTP
      ### Delay before the first probe is executed.
      ### As before, try increasing the delay and reduce it
      ### back once you have figured out the correct value.
      initialDelaySeconds: 90
      ### How often (in seconds) to perform the probe.
      periodSeconds: 20
      successThreshold: 1
      ### Number of seconds after which the probe times out.
      ### Since the value is 1, I assume you did not change it.
      ### Same as before: increase the value until you find out
      ### how long the endpoint actually needs to respond.
      timeoutSeconds: 1

    ### Same comments as above, including initialDelaySeconds.
    ### The readiness probe is what marks the container as ready
    ### to receive traffic.
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 80
        scheme: HTTP
      ### Again, nothing new here: increase the values and then
      ### reduce them until you find what works for this probe.
      initialDelaySeconds: 90
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1




    View the logs/events

    • If you are not sure that the probes are the root cause, view the container logs and the pod events to figure out what is actually causing those failures, for example with the commands sketched below.
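
    A sketch of the standard commands for that, using the pod name from the question's events as a placeholder:

    # Events recorded for the pod, including probe failures and restarts
    kubectl describe pod app-7cddfb865b-gsvbg

    # Logs of the app container from the previous, restarted instance
    kubectl logs app-7cddfb865b-gsvbg -c app --previous

    # Recent events in the namespace, sorted by time
    kubectl get events --sort-by=.metadata.creationTimestamp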