
I’ve done quite a bit of searching and cannot find anyone who describes a resolution to this problem.

I’m getting intermittent "111: Connection refused" errors on my Kubernetes clusters. About 90% of my requests succeed and the other 10% fail; if you refresh the page, a previously failed request will then succeed. I have 2 different Kubernetes clusters with the exact same setup, and both show the errors.

This looks to be very close to what I am experiencing. I did install my setup onto a new cluster, but the same problem persisted:
Kubernetes ClusterIP intermittent 502 connection refused

Setup

  • Kubernetes Cluster Version: 1.18.12-gke.1206
  • Django Version: 3.1.4
  • Helm to manage kubernetes charts

Cluster Setup

Kubernetes nginx ingress controller that serves web traffic into the cluster:
https://kubernetes.github.io/ingress-nginx/deploy/#gce-gke

From there I have 2 Ingresses defined that route traffic based on the referrer URL.

  1. Stage Ingress
  2. Prod Ingress

Ingress

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: potr-tms-ingress-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/from-to-www-redirect: "true"

# this line below doesn't seem to have an effect
#    nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "100M"
    cert-manager.io/cluster-issuer: "letsencrypt-{{ .Values.environment }}"
spec:
  rules:
    - host: {{ .Values.ingress_host }}
      http:
        paths:
        - path: /
          backend:
            serviceName: potr-tms-service-{{ .Values.environment }}
            servicePort: 8000
  tls:
    - hosts:
      - {{ .Values.ingress_host }}
      - www.{{ .Values.ingress_host }}
      secretName: potr-tms-{{ .Values.environment }}-tls

These ingresses route to 2 services that I have defined for prod and stage:

Service

apiVersion: v1
kind: Service
metadata:
  name: potr-tms-service-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
spec:
  type: ClusterIP
  ports:
    - name: potr-tms-service-{{ .Values.environment }}
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: potr-tms-{{ .Values.environment }}

These 2 services route to deployments that I have for both prod and stage:

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: potr-tms-deployment-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
spec:
  replicas: {{ .Values.deployment_replicas }}
  selector:
    matchLabels:
      app: potr-tms-{{ .Values.environment }}
  strategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        rollme: {{ randAlphaNum 5 | quote }}
      labels:
        app: potr-tms-{{ .Values.environment }}
    spec:
      containers:
      - command: ["gunicorn", "--bind", ":8000", "config.wsgi"]
#      - command: ["python", "manage.py", "runserver", "0.0.0.0:8000"]
        envFrom:
          - secretRef:
              name: potr-tms-secrets-{{ .Values.environment }}
        image: gcr.io/potrtms/potr-tms-{{ .Values.environment }}:latest
        name: potr-tms-{{ .Values.environment }}
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
      restartPolicy: Always
      serviceAccountName: "potr-tms-service-account-{{ .Values.environment }}"
status: {}
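
One thing worth checking: the Deployment above defines no readinessProbe, so during a RollingUpdate (and the `rollme` annotation forces a rollout on every deploy) a new pod is added to the Service's endpoints as soon as its container starts, which can be before gunicorn has actually bound port 8000 — a common source of intermittent connection-refused errors. A minimal sketch, added to the container spec; the `/healthz` path is an assumption, so substitute any URL your Django app serves:

```yaml
# Hypothetical probe; /healthz must be a path the Django app responds to.
# The pod only receives Service traffic once this probe succeeds.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```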

Error
This is the error that I’m seeing inside of my ingress controller logs:

(screenshot of the ingress controller error log showing the intermittent 111 Connection refused errors)

This seems pretty clear: if my deployment pods were failing or showing errors, they would be marked unavailable and the Service would not route traffic to them. To try to debug this I increased my deployment resources and replica counts, though the amount of web traffic to this app is pretty low (~10 users).
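
To put a number on the intermittency, a quick client-side probe helps; a minimal sketch (the URL in the commented usage is a placeholder for your ingress host, not anything from this setup):

```python
import urllib.request


def failure_rate(fetch, attempts):
    """Call fetch() repeatedly and return the fraction of calls that raised."""
    failures = 0
    for _ in range(attempts):
        try:
            fetch()
        except Exception:
            failures += 1
    return failures / attempts


# Hypothetical usage against the app behind the ingress:
# rate = failure_rate(
#     lambda: urllib.request.urlopen("https://example.com/", timeout=5),
#     attempts=100,
# )
# print(f"{rate:.0%} of requests failed")
```

Running this from outside the cluster versus from a pod inside it can also help narrow down whether the failures happen at the ingress controller or at the Service layer.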

What I’ve Tried

  1. Using a completely different ingress controller: https://github.com/kubernetes/ingress-nginx
  2. Increasing deployment resources / replica counts (seems to have no effect)
  3. Installing my whole setup on a brand-new cluster (same results)
  4. Restarting the ingress controller / deleting and reinstalling it
  5. It sounded like this could potentially be a Gunicorn problem. To test, I started my pods with python manage.py runserver; the problem remained.

Update

Raising the pod counts seems to have helped a little bit.

  • deployment replicas: 15
  • cpu request: 200m
  • memory request: 512Mi

Some requests still fail, though.

2 Answers


  1. Chosen as BEST ANSWER

    I was not able to figure out why these connection errors happen, but I did find a workaround that seems to solve the problem for our users.

    Inside your Ingress config, add the annotation:

    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "10"

    I set it to 10 just to make sure it retried, since I was fairly confident our services were working. You could probably get away with 2 or 3.
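
For what it's worth, which failures get retried is governed by a sibling annotation; a sketch, with values mirroring nginx's own proxy_next_upstream keywords (the exact set shown here is illustrative, not from this setup):

```yaml
# Retry the next upstream endpoint on connection errors, timeouts, and 502s,
# up to the limit set by proxy-next-upstream-tries.
nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502"
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "10"
```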

    Here's my full ingress.yaml

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: potr-tms-ingress-{{ .Values.environment }}
      namespace: {{ .Values.environment }}
      labels:
        app: potr-tms-{{ .Values.environment }}
      annotations:
        kubernetes.io/ingress.class: "nginx"
        nginx.ingress.kubernetes.io/ssl-redirect: "true"
        nginx.ingress.kubernetes.io/from-to-www-redirect: "true"
    #    nginx.ingress.kubernetes.io/service-upstream: "true"
        nginx.ingress.kubernetes.io/proxy-body-size: "100M"
        nginx.ingress.kubernetes.io/client-body-buffer-size: "100m"
        nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "1024m"
        nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "10"
        cert-manager.io/cluster-issuer: "letsencrypt-{{ .Values.environment }}"
    spec:
      rules:
        - host: {{ .Values.ingress_host }}
          http:
            paths:
            - path: /
              backend:
                serviceName: potr-tms-service-{{ .Values.environment }}
                servicePort: 8000
      tls:
        - hosts:
          - {{ .Values.ingress_host }}
          - www.{{ .Values.ingress_host }}
          secretName: potr-tms-{{ .Values.environment }}-tls
    

  2. Did you find a solution to this? I am seeing something very similar on a minikube setup.

    In my case, I believe I also see the nginx controller restarting after the 502. The 502 is intermittent; frequently the first access fails, then a reload works.

    The best idea I’ve found so far is to increase the Nginx timeout parameters, but I have not tried that yet. Still trying to search out all the options.
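
If it helps, those timeouts can be set per-Ingress via annotations rather than in the controller's global config; a sketch with illustrative values (seconds), not ones known to fix this issue:

```yaml
# Illustrative values; defaults are much lower for connect (5s) than read/send (60s).
nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
```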
