
I am trying to troubleshoot a DNS issue in our Kubernetes v1.19 cluster. There are 3 nodes (1 controller, 2 workers), all running vanilla Ubuntu 20.04, with Calico for networking and MetalLB for inbound load balancing. This is all hosted on premises and has full access to the internet. There is also a proxy server (Traefik) in front of it that terminates SSL for the Kubernetes cluster and other services in the infrastructure.

This issue appeared when I upgraded the Helm chart for the pod that connects to the Redis pod; everything had otherwise been running happily for the past 36 days.

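Since the problem started immediately after a chart upgrade, it may be worth comparing the release revisions before digging into DNS. A minimal sketch, assuming the release is named oct-2020 (guessed from the service names further down; substitute the actual release name):

# List the revisions of the release to see what the upgrade changed
helm history oct-2020

# Compare the values supplied to the previous and current revisions
helm get values oct-2020 --revision 1
helm get values oct-2020 --revision 2
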
The log of one of the pods shows an error (getaddrinfo EAI_AGAIN, a temporary failure in name resolution) indicating that it cannot resolve the Redis service:

2020-11-09 00:00:00 [1] [verbose]:      [Cache] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [Cache] Successfully connected to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Successfully connected to redis.
2020-11-09 00:00:00 [1] [warn]:         Secret key is weak. Please consider lengthening it for better security.
2020-11-09 00:00:00 [1] [verbose]:      [Database] Connecting to database...
2020-11-09 00:00:00 [1] [info]:         [Database] Successfully connected .
2020-11-09 00:00:00 [1] [verbose]:      [Database] Ran 0 migration(s).
2020-11-09 00:00:00 [1] [verbose]:      Sending request for public key.
Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26) {
  errno: -3001,
  code: 'EAI_AGAIN',
  syscall: 'getaddrinfo',
  hostname: 'oct-2020-redis-master'
}
[ioredis] Unhandled error event: Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/node_modules/ioredis/built/redis/index.js:307:37)
    at Object.onceWrapper (events.js:421:28)
    at Socket.emit (events.js:315:20)
    at Socket.EventEmitter.emit (domain.js:486:12)
    at Socket._onTimeout (net.js:483:8)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I have gone through the steps outlined in https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

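For reference, the dnsutils pod used below comes from that guide and can be created with:

kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
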
ubuntu@k8-01:~$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached

command terminated with exit code 1
ubuntu@k8-01:~$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-lfm5t   1/1     Running   17         37d
coredns-f9fd979d6-sw2qp   1/1     Running   18         37d
ubuntu@k8-01:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 10.244.210.238:34288 - 28733 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001300712s
[INFO] 10.244.210.238:44532 - 12032 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001279312s
[INFO] 10.244.210.235:44595 - 65094 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000163001s
[INFO] 10.244.210.235:55945 - 20758 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000141202s
ubuntu@k8-01:~$ kubectl get services --all-namespaces
NAMESPACE     NAME                                               TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
default       oct-2020-api                                       ClusterIP      10.107.89.213    <none>          80/TCP                       37d
default       oct-2020-nginx-ingress-controller                  LoadBalancer   10.110.235.175   192.168.2.150   80:30194/TCP,443:31514/TCP   37d
default       oct-2020-nginx-ingress-default-backend             ClusterIP      10.98.147.246    <none>          80/TCP                       37d
default       oct-2020-redis-headless                            ClusterIP      None             <none>          6379/TCP                     37d
default       oct-2020-redis-master                              ClusterIP      10.109.58.236    <none>          6379/TCP                     37d
default       oct-2020-webclient                                 ClusterIP      10.111.204.251   <none>          80/TCP                       37d
default       kubernetes                                         ClusterIP      10.96.0.1        <none>          443/TCP                      37d
kube-system   coredns                                            NodePort       10.101.104.114   <none>          53:31245/UDP                 15h
kube-system   kube-dns                                           ClusterIP      10.96.0.10       <none>          53/UDP,53/TCP,9153/TCP       37d

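Two things in that output are worth following up on: the CoreDNS pods have restarted 17 and 18 times, and a Service can only answer if it has endpoints. A sketch of both checks (the pod name is taken from the listing above):

# CoreDNS has been restarting; check the previous container's logs for a crash reason
kubectl logs --namespace=kube-system --previous coredns-f9fd979d6-lfm5t

# Verify the kube-dns Service is actually backed by the CoreDNS pod IPs
kubectl get endpoints kube-dns --namespace=kube-system
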
When I exec into the affected pod:

/app # grep "nameserver" /etc/resolv.conf
nameserver 10.96.0.10
/app # nslookup
BusyBox v1.31.1 () multi-call binary.

Usage: nslookup [-type=QUERY_TYPE] [-debug] HOST [DNS_SERVER]

Query DNS about HOST

QUERY_TYPE: soa,ns,a,aaaa,cname,mx,txt,ptr,any
/app # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/app # nslookup oct-20-redis-master
;; connection timed out; no servers could be reached

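Note that the failed ping is not conclusive by itself: 10.96.0.10 is a ClusterIP, and kube-proxy only forwards the ports declared on the Service (53/UDP, 53/TCP and 9153/TCP here), so ICMP to a ClusterIP usually goes unanswered even when DNS is healthy. Also note the lookup above queries oct-20-redis-master, while the Service is named oct-2020-redis-master. A more direct test is the two-argument form of BusyBox nslookup shown in the usage text, which queries a given server explicitly:

nslookup oct-2020-redis-master 10.96.0.10
nslookup oct-2020-redis-master.default.svc.cluster.local 10.96.0.10
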
Any ideas on troubleshooting would be greatly appreciated.

2 Answers


  1. Chosen as BEST ANSWER

    To answer my own question: I deleted the CoreDNS pods and then DNS worked again. The command (for one of the pods) was the following:

    kubectl delete pod coredns-f9fd979d6-sw2qp --namespace=kube-system
    
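    Since CoreDNS runs as a Deployment, an equivalent way to bounce all of its pods at once, rather than deleting them one by one, is a rollout restart:

    kubectl rollout restart deployment coredns --namespace=kube-system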

    This doesn't get to the underlying problem of why this is happening, or why Kubernetes isn't detecting that something is wrong with those pods and recreating them. I am going to keep digging into this and add some instrumentation to the DNS pods to see what is actually causing the problem.

    If anyone has ideas on specific instrumentation to hook up or metrics to look at, that would be appreciated.

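    One low-effort option, as a sketch: CoreDNS will log every query it serves if the log plugin is added to its Corefile, and it already exposes Prometheus metrics on port 9153 (visible on the kube-dns Service above). The [INFO] Reloading lines in the logs earlier show the reload plugin is active, so an edit to the ConfigMap should be picked up without restarting the pods:

    # Add the "log" plugin inside the server block of the Corefile
    kubectl edit configmap coredns --namespace=kube-system

    # .:53 {
    #     log        # <- add this line to log every query CoreDNS serves
    #     errors
    #     health
    #     ...
    # }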

  2. This is how we test DNS:

    Create the Service and StatefulSet below:

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: nginx
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
      labels:
        app: nginx
    spec:
      serviceName: "nginx"
      replicas: 2
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: k8s.gcr.io/nginx-slim:0.8
            ports:
            - containerPort: 80
              name: web
            volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
          volumes:
          - name: www
            emptyDir: {}
    
    
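    Save the manifest and apply it (the filename is arbitrary):

    kubectl apply -f nginx-statefulset.yaml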

    Run the following tests:

    master $ kubectl get po
    NAME      READY     STATUS    RESTARTS   AGE
    web-0     1/1       Running   0          1m
    web-1     1/1       Running   0          1m
    
    master $ kubectl get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   35m
    nginx        ClusterIP   None         <none>        80/TCP    2m
    
    master $ kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm
    If you don't see a command prompt, try pressing enter.
    / # nslookup nginx
    Server:    10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    
    Name:      nginx
    Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
    Address 2: 10.40.0.2 web-1.nginx.default.svc.cluster.local
    / #
    
    
    / # nslookup web-0.nginx
    Server:    10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    
    Name:      web-0.nginx
    Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
    
    
    / # nslookup web-0.nginx.default.svc.cluster.local
    Server:    10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    
    Name:      web-0.nginx.default.svc.cluster.local
    Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
    
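    The same pattern applies to the asker's cluster: oct-2020-redis-headless is a headless Service (CLUSTER-IP None), so it should resolve straight to the Redis pod IP. If these lookups succeed from a fresh busybox pod but still fail from the application pod, the problem is more likely node-local (the CNI or kube-proxy on that node) than CoreDNS itself:

    nslookup oct-2020-redis-headless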