Redis - Kubernetes DNS Troubleshooting

SkywalkerIsNull
November 9, 2020
190 views
0 votes
2 Answers

I am trying to troubleshoot a DNS issue in our K8 cluster v1.19. There are 3 nodes (1 controller, 2 workers) all running vanilla Ubuntu 20.04 using Calico network with Metallb for inbound load balancing. This is all hosted on premise and has full access to the internet. There is also a proxy server (Traefik) in front of it that is handling the SSL to the K8 cluster and other services in the infrastructure.

This issue happened when I upgraded the helm chart for the pod that was/is connecting to the redis pod, but otherwise had been happy to run for the past 36 days.

In the log of one of the pods it is showing an error that it cannot determine where the redis pod(s) is/are:

2020-11-09 00:00:00 [1] [verbose]:      [Cache] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [Cache] Successfully connected to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Successfully connected to redis.
2020-11-09 00:00:00 [1] [warn]:         Secret key is weak. Please consider lengthening it for better security.
2020-11-09 00:00:00 [1] [verbose]:      [Database] Connecting to database...
2020-11-09 00:00:00 [1] [info]:         [Database] Successfully connected .
2020-11-09 00:00:00 [1] [verbose]:      [Database] Ran 0 migration(s).
2020-11-09 00:00:00 [1] [verbose]:      Sending request for public key.
Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26) {
  errno: -3001,
  code: 'EAI_AGAIN',
  syscall: 'getaddrinfo',
  hostname: 'oct-2020-redis-master'
}
[ioredis] Unhandled error event: Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/node_modules/ioredis/built/redis/index.js:307:37)
    at Object.onceWrapper (events.js:421:28)
    at Socket.emit (events.js:315:20)
    at Socket.EventEmitter.emit (domain.js:486:12)
    at Socket._onTimeout (net.js:483:8)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I have gone through the steps outlined in https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

ubuntu@k8-01:~$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached

command terminated with exit code 1
ubuntu@k8-01:~$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-lfm5t   1/1     Running   17         37d
coredns-f9fd979d6-sw2qp   1/1     Running   18         37d
ubuntu@k8-01:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 10.244.210.238:34288 - 28733 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001300712s
[INFO] 10.244.210.238:44532 - 12032 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001279312s
[INFO] 10.244.210.235:44595 - 65094 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000163001s
[INFO] 10.244.210.235:55945 - 20758 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000141202s
ubuntu@k8-01:~$ kubectl get services --all-namespaces
NAMESPACE     NAME                                               TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
default       oct-2020-api                                       ClusterIP      10.107.89.213    <none>          80/TCP                       37d
default       oct-2020-nginx-ingress-controller                  LoadBalancer   10.110.235.175   192.168.2.150   80:30194/TCP,443:31514/TCP   37d
default       oct-2020-nginx-ingress-default-backend             ClusterIP      10.98.147.246    <none>          80/TCP                       37d
default       oct-2020-redis-headless                            ClusterIP      None             <none>          6379/TCP                     37d
default       oct-2020-redis-master                              ClusterIP      10.109.58.236    <none>          6379/TCP                     37d
default       oct-2020-webclient                                 ClusterIP      10.111.204.251   <none>          80/TCP                       37d
default       kubernetes                                         ClusterIP      10.96.0.1        <none>          443/TCP                      37d
kube-system   coredns                                            NodePort       10.101.104.114   <none>          53:31245/UDP                 15h
kube-system   kube-dns                                           ClusterIP      10.96.0.10       <none>          53/UDP,53/TCP,9153/TCP       37d

When I enter the pod:

/app # grep "nameserver" /etc/resolv.conf
nameserver 10.96.0.10
/app # nslookup
BusyBox v1.31.1 () multi-call binary.

Usage: nslookup [-type=QUERY_TYPE] [-debug] HOST [DNS_SERVER]

Query DNS about HOST

QUERY_TYPE: soa,ns,a,aaaa,cname,mx,txt,ptr,any
/app # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/app # nslookup oct-20-redis-master
;; connection timed out; no servers could be reached

Any ideas on troubleshooting would be greatly appreciated.

Answers

Chosen as BEST ANSWER
- SkywalkerIsNull
- November 10, 2020 at 4:08 am
- 0 votes
0
To answer my own question, I deleted the DNS pods and then it worked again. The command was the following:
```
kubectl delete pod coredns-f9fd979d6-sw2qp --namespace=kube-system
```
This doesn't get to the underlying problem of why this is happening, or why K8 isn't detecting that something is wrong with those pods and recreating them. I am going to keep digging into this and put some more instrumenting on the DNS pods to see what it actually is that is causing this problem.

If anyone has any ideas on instrumenting to hook up or look at specifically, that would be appreciated.

(Edit)

This is how we test dns

Create below deployment

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  labels:
    app: nginx
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
      volumes:
      - name: www
        emptyDir:

Run the below tests

master $ kubectl get po
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          1m
web-1     1/1       Running   0          1m

master $ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   35m
nginx        ClusterIP   None         <none>        80/TCP    2m

master $ kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm
If you don't see a command prompt, try pressing enter.
/ # nslookup nginx
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      nginx
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
Address 2: 10.40.0.2 web-1.nginx.default.svc.cluster.local
/ #


/ # nslookup web-0.nginx
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local


/ # nslookup web-0.nginx.default.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx.default.svc.cluster.local
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local

Please signup or login to give your own answer.

Click here to cancel reply.

Redis – Kubernetes DNS Troubleshooting

Answers