I am trying to troubleshoot a DNS issue in our Kubernetes v1.19 cluster. There are 3 nodes (1 controller, 2 workers), all running vanilla Ubuntu 20.04, with Calico for networking and MetalLB for inbound load balancing. Everything is hosted on premises and has full access to the internet. There is also a proxy server (Traefik) in front of the cluster that handles SSL termination for it and for other services in the infrastructure.
The issue appeared when I upgraded the Helm chart for the pod that connects to the Redis pod; until then everything had been running happily for the past 36 days.
One of the pods is logging an error that it cannot resolve the Redis service name:
2020-11-09 00:00:00 [1] [verbose]: [Cache] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]: [Cache] Successfully connected to redis.
2020-11-09 00:00:00 [1] [verbose]: [PubSub] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]: [PubSub] Successfully connected to redis.
2020-11-09 00:00:00 [1] [warn]: Secret key is weak. Please consider lengthening it for better security.
2020-11-09 00:00:00 [1] [verbose]: [Database] Connecting to database...
2020-11-09 00:00:00 [1] [info]: [Database] Successfully connected .
2020-11-09 00:00:00 [1] [verbose]: [Database] Ran 0 migration(s).
2020-11-09 00:00:00 [1] [verbose]: Sending request for public key.
Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26) {
errno: -3001,
code: 'EAI_AGAIN',
syscall: 'getaddrinfo',
hostname: 'oct-2020-redis-master'
}
[ioredis] Unhandled error event: Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
Error: connect ETIMEDOUT
at Socket.<anonymous> (/app/node_modules/ioredis/built/redis/index.js:307:37)
at Object.onceWrapper (events.js:421:28)
at Socket.emit (events.js:315:20)
at Socket.EventEmitter.emit (domain.js:486:12)
at Socket._onTimeout (net.js:483:8)
at listOnTimeout (internal/timers.js:554:17)
at processTimers (internal/timers.js:497:7) {
errorno: 'ETIMEDOUT',
code: 'ETIMEDOUT',
syscall: 'connect'
}
I have gone through the steps outlined in https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/:
ubuntu@k8-01:~$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached
command terminated with exit code 1
ubuntu@k8-01:~$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-lfm5t 1/1 Running 17 37d
coredns-f9fd979d6-sw2qp 1/1 Running 18 37d
ubuntu@k8-01:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 10.244.210.238:34288 - 28733 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001300712s
[INFO] 10.244.210.238:44532 - 12032 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001279312s
[INFO] 10.244.210.235:44595 - 65094 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000163001s
[INFO] 10.244.210.235:55945 - 20758 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000141202s
ubuntu@k8-01:~$ kubectl get services --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default oct-2020-api ClusterIP 10.107.89.213 <none> 80/TCP 37d
default oct-2020-nginx-ingress-controller LoadBalancer 10.110.235.175 192.168.2.150 80:30194/TCP,443:31514/TCP 37d
default oct-2020-nginx-ingress-default-backend ClusterIP 10.98.147.246 <none> 80/TCP 37d
default oct-2020-redis-headless ClusterIP None <none> 6379/TCP 37d
default oct-2020-redis-master ClusterIP 10.109.58.236 <none> 6379/TCP 37d
default oct-2020-webclient ClusterIP 10.111.204.251 <none> 80/TCP 37d
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 37d
kube-system coredns NodePort 10.101.104.114 <none> 53:31245/UDP 15h
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 37d
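The same guide also says to check that the kube-dns Service has endpoints; if this list is empty, there is nothing behind the Service VIP for kube-proxy to forward DNS traffic to:

kubectl get endpoints kube-dns --namespace=kube-system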
When I exec into the pod:
/app # grep "nameserver" /etc/resolv.conf
nameserver 10.96.0.10
/app # nslookup
BusyBox v1.31.1 () multi-call binary.
Usage: nslookup [-type=QUERY_TYPE] [-debug] HOST [DNS_SERVER]
Query DNS about HOST
QUERY_TYPE: soa,ns,a,aaaa,cname,mx,txt,ptr,any
/app # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/app # nslookup oct-20-redis-master
;; connection timed out; no servers could be reached
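Two notes on the above. First, ping against a ClusterIP is not conclusive on its own: kube-proxy only forwards the ports the Service declares (53 here), so ICMP to the virtual IP often goes unanswered even when DNS is healthy. Second, the debugging guide suggests querying a CoreDNS pod directly by its pod IP, bypassing the 10.96.0.10 Service VIP; if that works while the Service IP times out, the fault is in Service routing (kube-proxy) rather than in CoreDNS itself. For example (the pod IP below is a placeholder, filled in from the first command's output):

# List the CoreDNS pods along with their pod IPs
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o wide

# Query one CoreDNS pod directly; replace <coredns-pod-ip> with an IP from above
kubectl exec -i -t dnsutils -- nslookup kubernetes.default <coredns-pod-ip>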
Any ideas on troubleshooting would be greatly appreciated.
2 Answers
To answer my own question: I deleted the DNS pods and then it worked again. The command was along the lines of the following, deleting by the kube-dns label shown above (the coredns Deployment recreates the pods):
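# Delete the CoreDNS pods by label; the Deployment controller recreates them
kubectl delete pods --namespace=kube-system -l k8s-app=kube-dns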
This doesn't get at the underlying problem of why this is happening, or why Kubernetes isn't detecting that something is wrong with those pods and recreating them. I am going to keep digging and add some instrumentation on the DNS pods to see what is actually causing this.
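A first step will be enabling the CoreDNS log plugin, which prints every query and reply to the pod's stdout. A sketch of the change, assuming the kubeadm-default coredns ConfigMap (the reload plugin, visible in the logs above, picks up the edit without a restart):

# Edit the Corefile that CoreDNS loads (kubeadm stores it in this ConfigMap)
kubectl edit configmap coredns --namespace=kube-system

.:53 {
    errors
    log            # NEW: log every DNS query and response to stdout
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

With that in place, kubectl logs --namespace=kube-system -l k8s-app=kube-dns should show one line per query, which should make it clear whether requests stop arriving (a network/kube-proxy problem) or arrive but go unanswered (a CoreDNS problem).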
If anyone has ideas on other instrumentation to hook up, or specific things to look at, that would be appreciated.
This is how we test DNS.
First, create the test deployment below (a minimal sketch, using the dnsutils test image from the Kubernetes DNS-debugging guide linked in the question):
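apiVersion: apps/v1
kind: Deployment
metadata:
  name: dnsutils
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dnsutils
  template:
    metadata:
      labels:
        app: dnsutils
    spec:
      containers:
      - name: dnsutils
        # test image from the Kubernetes DNS-debugging docs; sleeps so we can exec into it
        image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
        command: ["sleep", "3600"]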
Then run the tests below against it; the first lookup exercises the in-cluster path through the kube-dns Service, the second exercises upstream forwarding:
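# In-cluster name: goes through the kube-dns Service and the kubernetes plugin
kubectl exec -i -t deploy/dnsutils -- nslookup kubernetes.default

# External name: exercises the forward . /etc/resolv.conf upstream path
kubectl exec -i -t deploy/dnsutils -- nslookup kubernetes.io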