I have set up a small Rancher 2.4 infrastructure using k3s 1.18.4 with MariaDB as the backend.
Everything ran fine for 25 days, but now I can't log in to Rancher anymore. It's as if the API stopped responding on 443, and containers have been crashing ever since because they can't contact the Kubernetes cluster.
I have an Nginx load balancer on another server that also worked fine for days, but now everything times out:
==> https_lb.log <==
yyyy.yyy.yyy.yyy [18/Aug/2020:07:09:42 +0200] TCP 502 0 0 31.510 "xx.xx.xx.xx:443" "0" "0" "31.508"
==> error.log <==
2020/08/18 07:10:02 [error] 29966#29966: *81 connect() failed (110: Connection timed out) while connecting to upstream, client: yyyy.yyy.yyy.yyy, server: 0.0.0.0:443, upstream: "xx.xx.xx.xx:443", bytes from/to client:0/0, bytes from/to upstream:0/0
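To narrow down whether Nginx or the upstream is at fault, a direct test from the LB host against the upstream port helps (a sketch; xx.xx.xx.xx stands for the real upstream address from the Nginx config):
# Run on the load-balancer host. If this also times out, the problem is on
# the k3s/Rancher side (or a firewall in between), not in the Nginx config.
curl -vk --connect-timeout 5 https://xx.xx.xx.xx/
nc -vz -w 5 xx.xx.xx.xx 443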
The API responds fine on port 6443, for example:
curl https://localhost:6443/ping --insecure
pong
The service seems up as well:
kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 7h53m
kubectl get nodes
NAME STATUS ROLES AGE VERSION
k3sh1 Ready master 32d v1.18.4+k3s1
k3sw4 Ready <none> 28d v1.18.6+k3s1
k3sw3 Ready <none> 28d v1.18.6+k3s1
k3sw2 Ready <none> 28d v1.18.6+k3s1
And of course, everything is now broken, since there are timeouts everywhere, for example:
2020-08-17T22:40:04.421339807+02:00 stderr F ERROR: logging before flag.Parse: E0817 20:40:04.418868 1 reflector.go:251] github.com/kubernetes-incubator/external-storage/lib/controller/controller.go:603: Failed to watch *v1.PersistentVolumeClaim: Get https://10.43.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=15748753&timeoutSeconds=499&watch=true: dial tcp 10.43.0.1:443: connect: connection refused
2020-08-17T22:40:04.421345285+02:00 stderr F ERROR: logging before flag.Parse: E0817 20:40:04.418809 1 reflector.go:251] github.com/kubernetes-incubator/external-storage/lib/controller/controller.go:609: Failed to watch *v1.StorageClass: Get https://10.43.0.1:443/apis/storage.k8s.io/v1/storageclasses?resourceVersion=15748753&timeoutSeconds=381&watch=true: dial tcp 10.43.0.1:443: connect: connection refused
And so on…
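Those refused connections go to the kubernetes Service ClusterIP (10.43.0.1:443), which is only a service-proxy redirect to the apiserver on 6443, so a comparison like this from one of the nodes separates the apiserver from the proxy/CNI layer (just a sketch):
# Run on one of the nodes. Both calls should behave the same when the
# service proxy is healthy.
curl -k --connect-timeout 5 https://10.43.0.1/ping
curl -k --connect-timeout 5 https://localhost:6443/ping
# If only the ClusterIP call fails, the service-proxy iptables rules or the
# CNI are the likely culprits rather than the apiserver itself.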
Can someone point me in the right direction to fix this, please?
Update 1:
List of running pods:
kubectl get pods --all-namespaces -owide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
recette app-test-65c94f999c-g6t56 1/1 Running 0 15d 10.42.1.6 k3sw3 <none> <none>
recette database-proftpd-7c598d6698-mtp5m 1/1 Running 0 3d9h 10.42.2.3 k3sw4 <none> <none>
recette redis-d84785cf5-9t7sj 1/1 Running 0 3d9h 10.42.1.7 k3sw3 <none> <none>
kube-system metrics-server-7566d596c8-tbfpp 1/1 Running 0 32h 10.42.1.15 k3sw3 <none> <none>
cert-manager cert-manager-webhook-746cf468-2t7c7 1/1 Running 0 8h 10.42.0.52 k3sh1 <none> <none>
cert-manager cert-manager-cainjector-5579468649-dj5fj 1/1 Running 0 8h 10.42.0.53 k3sh1 <none> <none>
cert-manager cert-manager-66bbb47c56-t5h6x 1/1 Running 0 8h 10.42.0.54 k3sh1 <none> <none>
kube-system local-path-provisioner-6d59f47c7-4vf2b 1/1 Running 0 8h 10.42.0.55 k3sh1 <none> <none>
kube-system coredns-8655855d6-lf2lt 1/1 Running 0 8h 10.42.0.56 k3sh1 <none> <none>
cattle-system rancher-c5766f5f9-vnrht 1/1 Running 0 8h 10.42.2.7 k3sw4 <none> <none>
cattle-system rancher-c5766f5f9-hqxvc 1/1 Running 0 8h 10.42.3.6 k3sw2 <none> <none>
recette database-7fc89fc4bc-5xr7m 1/1 Running 0 3d9h 10.42.2.4 k3sw4 <none> <none>
cattle-system rancher-c5766f5f9-n8fnm 1/1 Running 0 8h 10.42.0.57 k3sh1 <none> <none>
kube-system traefik-758cd5fc85-2vdqr 1/1 Running 0 8h 10.42.1.18 k3sw3 <none> <none>
cattle-system cattle-node-agent-6lrfg 0/1 CrashLoopBackOff 359 32h some_public_ip k3sw4 <none> <none>
cattle-system cattle-node-agent-gd2mh 0/1 CrashLoopBackOff 181 15h some_other_public_ip k3sw3 <none> <none>
cattle-system cattle-node-agent-67bqb 0/1 CrashLoopBackOff 364 32h some_other_public_ip k3sh1 <none> <none>
cattle-system cattle-node-agent-vvfwm 0/1 Error 361 32h some_other_public_ip k3sw2 <none> <none>
cattle-system cattle-cluster-agent-74b5586858-jnbv2 1/1 Running 100 8h 10.42.1.19 k3sw3 <none> <none>
And in my Rancher pods I can see that they can't talk to each other:
kubectl logs -f -n cattle-system rancher-c5766f5f9-vnrht
2020/08/19 05:18:36 [ERROR] Failed to connect to peer wss://10.42.3.6/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.3.6:443: i/o timeout
2020/08/19 05:18:37 [ERROR] Failed to connect to peer wss://10.42.0.57/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.0.57:443: i/o timeout
2020/08/19 05:18:51 [ERROR] Failed to connect to peer wss://10.42.3.6/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.3.6:443: i/o timeout
2020/08/19 05:18:52 [ERROR] Failed to connect to peer wss://10.42.0.57/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.0.57:443: i/o timeout
2020/08/19 05:19:06 [ERROR] Failed to connect to peer wss://10.42.3.6/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.3.6:443: i/o timeout
2020/08/19 05:19:07 [ERROR] Failed to connect to peer wss://10.42.0.57/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.0.57:443: i/o timeout
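Those peers are pod IPs on other nodes, so the timeouts point at cross-node pod traffic (the flannel VXLAN overlay k3s uses) being dropped. A rough check, assuming curl is available in the Rancher image and using the pods and IPs from the list above:
# From the Rancher pod on k3sw4 (10.42.2.7), try its peer on k3sw2 (10.42.3.6):
kubectl exec -n cattle-system rancher-c5766f5f9-vnrht -- curl -k -m 5 https://10.42.3.6/ping
# Flannel's VXLAN traffic runs over UDP 8472 between nodes; a host firewall
# dropping it produces exactly this kind of i/o timeout.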
Update 2:
I managed to fix the dial errors by resetting all iptables rules on every node and then restarting k3s. What I mean by resetting is roughly the following sketch (it flushes everything, so adapt it to your own firewall setup):
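# Set the default policies back to ACCEPT and flush every table, then let
# k3s recreate its own rules on restart.
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t mangle -F
iptables -F
iptables -X
systemctl restart k3s        # on the server node (k3sh1)
systemctl restart k3s-agent  # on the worker nodes (k3sw2/3/4)
After that, I now have this error that blocks Rancher from starting: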
E0819 06:35:44.274663       7 reflector.go:178] github.com/rancher/steve/pkg/clustercache/controller.go:164: Failed to list *summary.SummarizedObject: conversion webhook for cert-manager.io/v1alpha2, Kind=CertificateRequest failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
E0819 06:35:45.324406       7 reflector.go:178] github.com/rancher/steve/pkg/clustercache/controller.go:164: Failed to list *summary.SummarizedObject: conversion webhook for acme.cert-manager.io/v1alpha2, Kind=Order failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
E0819 06:35:49.022959       7 reflector.go:178] github.com/rancher/steve/pkg/clustercache/controller.go:164: Failed to list *summary.SummarizedObject: conversion webhook for cert-manager.io/v1alpha2, Kind=Certificate failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
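Those messages point at the cert-manager conversion webhook Service; a quick way to confirm it really is gone (the namespace and names are taken from the error itself):
kubectl get svc,pods -n cert-manager
# The conversion webhook is wired into the cert-manager CRDs and targets the
# cert-manager-webhook Service, so listing those kinds keeps failing while
# the Service is missing.
kubectl get crd certificates.cert-manager.io -o yaml | grep -A5 conversion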
2 Answers
Reinstalled cert-manager and issued a self-signed certificate to fix this issue.
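A minimal sketch of that kind of fix, assuming cert-manager is managed with Helm 3 from the jetstack repo; the chart version, issuer name, secret name and hostname below are examples, not values taken from this cluster:
# Reinstall cert-manager (CRDs first, then the chart).
helm uninstall cert-manager -n cert-manager
kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.15.2/cert-manager.crds.yaml
helm repo add jetstack https://charts.jetstack.io && helm repo update
helm install cert-manager jetstack/cert-manager -n cert-manager --version v0.15.2

# Recreate a self-signed issuer and a certificate for the Rancher ingress.
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1alpha2
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: cattle-system
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: tls-rancher-ingress
  namespace: cattle-system
spec:
  secretName: tls-rancher-ingress
  commonName: rancher.example.com   # placeholder hostname
  dnsNames:
  - rancher.example.com             # placeholder hostname
  issuerRef:
    name: selfsigned-issuer
    kind: Issuer
EOF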
Posting this as Community Wiki for better visibility.
K3s is Lightweight Kubernetes, and the situation described here is very similar to kubeadm behavior after a reset: when you use reset, it deletes many configurations and also certificates. As I mentioned in the comments, this issue was caused by missing certificates, since the reset removed them. The solution was to reinstall Cert-Manager and recreate the certificate.
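A quick sanity check after the reinstall, using the objects the error messages referenced (a sketch; it assumes the default app=rancher label from the Rancher chart):
# The webhook Service and the cert-manager custom resources should exist again:
kubectl get svc -n cert-manager cert-manager-webhook
kubectl get certificates,certificaterequests -n cattle-system
# And the Rancher pods should stop logging the conversion-webhook errors:
kubectl -n cattle-system logs -l app=rancher --tail=20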