
I have set up a small Rancher 2.4 infrastructure using k3s 1.18.4 and MariaDB as the backend.
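For context, the setup looks roughly like this (a sketch from memory; the version pin, database DSN, hostnames and chart options below are placeholders, not the exact values used):

# k3s server pointed at an external MariaDB datastore (credentials and host are placeholders)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.18.4+k3s1" sh -s - server \
  --datastore-endpoint="mysql://k3s:password@tcp(mariadb.example.com:3306)/k3s"

# Rancher 2.4 installed on top via its Helm chart (hostname is a placeholder)
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
kubectl create namespace cattle-system
helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com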

Everything ran fine for 25 days, but now I can't log in to Rancher anymore. It looks like the API is no longer responding on 443, and containers have started crashing because they can't contact the Kubernetes cluster.

I have an Nginx load balancer on another server that also worked fine for days, but now everything times out:

==> https_lb.log <==

yyyy.yyy.yyy.yyy [18/Aug/2020:07:09:42 +0200] TCP 502 0 0 31.510 "xx.xx.xx.xx:443" "0" "0" "31.508"

==> error.log <==

2020/08/18 07:10:02 [error] 29966#29966: *81 connect() failed (110: Connection timed out) while connecting to upstream, client: yyyy.yyy.yyy.yyy, server: 0.0.0.0:443, upstream: "xx.xx.xx.xx:443", bytes from/to client:0/0, bytes from/to upstream:0/0
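To narrow down whether the timeouts come from the load balancer or from the nodes themselves, one quick check is to hit port 443 both through the LB and directly against a node (the addresses below are placeholders):

# Through the load balancer
curl -kv --connect-timeout 10 https://<lb_ip>/

# Directly against the node the LB forwards to
curl -kv --connect-timeout 10 https://<node_ip>:443/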

The API answers fine on port 6443, for example:

curl https://localhost:6443/ping --insecure
pong

The service seems up as well:

kubectl get services
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.43.0.1    <none>        443/TCP   7h53m

kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
k3sh1   Ready    master   32d   v1.18.4+k3s1
k3sw4   Ready    <none>   28d   v1.18.6+k3s1
k3sw3   Ready    <none>   28d   v1.18.6+k3s1
k3sw2   Ready    <none>   28d   v1.18.6+k3s1

And of course, everything is now broken, since there are timeouts everywhere, for example:

2020-08-17T22:40:04.421339807+02:00 stderr F ERROR: logging before flag.Parse: E0817 20:40:04.418868       1 reflector.go:251] github.com/kubernetes-incubator/external-storage/lib/controller/controller.go:603: Failed to watch *v1.PersistentVolumeClaim: Get https://10.43.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=15748753&timeoutSeconds=499&watch=true: dial tcp 10.43.0.1:443: connect: connection refused
2020-08-17T22:40:04.421345285+02:00 stderr F ERROR: logging before flag.Parse: E0817 20:40:04.418809       1 reflector.go:251] github.com/kubernetes-incubator/external-storage/lib/controller/controller.go:609: Failed to watch *v1.StorageClass: Get https://10.43.0.1:443/apis/storage.k8s.io/v1/storageclasses?resourceVersion=15748753&timeoutSeconds=381&watch=true: dial tcp 10.43.0.1:443: connect: connection refused

And so on…

Can someone point me in the right direction to fix this, please?

Update 1:

List of running pods:

kubectl get pods --all-namespaces -owide
NAMESPACE       NAME                                       READY   STATUS             RESTARTS   AGE    IP               NODE    NOMINATED NODE   READINESS GATES
recette         app-test-65c94f999c-g6t56                  1/1     Running            0          15d    10.42.1.6        k3sw3   <none>           <none>
recette         database-proftpd-7c598d6698-mtp5m          1/1     Running            0          3d9h   10.42.2.3        k3sw4   <none>           <none>
recette         redis-d84785cf5-9t7sj                      1/1     Running            0          3d9h   10.42.1.7        k3sw3   <none>           <none>
kube-system     metrics-server-7566d596c8-tbfpp            1/1     Running            0          32h    10.42.1.15       k3sw3   <none>           <none>
cert-manager    cert-manager-webhook-746cf468-2t7c7        1/1     Running            0          8h     10.42.0.52       k3sh1   <none>           <none>
cert-manager    cert-manager-cainjector-5579468649-dj5fj   1/1     Running            0          8h     10.42.0.53       k3sh1   <none>           <none>
cert-manager    cert-manager-66bbb47c56-t5h6x              1/1     Running            0          8h     10.42.0.54       k3sh1   <none>           <none>
kube-system     local-path-provisioner-6d59f47c7-4vf2b     1/1     Running            0          8h     10.42.0.55       k3sh1   <none>           <none>
kube-system     coredns-8655855d6-lf2lt                    1/1     Running            0          8h     10.42.0.56       k3sh1   <none>           <none>
cattle-system   rancher-c5766f5f9-vnrht                    1/1     Running            0          8h     10.42.2.7        k3sw4   <none>           <none>
cattle-system   rancher-c5766f5f9-hqxvc                    1/1     Running            0          8h     10.42.3.6        k3sw2   <none>           <none>
recette         database-7fc89fc4bc-5xr7m                  1/1     Running            0          3d9h   10.42.2.4        k3sw4   <none>           <none>
cattle-system   rancher-c5766f5f9-n8fnm                    1/1     Running            0          8h     10.42.0.57       k3sh1   <none>           <none>
kube-system     traefik-758cd5fc85-2vdqr                   1/1     Running            0          8h     10.42.1.18       k3sw3   <none>           <none>
cattle-system   cattle-node-agent-6lrfg                    0/1     CrashLoopBackOff   359        32h    some_public_ip     k3sw4   <none>           <none>
cattle-system   cattle-node-agent-gd2mh                    0/1     CrashLoopBackOff   181        15h    some_other_public_ip   k3sw3   <none>           <none>
cattle-system   cattle-node-agent-67bqb                    0/1     CrashLoopBackOff   364        32h    some_other_public_ip     k3sh1   <none>           <none>
cattle-system   cattle-node-agent-vvfwm                    0/1     Error              361        32h    some_other_public_ip   k3sw2   <none>           <none>
cattle-system   cattle-cluster-agent-74b5586858-jnbv2      1/1     Running            100        8h     10.42.1.19       k3sw3   <none>           <none>

And in my Rancher pods I can see they can't talk to each other:

kubectl logs -f -n cattle-system rancher-c5766f5f9-vnrht
2020/08/19 05:18:36 [ERROR] Failed to connect to peer wss://10.42.3.6/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.3.6:443: i/o timeout
2020/08/19 05:18:37 [ERROR] Failed to connect to peer wss://10.42.0.57/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.0.57:443: i/o timeout
2020/08/19 05:18:51 [ERROR] Failed to connect to peer wss://10.42.3.6/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.3.6:443: i/o timeout
2020/08/19 05:18:52 [ERROR] Failed to connect to peer wss://10.42.0.57/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.0.57:443: i/o timeout
2020/08/19 05:19:06 [ERROR] Failed to connect to peer wss://10.42.3.6/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.3.6:443: i/o timeout
2020/08/19 05:19:07 [ERROR] Failed to connect to peer wss://10.42.0.57/v3/connect [local ID=10.42.2.7]: dial tcp 10.42.0.57:443: i/o timeout
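The peers it cannot reach are pod IPs on other nodes, so this points at the overlay network between nodes rather than at Rancher itself. A quick way to check that from a node (the pod IP below is one of the peers from the log above, and k3s' default flannel backend uses VXLAN on UDP 8472):

# From the node hosting 10.42.2.7, try to reach the peer pod running on another node
ping -c 3 10.42.3.6

# Look for firewall rules that could be dropping or rejecting the flannel/VXLAN traffic
sudo iptables -S | grep -i -E 'drop|reject'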

Update 2:

I managed to fix the dial errors by resetting all iptables rules and then restarting k3s (roughly the commands sketched after the logs below). Now Rancher is blocked from starting by this error:

E0819 06:35:44.274663       7 reflector.go:178] github.com/rancher/steve/pkg/clustercache/controller.go:164: Failed to list *summary.SummarizedObject: conversion webhook for cert-manager.io/v1alpha2, Kind=CertificateRequest failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
E0819 06:35:45.324406       7 reflector.go:178] github.com/rancher/steve/pkg/clustercache/controller.go:164: Failed to list *summary.SummarizedObject: conversion webhook for acme.cert-manager.io/v1alpha2, Kind=Order failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
E0819 06:35:49.022959       7 reflector.go:178] github.com/rancher/steve/pkg/clustercache/controller.go:164: Failed to list *summary.SummarizedObject: conversion webhook for cert-manager.io/v1alpha2, Kind=Certificate failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
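For reference, the iptables reset mentioned above was roughly the following (a sketch: flushing removes every existing firewall rule, not only the k3s ones, and on agent nodes the service to restart is k3s-agent rather than k3s):

# Flush all iptables rules and chains
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X

# Restart k3s so it recreates its own chains (use k3s-agent on worker nodes)
sudo systemctl restart k3s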


Answers


  1. Chosen as BEST ANSWER

    Reinstalled cert-manager and issued a self-signed certificate to fix this issue.
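    A rough sketch of what that amounted to (the cert-manager version, hostname, namespace and resource names below are assumptions, not the exact ones used):

# Remove the broken install and re-apply the cert-manager manifest (version is illustrative)
kubectl delete namespace cert-manager
kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.16.1/cert-manager.yaml

# Recreate a self-signed issuer and certificate (names, namespace, secret and hostname are illustrative)
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: rancher-selfsigned
  namespace: cattle-system
spec:
  secretName: tls-rancher-ingress
  commonName: rancher.example.com
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
EOF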


  2. Posting this as Community Wiki for better visibility.

    K3s is a lightweight Kubernetes distribution, and the situation described here is very similar to kubeadm behavior after a reset.

    kubeadm reset is responsible for cleaning up a node local file system from files that were created using the kubeadm init or kubeadm join commands. For control-plane nodes reset also removes the local stacked etcd member of this node from the etcd cluster and also removes this node’s information from the kubeadm ClusterStatus object. ClusterStatus is a kubeadm managed Kubernetes API object that holds a list of kube-apiserver endpoints.

    Meaning, when you use reset it deletes many configurations and also certificates. As I mentioned in the comments, this issue was related to missing certificates:

    Failed to list *summary.SummarizedObject: conversion webhook for acme.cert-manager.io/v1alpha2, Kind=Order failed: Post https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s: service "cert-manager-webhook" not found
    

    as the reset command removed them.

    The solution was to reinstall cert-manager and recreate the certificate.
