I had set up a working Docker Swarm cluster, but when I came back to it after several months I noticed that nothing works.
While troubleshooting, I found this error:
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: error
NodeID:
Error: error while loading TLS certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: certificate (1 - s3htdkgcv9qifg2jmbpud1gt7) not valid after Sun, 27 Mar 2022 10:27:00 UTC, and it is currently Sun, 19 Jun 2022 04:33:54 UTC: x509: certificate has expired or is not yet valid:
Is Manager: false
Node Address: 10.10.1.10
I have tried what I found online, for example https://stackoverflow.com/a/59086699/5442187:
docker swarm leave
and then tried to rejoin:
docker swarm join-token manager
=>
Error response from daemon: This node is not a swarm manager. Use
"docker swarm init" or "docker swarm join" to connect this node to
swarm and try again.
And
docker swarm join-token worker
=>
Error response from daemon: This node is not a swarm manager. Use
"docker swarm init" or "docker swarm join" to connect this node to
swarm and try again.
How do I rejoin or reclaim this cluster? I would expect this to be possible; otherwise Docker Swarm would be a no-go for production.
3 Answers
Rotate the swarm CA via
docker swarm ca --rotate
The root CA rotation will not be completed until all registered nodes have rotated their TLS certificates. If the rotation does not complete within a reasonable amount of time, try running
docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}} {{.TLSStatus}}'
to see if any nodes are down or otherwise unable to rotate TLS certificates. See https://docs.docker.com/engine/reference/commandline/swarm_ca/
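To see at a glance which nodes are still holding up the rotation, the node list can be filtered. This is only a minimal sketch: it assumes a node that has finished rotating reports a TLS status of "Ready" (other values such as "Needs Rotation" or "Unknown" are treated as pending). The helper name pending_nodes and the | separator are my own choices, not part of Docker.

```shell
#!/bin/sh
# pending_nodes (hypothetical helper): reads "<hostname>|<tls status>" lines
# on stdin and prints the hostnames whose TLS status is not "Ready".
pending_nodes() {
    awk -F '|' '$2 != "Ready" { print $1 }'
}

# Intended use on a manager node (not executed here):
#   docker node ls --format '{{.Hostname}}|{{.TLSStatus}}' | pending_nodes
```

An empty result would suggest that every node has picked up the new certificates.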
Once all managers have left the cluster, I believe it is gone. Before then you could have run the following on one of the managers:
Now that they’ve all left, you can recreate the cluster from scratch:
Once you have a new cluster, on the manager run:
Then run the output of the join-token command above on the other nodes to join to the cluster.
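The commands for these steps are not shown above, so here is a rough sketch of what recreating the cluster could look like. It is an assumption on my part rather than the exact commands from this answer: MANAGER_ADDR is a hypothetical variable, and 10.10.1.10 is simply the node address from the question. The sketch only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to run the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a fresh swarm on this node, then print the command
# that the other nodes need in order to join as workers.
rebuild_swarm() {
    addr="${MANAGER_ADDR:-10.10.1.10}"   # assumed advertise address
    run docker swarm init --advertise-addr "$addr"
    run docker swarm join-token worker
}

rebuild_swarm
```

Note that a freshly initialised swarm starts empty, so services and stacks have to be deployed again.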
There’s a way to recover, without losing the deployed swarm services/stacks.
The error complains that the certificate is "not valid after Sun, 27 Mar 2022 10:27:00 UTC". So we first need to make the certificate valid again, then bring the swarm services back up, and rotate the CA certificate once the swarm is up and running:
Stop the Docker service:
service docker stop
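Set the system date back to the certificate's expiry time, "27 Mar 2022 10:27:00", or preferably somewhat earlier so the certificate is still valid once Docker starts: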
date -s "27 Mar 2022 10:27:00"
Bring up the swarm services:
service docker start
# Check that all the services are up and running
docker stack ls
Rotate the certificate:
docker swarm ca --rotate
Set the system date back to the current time:
date -s "19 Apr 2023 06:34:00"
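The steps above could be wrapped in a small script along these lines. Treat it as a sketch under assumptions: recover_swarm and run are hypothetical helper names, the rollback time of 10:00:00 is just an example a little before the expiry, and it presumes nothing (such as an NTP client) resets the clock in the middle of the procedure. By default it only prints the commands; set DRY_RUN=0 and run it as root to execute them.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 (as root) to run the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_swarm BEFORE_EXPIRY NOW
#   BEFORE_EXPIRY: a time at which the node certificate was still valid
#   NOW:           the real current time, restored as the last step
recover_swarm() {
    before_expiry="$1"
    now="$2"
    run service docker stop
    run date -s "$before_expiry"    # roll the clock back
    run service docker start
    run docker stack ls             # check the stacks are back
    run docker swarm ca --rotate    # issue fresh certificates
    run date -s "$now"              # restore the clock
}

recover_swarm "27 Mar 2022 10:00:00" "19 Apr 2023 06:34:00"
```

The restored time will be off by however long the procedure took, so it is worth letting the usual time synchronisation correct the clock afterwards.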