I am experimenting with GKE cluster upgrades in a 6 nodes (in two node pools) test cluster before I try it on our staging or production cluster. Upgrading when I only had a 12 replicas nginx deployment, the nginx ingress controller and cert-manager (as helm chart) installed took 10 minutes per node pool (3 nodes). I was very satisfied. I decided to try again with something that looks more like our setup. I removed the nginx deploy and added 2 node.js deployments, the following helm charts: mongodb-0.4.27, mcrouter-0.1.0 (as a statefulset), redis-ha-2.0.0, and my own www-redirect-0.0.1 chart (simple nginx which does redirect). The problem seems to be with mcrouter. Once the node starts draining, the status of that node changes to Ready,SchedulingDisabled
(which seems normal) but the following pods remains:
- mcrouter-memcached-0
- fluentd-gcp-v2.0.9-4f87t
- kube-proxy-gke-test-upgrade-cluster-default-pool-74f8edac-wblf
I do not know why those two kube-system pods remains, but that mcrouter is mine and it won’t go quickly enough. If I wait long enough (1 hour+) then it eventually work, I am not sure why. The current node pool (of 3 nodes) started upgrading 2h46 minutes ago and 2 nodes are upgraded, the 3rd one is still upgrading but nothing is moving… I presume it will complete in the next 1-2 hours…
I tried to run the drain command with --ignore-daemonsets --force
but it told me it was already drained.
I tried to delete the pods, but they just come back and the upgrade does not move any faster.
Any thoughts?
Update #1
The mcrouter helm chart was installed like this:
helm install stable/mcrouter --name mcrouter --set controller=statefulset
The statefulsets it created for mcrouter part is:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
labels:
app: mcrouter-mcrouter
chart: mcrouter-0.1.0
heritage: Tiller
release: mcrouter
name: mcrouter-mcrouter
spec:
podManagementPolicy: OrderedReady
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: mcrouter-mcrouter
chart: mcrouter-0.1.0
heritage: Tiller
release: mcrouter
serviceName: mcrouter-mcrouter
template:
metadata:
labels:
app: mcrouter-mcrouter
chart: mcrouter-0.1.0
heritage: Tiller
release: mcrouter
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: mcrouter-mcrouter
release: mcrouter
topologyKey: kubernetes.io/hostname
containers:
- args:
- -p 5000
- --config-file=/etc/mcrouter/config.json
command:
- mcrouter
image: jphalip/mcrouter:0.36.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: mcrouter-port
timeoutSeconds: 5
name: mcrouter-mcrouter
ports:
- containerPort: 5000
name: mcrouter-port
protocol: TCP
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: mcrouter-port
timeoutSeconds: 1
resources:
limits:
cpu: 256m
memory: 512Mi
requests:
cpu: 100m
memory: 128Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/mcrouter
name: config
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: mcrouter-mcrouter
name: config
updateStrategy:
type: OnDelete
and here is the memcached statefulset:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
labels:
app: mcrouter-memcached
chart: memcached-1.2.1
heritage: Tiller
release: mcrouter
name: mcrouter-memcached
spec:
podManagementPolicy: OrderedReady
replicas: 5
revisionHistoryLimit: 10
selector:
matchLabels:
app: mcrouter-memcached
chart: memcached-1.2.1
heritage: Tiller
release: mcrouter
serviceName: mcrouter-memcached
template:
metadata:
labels:
app: mcrouter-memcached
chart: memcached-1.2.1
heritage: Tiller
release: mcrouter
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: mcrouter-memcached
release: mcrouter
topologyKey: kubernetes.io/hostname
containers:
- command:
- memcached
- -m 64
- -o
- modern
- -v
image: memcached:1.4.36-alpine
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: memcache
timeoutSeconds: 5
name: mcrouter-memcached
ports:
- containerPort: 11211
name: memcache
protocol: TCP
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: memcache
timeoutSeconds: 1
resources:
requests:
cpu: 50m
memory: 64Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
updateStrategy:
type: OnDelete
status:
replicas: 0
2
Answers
The problem was a combination of the minAvailable value from a PodDisruptionBudget (that was part of the memcached helm chart which is a dependency of the mcrouter helm chart) and the replicas value for the memcached replicaset. Both were set to 5 and therefore none of them could be deleted during the drain. I tried changing the minAvailable to 4 but PDB are immutable at this time. What I did was remove the helm chart and replace it.
Once that was done, I was able to get the cluster to upgrade normally.
What I should have done (but only thought about it after) was to change the replicas value to 6, this way I would not have needed to delete and replace the whole chart.
Thank you @AntonKostenko for trying to help me finding this issue. This issue also helped me. Thanks to the folks in Slack@Kubernetes, specially to Paris who tried to get my issue more visibility and the volonteers of the Kubernetes Office Hours (which happened to be yesterday, lucky me!) for also taking a look. Finally, thank you to psycotica0 from Kubernetes Canada to also give me some pointers.
That is a bit complex question and I am definitely not sure that it is like how I thinking, but… Let’s try to understand what is happening.
You have an upgrade process and have 6 nodes in the cluster. The system will upgrade it one by one using Drain to remove all workload from the pod.
Drain process itself respecting your settings and number of replicas and desired state of workload has higher priority than the drain of the node itself.
During the drain process, Kubernetes will try to schedule all your workload on resources where scheduling available. Scheduling on a node which system want to drain is disabled, you can see it in its state –
Ready,SchedulingDisabled
.So, Kubernetes scheduler trying to find a right place for your workload on all available nodes. It will wait as long as it needs to place everything you describe in a cluster configuration.
Now the most important thing. You set that you need
replicas: 5
for yourmcrouter-memcached
. It cannot run more than one replica per node because ofpodAntiAffinity
and a node for a running it should have enough resources for that, which is calculated usingresources:
block ofReplicaSet
.So, I think, that your cluster just does not has enough resource for a run new replica of
mcrouter-memcached
on the remaining 5 nodes. As an example, on the last node where a replica of it still not running, you have not enough memory because of other workloads.I think if you will set
replicaset
formcrouter-memcached
to 4, it will solve a problem. Or you can try to use a bit more powerful instances for that workload, or add one more node to the cluster, it also should help.Hope I gave enough explanation of my logic, ask me if something not clear to you. But first please try to solve an issue by provided solution:)