I’m trying to spread my ingress-nginx-controller pods such that:
- Each availability zone has the same # of pods (+- 1).
- Pods prefer Nodes that currently run the least pods.
Following other questions here, I have set up Pod Topology Spread Constraints in my pod deployment:

replicas: 4
topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
I currently have 2 Nodes, each in a different availability zone:
$ kubectl get nodes --label-columns=topology.kubernetes.io/zone,kubernetes.io/hostname
NAME                            STATUS   ROLES   AGE    VERSION   ZONE         HOSTNAME
ip-{{node1}}.compute.internal   Ready    node    136m   v1.20.2   us-west-2a   ip-{{node1}}.compute.internal
ip-{{node2}}.compute.internal   Ready    node    20h    v1.20.2   us-west-2b   ip-{{node2}}.compute.internal
After running kubectl rollout restart for that deployment, I get 3 pods on one Node and 1 pod on the other, which is a skew of 2 > 1:
$ kubectl describe pod ingress-nginx-controller -n ingress-nginx | grep 'Node:'
Node: ip-{{node1}}.compute.internal/{{node1}}
Node: ip-{{node2}}.compute.internal/{{node2}}
Node: ip-{{node1}}.compute.internal/{{node1}}
Node: ip-{{node1}}.compute.internal/{{node1}}
Why is my constraint not respected? How can I debug the pod scheduler?
My kubectl version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.0-beta.0.607+269d62d895c297", GitCommit:"269d62d895c29743931bfaaec6e8d37ced43c35f", GitTreeState:"clean", BuildDate:"2021-03-05T22:28:02Z", GoVersion:"go1.16", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
4 Answers
Giving more visibility on the comment, please consider this answer as a workaround: use a DaemonSet instead of a Deployment. A DaemonSet spawns a Pod on each Node in the cluster, so the pods are evenly spread by definition. You can further limit the Pod scheduling by specifying a nodeSelector. Assuming that you have some controller/logic responsible for tagging the nodes with a specific label, you can then schedule Pods only on those specific Nodes; the part responsible for it is commented out in the DaemonSet manifest. In the original setup, the nodes (raven-sgdm and raven-xvvw) were labeled this way.
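The answer's original DaemonSet manifest did not survive in this copy. A minimal sketch of the approach, assuming a hypothetical node label ingress-ready: "true" (the label name, namespace, and image tag are all assumptions, not from the original answer):

```yaml
# Sketch of the DaemonSet workaround: one ingress-nginx pod per Node.
# The nodeSelector label "ingress-ready" is a placeholder -- substitute
# whatever label your node-tagging logic actually applies.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx
    spec:
      # Uncomment to restrict scheduling to labeled Nodes only:
      # nodeSelector:
      #   ingress-ready: "true"
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.8.1  # illustrative tag
```

With the nodeSelector commented in, labeling a node (e.g. kubectl label node raven-sgdm ingress-ready=true) is what opts it in to running the controller.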
At a certain moment in time the skew will be correct, but once the pods that are to be removed actually terminate, the skew might be skewed again 🙂 Essentially you are facing the limitation described here: the constraints are only guaranteed at scheduling time, not when pods are removed. One simple workaround I apply is: scale it down + restart/deploy + scale it up. Then the skew is perfectly fine!
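A sketch of that workaround against the question's deployment (the namespace and deployment name are taken from the question; these commands need a live cluster):

```shell
# Scale to zero so no old pods remain to count against the skew,
# roll out the new pods, then scale back up. The scheduler now places
# all replicas fresh, satisfying the topology spread constraints.
kubectl -n ingress-nginx scale deployment ingress-nginx-controller --replicas=0
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller
kubectl -n ingress-nginx scale deployment ingress-nginx-controller --replicas=4
```

Note this briefly drops the ingress to zero replicas, so it is only suitable where a short outage is acceptable.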
kubectl rollout restart spins up new pods and then terminates the old pods after all the new pods are up and running. From the pod topology spread constraints known limitations section, constraints don't remain satisfied when pods are removed; the recommended mitigation is now to use the Descheduler, which you already seemed to have been using from your comment.
In Kubernetes 1.25, you can now use the alpha feature matchLabelKeys to resolve this issue. Because an automatically generated pod-template-hash label is added to each pod for each version of a Deployment, listing that key (alongside your app-specific label) makes the scheduler count only the current rollout's pods when calculating skew, preventing the mis-scheduling (see matchLabelKeys in the docs).
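A sketch of how the question's zone constraint could be extended with matchLabelKeys (field placement per the Kubernetes docs; requires 1.25+ with the alpha feature gate enabled):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    # matchLabelKeys adds the incoming pod's value of each listed label to
    # the selector, so old and new ReplicaSets are counted as separate groups
    # during a rollout.
    matchLabelKeys:
      - pod-template-hash
```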