I’m trying to spread my ingress-nginx-controller pods such that:
- Each availability zone has the same # of pods (+- 1).
- Pods prefer Nodes that currently run the least pods.
Following other questions here, I have set up Pod Topology Spread Constraints in my pod deployment:

replicas: 4
topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
I currently have 2 Nodes, each in a different availability zone:
$ kubectl get nodes --label-columns=topology.kubernetes.io/zone,kubernetes.io/hostname
NAME                            STATUS   ROLES   AGE    VERSION   ZONE         HOSTNAME
ip-{{node1}}.compute.internal   Ready    node    136m   v1.20.2   us-west-2a   ip-{{node1}}.compute.internal
ip-{{node2}}.compute.internal   Ready    node    20h    v1.20.2   us-west-2b   ip-{{node2}}.compute.internal
After running kubectl rollout restart for that deployment, I get 3 pods on one Node and 1 pod on the other, which is a skew of 2 > 1:
$ kubectl describe pod ingress-nginx-controller -n ingress-nginx | grep 'Node:'
Node: ip-{{node1}}.compute.internal/{{node1}}
Node: ip-{{node2}}.compute.internal/{{node2}}
Node: ip-{{node1}}.compute.internal/{{node1}}
Node: ip-{{node1}}.compute.internal/{{node1}}
Why is my constraint not respected? How can I debug the pod scheduler?
My kubectl version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.0-beta.0.607+269d62d895c297", GitCommit:"269d62d895c29743931bfaaec6e8d37ced43c35f", GitTreeState:"clean", BuildDate:"2021-03-05T22:28:02Z", GoVersion:"go1.16", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
4 Answers
Giving more visibility on the comment, please consider this answer as a workaround: use a DaemonSet instead of a Deployment. A DaemonSet spawns a Pod on each Node in the cluster, so the pods are evenly spread by definition. You can further limit the Pod scheduling by specifying a nodeSelector. Assuming that you have some controller/logic responsible for tagging the nodes with a specific label, you can then schedule Pods only on those specific Nodes; the part responsible for it is commented out in the DaemonSet manifest. In the original setup, the nodes (raven-sgdm and raven-xvvw) were labeled this way.
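The answer's original DaemonSet manifest did not survive in this copy. A minimal sketch of the approach, assuming a hypothetical node label ingress-ready: "true" (the label name, namespace, and image tag are all assumptions, not from the original answer):

```yaml
# Sketch of the DaemonSet workaround: one ingress-nginx pod per Node.
# The nodeSelector label "ingress-ready" is a placeholder -- substitute
# whatever label your node-tagging logic actually applies.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx
    spec:
      # Uncomment to restrict scheduling to labeled Nodes only:
      # nodeSelector:
      #   ingress-ready: "true"
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.8.1  # illustrative tag
```

With the nodeSelector commented in, labeling a node (e.g. kubectl label node raven-sgdm ingress-ready=true) is what opts it in to running the controller.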
At a certain moment in time the skew will be correct, but once the pods that are to be removed actually terminate, the skew might be skewed again 🙂 Essentially you are facing the limitation described here: the constraints are only guaranteed at scheduling time, not when pods are removed. One simple workaround I apply is: scale it down + restart/deploy + scale it up. Then the skew is perfectly fine!
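A sketch of that workaround against the question's deployment (the namespace and deployment name are taken from the question; these commands need a live cluster):

```shell
# Scale to zero so no old pods remain to count against the skew,
# roll out the new pods, then scale back up. The scheduler now places
# all replicas fresh, satisfying the topology spread constraints.
kubectl -n ingress-nginx scale deployment ingress-nginx-controller --replicas=0
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller
kubectl -n ingress-nginx scale deployment ingress-nginx-controller --replicas=4
```

Note this briefly drops the ingress to zero replicas, so it is only suitable where a short outage is acceptable.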
kubectl rollout restart spins up new pods and then terminates the old pods after all the new pods are up and running. From the pod topology spread constraints known limitations section, constraints don't remain satisfied when pods are removed; the recommended mitigation is now to use the Descheduler, which you already seemed to have been using from your comment.
In Kubernetes 1.25, you can now use the alpha feature matchLabelKeys to resolve this issue. Because an automatically generated pod-template-hash label is added to each pod for each version of a Deployment, listing that key (alongside your app-specific label) makes the scheduler count only the current rollout's pods when calculating skew, preventing the mis-scheduling (see matchLabelKeys in the docs).
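A sketch of how the question's zone constraint could be extended with matchLabelKeys (field placement per the Kubernetes docs; requires 1.25+ with the alpha feature gate enabled):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    # matchLabelKeys adds the incoming pod's value of each listed label to
    # the selector, so old and new ReplicaSets are counted as separate groups
    # during a rollout.
    matchLabelKeys:
      - pod-template-hash
```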