
I’m trying to spread my ingress-nginx-controller pods such that:

  • Each availability zone has the same # of pods (+- 1).
  • Pods prefer Nodes that currently run the least pods.

Following other questions here, I have set up Pod Topology Spread Constraints in my pod deployment:

      replicas: 4
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule

I currently have 2 Nodes, each in a different availability zone:

$ kubectl get nodes --label-columns=topology.kubernetes.io/zone,kubernetes.io/hostname
NAME                            STATUS   ROLES                  AGE    VERSION   ZONE         HOSTNAME
ip-{{node1}}.compute.internal   Ready    node                   136m   v1.20.2   us-west-2a   ip-{{node1}}.compute.internal
ip-{{node2}}.compute.internal   Ready    node                   20h    v1.20.2   us-west-2b   ip-{{node2}}.compute.internal

After running kubectl rollout restart for that deployment, I end up with 3 pods on one Node and 1 pod on the other, i.e. a skew of 2 > 1:

$ kubectl describe pod ingress-nginx-controller -n ingress-nginx | grep 'Node:'
Node:         ip-{{node1}}.compute.internal/{{node1}}
Node:         ip-{{node2}}.compute.internal/{{node2}}
Node:         ip-{{node1}}.compute.internal/{{node1}}
Node:         ip-{{node1}}.compute.internal/{{node1}}

Why is my constraint not respected? How can I debug the pod scheduler?

My kubectl version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.0-beta.0.607+269d62d895c297", GitCommit:"269d62d895c29743931bfaaec6e8d37ced43c35f", GitTreeState:"clean", BuildDate:"2021-03-05T22:28:02Z", GoVersion:"go1.16", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

4 Answers


  1. To give more visibility to the comment:

    A DaemonSet worked and was easy enough. It won’t work for our deployments that run several pods per node, but there are mitigations (the descheduler), and it should resolve itself as the cluster grows.

    Please consider this answer as a workaround:

    A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.

    Kubernetes.io: Docs: Concepts: Workloads: Controllers: Daemonset

    An example could be the following:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nginx
    spec:
      selector:
        matchLabels:
          name: nginx
      template:
        metadata:
          labels:
            name: nginx 
        spec:
          # nodeSelector:
          #   schedule: here
          tolerations:
          # this toleration is to have the daemonset runnable on master nodes
          # remove it if your masters can't run pods
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          containers:
          - name: nginx
            image: nginx
    

    This definition will spawn a Pod on each Node in the cluster. You can further limit the Pod scheduling by specifying a nodeSelector.

    Assuming that you have some controller/logic responsible for labeling the Nodes with a specific label, you can restrict the Pods to those specific Nodes. The part responsible for it is commented out in the above manifest:

    nodeSelector:
      schedule: here
    
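    If you label the Nodes by hand instead of via a controller, the commands could look like this (node names taken from the listing below):

    kubectl label node raven-sgdm schedule=here
    kubectl label node raven-xvvw schedule=here
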

    The nodes (only raven-sgdm and raven-xvvw are labeled):

    NAME         STATUS   ROLES    AGE    VERSION
    raven-6k6m   Ready    <none>   159m   v1.20
    raven-sgdm   Ready    <none>   159m   v1.20
    raven-xvvw   Ready    <none>   159m   v1.20
    

    The DaemonSet:

    NAME    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    nginx   2         2         2       2            2           schedule=here   99m
    


  2. At a certain moment in time the skew will be correct.

    But once the pods that are scheduled for removal are actually removed, the distribution can become skewed again 🙂

    Essentially you are facing the limitation described here:

    Scaling down a Deployment may result in imbalanced Pods distribution.
    

    One simple workaround I apply is: scale down + restart/deploy + scale back up (see the commands below).

    Then the skew is perfectly fine!
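
    For reference, a sketch of that workaround for the Deployment from the question (deployment name and namespace taken from the question) could be:

    # free up room so the scheduler can re-place the pods
    kubectl scale deployment ingress-nginx-controller -n ingress-nginx --replicas=1
    # roll out the new revision and wait for it to complete
    kubectl rollout restart deployment ingress-nginx-controller -n ingress-nginx
    kubectl rollout status deployment ingress-nginx-controller -n ingress-nginx
    # scale back up; the new pods are placed subject to the spread constraints
    kubectl scale deployment ingress-nginx-controller -n ingress-nginx --replicas=4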

  3. kubectl rollout restart spins up new pods and then terminates the old pods once the new pods are up and running.

    As described in the known limitations section of the pod topology spread constraints docs, the constraints are not guaranteed to remain satisfied when pods are removed, and the recommended mitigation is to use the Descheduler, which you already seem to be using based on your comment.
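
    If you are not running the Descheduler yet, a minimal sketch of a policy that targets exactly this situation could look like the following (v1alpha1 policy format; strategy availability depends on the Descheduler release, so check the version you deploy):

    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      # evict pods that violate their topology spread constraints so the
      # scheduler can place them again in a balanced way
      "RemovePodsViolatingTopologySpreadConstraint":
        enabled: true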

  4. As of Kubernetes 1.25, you can use the alpha feature matchLabelKeys to address this. Because an automatically generated pod-template-hash label is added to the pods of each Deployment revision, combining it with your app-specific label lets the scheduler spread only the pods of the current revision and avoids mis-scheduling during rollouts.

    topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          matchLabelKeys:
            - app
            - pod-template-hash
    

    matchLabelKeys is a list of pod label keys to select the pods over
    which spreading will be calculated. The keys are used to lookup values
    from the pod labels, those key-value labels are ANDed with
    labelSelector to select the group of existing pods over which
    spreading will be calculated for the incoming pod. Keys that don’t
    exist in the pod labels will be ignored. A null or empty list means
    only match against the labelSelector.

    With matchLabelKeys, users don’t need to update the pod.spec between
    different revisions. The controller/operator just needs to set
    different values to the same label key for different revisions. The
    scheduler will assume the values automatically based on
    matchLabelKeys. For example, if users use Deployment, they can use the
    label keyed with pod-template-hash, which is added automatically by
    the Deployment controller, to distinguish between different revisions
    in a single Deployment.

    Note: The matchLabelKeys field is an alpha field added in 1.25. You
    have to enable the MatchLabelKeysInPodTopologySpread feature gate in
    order to use it.

    From the docs
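
    Adapted to the Deployment from the question (labels taken from the question; this assumes the MatchLabelKeysInPodTopologySpread feature gate is enabled on the scheduler), the constraints could look like this:

    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
      matchLabelKeys:
      - pod-template-hash
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
      matchLabelKeys:
      - pod-template-hash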
