I have installed NVIDIA's GPU Operator, and my GPU-enabled node has been automatically labelled (this is the label I treat as important; a long list of other labels is there as well):
nvidia.com/gpu.count=1
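For reference, one way to confirm which nodes carry that label (a minimal check, nothing here is specific to my setup):

# -L adds a column with the value of the given label for each node
kubectl get nodes -L nvidia.com/gpu.count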
The node is seemingly schedulable:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 10 Sep 2024 15:05:17 +0000 Tue, 10 Sep 2024 15:05:17 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 10 Sep 2024 16:26:50 +0000 Tue, 10 Sep 2024 15:05:04 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 10 Sep 2024 16:26:50 +0000 Tue, 10 Sep 2024 15:05:04 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 10 Sep 2024 16:26:50 +0000 Tue, 10 Sep 2024 15:05:04 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 10 Sep 2024 16:26:50 +0000 Tue, 10 Sep 2024 15:05:04 +0000 KubeletReady kubelet is posting ready status
The node also reports as Ready in "kubectl get nodes".
However, when I look at the demo workload, I see:
`Warning FailedScheduling 11s (x17 over 79m) default-scheduler 0/6 nodes are available: 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.`
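As I understand it, the "Insufficient nvidia.com/gpu" part is about the node's advertised resources rather than its labels: the scheduler matches the pod's resource requests against the node's capacity/allocatable. A quick check of what the node actually exposes (the node name here is a placeholder):

# nvidia.com/gpu should appear under Allocatable with a non-zero count
# once the device plugin is advertising the GPU
kubectl describe node gpu-node-1 | grep -A 10 Allocatable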
I have even tried to manually label the node with nvidia.com/gpu=1, with no luck so far.
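That is presumably because the demo workload requests the GPU as a resource limit, which a label cannot satisfy. A sketch of what such a workload typically looks like (based on the sample pod in the NVIDIA guide, not my exact manifest; the image tag is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    # illustrative image; use the sample image from the NVIDIA guide
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1   # the extended resource the scheduler checks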
I have followed the guide from NVIDIA: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html.
The deviation from the automatic deployment is that I installed the driver (v550) manually, as NVIDIA hasn't published driver images for Ubuntu 24.
I see output from nvidia-smi, which should essentially be expected, given that the node is being labelled by the operator.
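A quick way to see whether all of the operator's components (device plugin, validator, and so on) are actually healthy, assuming the default gpu-operator namespace:

kubectl get pods -n gpu-operator -o wide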
Kubernetes v1.31.0
Anything else I am missing?
I have tried manually labelling the node and re-creating the pod. The expectation is to see the pod scheduled.
2 Answers
Well, it's embarrassing, but I somehow overlooked the failing nvidia-operator-validator pod. Would anybody believe "I bet it was running"? Anyway, looking at the pod's logs or description does not give any information. But going onto the worker node where the container is scheduled (the one with the GPU) and running sudo crictl ps -a shows the driver-validation container with an increasing fail counter. Its logs are actually useful: besides (in my case) successfully executing nvidia-smi, they give the answer.
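For anyone digging in the same place, the commands on the GPU node were along these lines (assuming containerd with crictl installed; the container ID is of course specific to your node):

sudo crictl ps -a | grep driver-validation   # find the failing container and its ID
sudo crictl logs <container-id>              # the actual, useful error message lives here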
In my case the answer pointed at disabling dev-char symlink creation (DISABLE_DEV_CHAR_SYMLINK_CREATION) via the ClusterPolicy. I wasn't savvy enough to figure out where to put that ClusterPolicy change, but reinstalling gpu-operator with the following command saved the day:
helm install --wait gpu-operator-1 -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION --set-string validator.driver.env[0].value=true
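If I read the chart values correctly, the same environment variable should map to spec.validator.driver.env on the ClusterPolicy object the operator creates (usually named cluster-policy), so a patch along these lines might be an alternative to reinstalling; treat the resource name and field path as assumptions rather than verified facts:

# merge-patch the validator's driver env on the existing ClusterPolicy (hypothetical field path)
kubectl patch clusterpolicy cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'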
In our case we are using Airflow KubernetesPodOperator tasks with the following affinity (see the sketch below):
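The original affinity snippet isn't included above; a typical sketch that targets GPU nodes via a GPU Operator label might look like this (the nvidia.com/gpu.present key is an assumption, not necessarily the label used in the original answer):

# hypothetical node affinity keyed on a label the GPU Operator applies
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.present
          operator: In
          values: ["true"]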
Hence, our nodes needed this label:
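Again, the exact label from the original answer isn't shown; if the affinity sketched above were used, the nodes would need the matching label, for example:

# placeholder node name; with the GPU Operator running, GPU Feature Discovery
# should normally apply this kind of label automatically
kubectl label node <gpu-node-name> nvidia.com/gpu.present=true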