I was running some jobs with SLURM on my PC when the computer rebooted. Once the computer was back on, squeue showed that the jobs that had been running before the reboot were no longer running because the node was in a drain state; it seemed they had been automatically requeued after the reboot. I couldn't submit more jobs because the node was drained, so I ran scancel on the jobs that had been automatically requeued.
The problem is that I cannot free the node. I tried a few things:
- Restarting slurmctld and slurmd
- "Undraining" the node as explained in another question, but with no success; the commands ran without any output (I assume this is good), but the state of the node did not change (see the command sketch after this list)
- Manually rebooting the system again to see if anything would change
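For reference, the restart and "undrain" attempts were roughly along these lines (a sketch; the exact commands from the other question may differ slightly, and neuropc is my node name):

# restart the controller and the node daemon
sudo systemctl restart slurmctld
sudo systemctl restart slurmd

# mark the drained node as available again
sudo scontrol update nodename=neuropc state=resume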
Running scontrol show node neuropc gives:
[...]
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Low RealMemory [slurm@2023-02-05T22:06:33]
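If I understand the Reason line correctly, slurmd is reporting less memory than the RealMemory value configured for the node. A quick way to compare the two (a sketch, assuming a standard Ubuntu install; the config path may be /etc/slurm/slurm.conf on other setups):

# memory and CPUs actually detected by the node daemon
slurmd -C

# RealMemory the controller expects for this node
grep -i RealMemory /etc/slurm-llnl/slurm.conf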
Weirdly, the System Monitor shows that all 8 cores keep having activity between 5% and 15%, whereas the Processes tab shows only one application (TeamViewer) using less than 4% of the processor. So I suspect the jobs I was running somehow kept running after the reboot, or are still being held by SLURM.
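In case it helps, this is roughly how I checked whether anything job-related was still alive (a sketch; the process names are just what I looked for):

# any jobs still known to Slurm for my user?
squeue -u $USER

# any leftover Slurm job-step processes?
ps aux | grep -E 'slurmstepd|srun' | grep -v grep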
I use Ubuntu 20.04 and Slurm 19.05.5.
Answers
This answer solved my problem. Copying it here:
To strictly answer the question: no, they cannot. They might or might not be requeued depending on the Slurm configuration, and restarted either from scratch or from the latest checkpoint if the job is able to do checkpoint/restart. But there is no way a running process can survive a server reboot.
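For context on the "depending on the Slurm configuration" part, these are the kinds of options that control whether jobs get requeued after a node failure (a sketch of standard Slurm settings, not the configuration from my machine):

# slurm.conf: whether jobs may be requeued at all (1 = yes, 0 = no)
JobRequeue=1

# per job: opt in or out of automatic requeue at submission time
sbatch --requeue myjob.sh
sbatch --no-requeue myjob.sh

# manually requeue a specific job
scontrol requeue <jobid>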