
I was running some jobs under SLURM on my PC when the computer rebooted.

Once the computer was back on, squeue showed that the jobs that had been running before the reboot were no longer running because the node was in a drain state. It seemed they had been automatically requeued after the reboot.

I couldn’t submit more jobs because the node was drained, so I ran scancel on the jobs that had been automatically requeued.

The problem is that I cannot free the node. I tried a few things:

  1. Restarting slurmctld and slurmd

  2. "Undraining" the node as explained in another question, without success. The commands ran without any output (which I assume is good), but the state of the node did not change.

  3. Manually rebooting the system to see if anything would change

Running scontrol show node neuropc gives

[...]
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Low RealMemory [slurm@2023-02-05T22:06:33]

Weirdly, the System Monitor shows all 8 cores at between 5% and 15% activity, whereas the Processes tab shows only one app (TeamViewer), using less than 4% of the processor.

So I suspect the jobs I was running somehow kept running after the reboot, or are still being held by SLURM.

I use Ubuntu 20.04 and slurm 19.05.5.

2 Answers


  1. Chosen as BEST ANSWER

    This answer solved my problem. Copying it here:

    It could be that RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, run scontrol reconfigure, and then scontrol update nodename=transgen-4 state=resume. If that works, you could try raising the value a bit.
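
    To pick a value for RealMemory, it may help to compare it with what the kernel actually reports. The following is only a sketch, assuming a Linux node with /proc/meminfo; the helper names mem_total_mib and suggest_real_memory are made up for illustration, and the 95% margin is an arbitrary choice, not a Slurm recommendation. RealMemory is given in megabytes, so the numbers are directly comparable. Running slurmd -C on the node should also print the RealMemory that Slurm itself detects.

    ```shell
    #!/bin/sh
    # Hypothetical helper: total physical memory in MiB, read from the kernel.
    # /proc/meminfo reports MemTotal in kB, so divide by 1024.
    mem_total_mib() {
        awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
    }

    # Hypothetical helper: a RealMemory candidate slightly below the detected
    # total (95% is an arbitrary safety margin chosen for this sketch).
    suggest_real_memory() {
        echo $(( $(mem_total_mib) * 95 / 100 ))
    }

    echo "Detected memory (MiB): $(mem_total_mib)"
    echo "Candidate RealMemory:  $(suggest_real_memory)"

    # After editing RealMemory in slurm.conf, the steps from the answer are:
    #   scontrol reconfigure
    #   scontrol update nodename=<your-node> state=resume
    ```

    Replace <your-node> with the node name from your own slurm.conf (neuropc in the question). If the detected value is below the configured RealMemory, that would be consistent with the "Low RealMemory" reason shown by scontrol show node.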


  2. To strictly answer the question: no, they cannot. They might or might not be requeued, depending on the Slurm configuration, and restarted either from scratch or from the latest checkpoint if the job supports checkpoint/restart. But there is no way a running process can survive a server reboot.
