After a power outage, my nodes went to state down:
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
partMain  up     infinite       4  down*  node[001-004]
part1*    up     infinite       3  down*  node[002-004]
part2     up     infinite       1  down*  node001
I ran these commands:
$ /etc/init.d/slurm stop
$ /etc/init.d/slurm start
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
partMain  up     infinite       4  down   node[001-004]
part1*    up     infinite       3  down   node[002-004]
part2     up     infinite       1  down   node001
How can I bring my nodes back up?
$ sinfo -R
REASON          USER  TIMESTAMP            NODELIST
Not responding  root  2019-07-23T08:40:25  node[001-004]
$ scontrol update nodename=node001 state=idle
$ scontrol update nodename=node[001-004] state=resume
# the state changes to idle* for a few seconds, then returns to down*
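The trailing * means the controller considers the node not responding, so it is also worth confirming that slurmd is actually alive on the compute nodes themselves. A minimal check, assuming SSH access to the nodes and that SlurmdLogFile points at /var/log/slurmd.log (both are assumptions about this setup):
$ ssh node001 '/etc/init.d/slurm status'          # assuming the nodes use the same SysV script as the head node
$ ssh node001 'tail -n 50 /var/log/slurmd.log'    # adjust the path to your SlurmdLogFile setting
$ scontrol show node node001                      # from the head node: shows State and the Reason field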
$ service --status-all | grep 'slurm'
slurmctld (pid 24000) is running...
slurmdbd (pid 4113) is running...
$ systemctl status -l slurm
● slurm.service - LSB: slurm daemon management
   Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-07-24 13:45:38 CEST; 257ms ago
     Docs: man:systemd-sysv-generator(8)
  Process: 30094 ExecStop=/etc/rc.d/init.d/slurm stop (code=exited, status=1/FAILURE)
  Process: 30061 ExecStart=/etc/rc.d/init.d/slurm start (code=exited, status=0/SUCCESS)
 Main PID: 30069 (code=exited, status=1/FAILURE)
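Since ExecStop exited with status=1, the controller log may explain what the init script failed on; a quick look, assuming SlurmctldLogFile points at /var/log/slurmctld.log (an assumption; check slurm.conf):
$ tail -n 50 /var/log/slurmctld.log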
2 Answers
Try this after starting the daemons:
scontrol update nodename=node001 state=idle
See the reason why they are marked as down with
sinfo -R
Most probably, they will be listed as “unexpectedly rebooted”. You can resume them with
scontrol update nodename=node[001-004] state=resume
The ReturnToService parameter of slurm.conf controls whether or not the compute nodes are active when they wake up from an unexpected reboot.
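For reference, a minimal slurm.conf sketch of that parameter; the value below is illustrative, not this cluster's actual configuration:
# In slurm.conf (same file on the controller and the nodes):
#   ReturnToService=0  a DOWN node stays DOWN until an administrator resumes it (default)
#   ReturnToService=1  a node set DOWN for being non-responsive returns once slurmd registers with a valid config
#   ReturnToService=2  a DOWN node returns, whatever the reason, once slurmd registers with a valid config
ReturnToService=2
After editing, restart slurmctld (or run scontrol reconfigure) so the change takes effect.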