After a power outage, my Slurm nodes went into the down state

sinfo -a

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
partMain  up      infinite      4   down* node[001-004]
part1*    up      infinite      3   down* node[002-004]
part2     up      infinite      1   down* node001

I ran these commands:

 /etc/init.d/slurm stop
 /etc/init.d/slurm start

sinfo -a

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
partMain  up      infinite      4   down node[001-004]
part1*    up      infinite      3   down node[002-004]
part2     up      infinite      1   down node001

How can I bring my nodes back into service?


sinfo -R

REASON USER TIMESTAMP NODELIST
Not responding root 2019-07-23T08:40:25 node[001-004]

$ scontrol update nodename=node001 state=idle    
$ scontrol update nodename=node[001-004] state=resume

# The state changes to idle* for a few seconds, then returns to down*.

$ service --status-all | grep 'slurm'
slurmctld (pid 24000) is running...
slurmdbd (pid 4113) is running...


$ systemctl status -l slurm
● slurm.service - LSB: slurm daemon management
   Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-07-24 13:45:38 CEST; 257ms ago
     Docs: man:systemd-sysv-generator(8)
  Process: 30094 ExecStop=/etc/rc.d/init.d/slurm stop (code=exited, status=1/FAILURE)
  Process: 30061 ExecStart=/etc/rc.d/init.d/slurm start (code=exited, status=0/SUCCESS)
 Main PID: 30069 (code=exited, status=1/FAILURE)

2 Answers


  1. Try this after starting the daemons:

    scontrol update nodename=node001 state=idle

  2. Check why the nodes are marked as down with sinfo -R. Most likely they will be listed as “unexpectedly rebooted”. You can resume them with:

    scontrol update nodename=node[001-004] state=resume
    

    The ReturnToService parameter in slurm.conf controls whether a down node automatically returns to service once it registers with the controller again, for example after an unexpected reboot.
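
The check-then-resume steps above could be scripted roughly as follows. This is only a sketch: the function names nodes_with_reason and resume_commands are made up for illustration, and it assumes the default sinfo -R layout shown in the question (one header row, node list in the last column). It only prints the scontrol commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm.

# Read `sinfo -R` output on stdin and print the node list (last column)
# of every row whose text matches the pattern given as $1.
# Assumes the default column layout: REASON USER TIMESTAMP NODELIST.
nodes_with_reason() {
    awk -v pat="$1" '$1 != "REASON" && $0 ~ pat { print $NF }'
}

# Dry run: print one resume command per matching node list.
# Nothing is executed here; the output is meant to be reviewed first.
resume_commands() {
    nodes_with_reason "$1" | while read -r nodelist; do
        printf 'scontrol update nodename=%s state=resume\n' "$nodelist"
    done
}
```

Usage would look like: sinfo -R | resume_commands 'Not responding'. Nodes reported as "Not responding" tend to fall back to down after a resume, as seen in the question, until slurmd on each compute node is running and reachable again, so it is worth checking that first.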
