After a power outage, my nodes went to state down:
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
partMain  up     infinite       4  down*  node[001-004]
part1*    up     infinite       3  down*  node[002-004]
part2     up     infinite       1  down*  node001
I ran these commands:
$ /etc/init.d/slurm stop
$ /etc/init.d/slurm start
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
partMain  up     infinite       4  down   node[001-004]
part1*    up     infinite       3  down   node[002-004]
part2     up     infinite       1  down   node001
How can I bring my nodes back up?
$ sinfo -R
REASON          USER  TIMESTAMP            NODELIST
Not responding  root  2019-07-23T08:40:25  node[001-004]
$ scontrol update nodename=node001 state=idle
$ scontrol update nodename=node[001-004] state=resume
# the state changes to idle* for a few seconds, then returns to down*
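The trailing * means the controller considers the node not responding, so it is also worth confirming that slurmd is actually alive on the compute nodes themselves. A minimal check, assuming SSH access to the nodes and that SlurmdLogFile points at /var/log/slurmd.log (both are assumptions about this setup):
$ ssh node001 '/etc/init.d/slurm status'          # assuming the nodes use the same SysV script as the head node
$ ssh node001 'tail -n 50 /var/log/slurmd.log'    # adjust the path to your SlurmdLogFile setting
$ scontrol show node node001                      # from the head node: shows State and the Reason field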
$ service --status-all | grep 'slurm'
slurmctld (pid 24000) is running...
slurmdbd (pid 4113) is running...
$ systemctl status -l slurm
● slurm.service - LSB: slurm daemon management
   Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-07-24 13:45:38 CEST; 257ms ago
     Docs: man:systemd-sysv-generator(8)
  Process: 30094 ExecStop=/etc/rc.d/init.d/slurm stop (code=exited, status=1/FAILURE)
  Process: 30061 ExecStart=/etc/rc.d/init.d/slurm start (code=exited, status=0/SUCCESS)
 Main PID: 30069 (code=exited, status=1/FAILURE)
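Since ExecStop exited with status=1, the controller log may explain what the init script failed on; a quick look, assuming SlurmctldLogFile points at /var/log/slurmctld.log (an assumption; check slurm.conf):
$ tail -n 50 /var/log/slurmctld.log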
2 Answers
Try this after starting the daemons:
scontrol update nodename=node001 state=idle
See the reason why they are marked as down with
sinfo -R
Most probably, they will be listed as “unexpectedly rebooted”. You can resume them with
scontrol update nodename=node[001-004] state=resume
The ReturnToService parameter of slurm.conf controls whether or not the compute nodes are active when they wake up from an unexpected reboot.
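For reference, a minimal slurm.conf sketch of that parameter; the value below is illustrative, not this cluster's actual configuration:
# In slurm.conf (same file on the controller and the nodes):
#   ReturnToService=0  a DOWN node stays DOWN until an administrator resumes it (default)
#   ReturnToService=1  a node set DOWN for being non-responsive returns once slurmd registers with a valid config
#   ReturnToService=2  a DOWN node returns, whatever the reason, once slurmd registers with a valid config
ReturnToService=2
After editing, restart slurmctld (or run scontrol reconfigure) so the change takes effect.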