oicfar
Posts: 25
Joined: Tue Sep 17, 2019 7:39 am

Slurm: After reboot nodes are in state = down

Thu Sep 19, 2019 8:51 pm

Hi,

I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:

- 1 Pi 4 as manager
- 4 Pi 4 nodes

This works fine. But after every restart of the nodes I get this

Code: Select all

cluster@pi-manager:~ $ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
devcluster*    up   infinite      4   down pi-4-node-[1-4]
state. Then I can call

Code: Select all

sudo scontrol update NodeName=<node_name> State=RESUME
for every node. Sometimes all nodes are idle afterwards, and sometimes some are still down:
cluster@pi-manager:~ $ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
devcluster*    up   infinite      2   idle pi-4-node-[1-4]
Status of all nodes:

Code: Select all

cluster@pi-manager:~ $ scontrol show nodes
NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.24
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.06
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [[email protected]:39:32]

NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [[email protected]:39:30]

NodeName=pi-manager Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
For the nodes that are down, the Reason is:

Code: Select all

Reason=Node unexpectedly rebooted [[email protected]:39:30]
Why are the nodes in state = down after a reboot?

ejolson
Posts: 4928
Joined: Tue Mar 18, 2014 11:47 am

Re: Slurm: After reboot nodes are in state = down

Fri Sep 20, 2019 4:10 am

I don't know, but it happens for me too. Having the nodes start in the down state has never bothered me enough to investigate. It seems easy to switch their state to up again. If you figure out a more graceful way of shutting a cluster down, it would be nice to hear about it.

oicfar
Posts: 25
Joined: Tue Sep 17, 2019 7:39 am

Re: Slurm: After reboot nodes are in state = down

Sun Sep 22, 2019 6:48 pm

I understand the flow now.

Code: Select all

https://bugs.schedmd.com/show_bug.cgi?id=1602#c2
But I still don't know why the nodes are in the wrong state after they start.
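
One setting that may be relevant here is ReturnToService in slurm.conf. According to the slurm.conf documentation, with 0 a DOWN node stays down until an administrator changes its state, with 1 it returns automatically only if it was set DOWN for being non-responsive, and with 2 it returns as soon as it registers with a valid configuration, whatever the reason it was set DOWN. Below is a small sketch for checking the active value on the manager. It assumes that `scontrol show config` prints a line of the form `ReturnToService = <n>`; the function name return_to_service is made up for this example.

```shell
#!/bin/bash

# Print the ReturnToService value found in `scontrol show config` output
# that is fed in on stdin.
# Assumption: the line looks like "ReturnToService         = 1".
return_to_service() {
    awk '$1 == "ReturnToService" { print $3 }'
}

# Only query Slurm when scontrol is installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
    value=$(scontrol show config | return_to_service)
    echo "ReturnToService is currently: ${value:-unknown}"
fi
```

If this prints 0 or 1, a node flagged with "Node unexpectedly rebooted" would be expected to stay down until it is resumed by hand. Setting ReturnToService=2 in slurm.conf and restarting the Slurm daemons may be worth trying, though I have not verified it on this cluster.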

On my Slurm cluster I also installed Docker Swarm, and it works perfectly. When I shut down one node, that node is marked as down in the Swarm. When I reboot the node, it becomes visible in the Docker Swarm again after a few seconds.

I now have a script that sets the right state for the Slurm nodes. But it should work correctly by default.
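
A script of this kind, run on the manager, could look roughly like the sketch below. This is not necessarily the script mentioned above, only one way to do it: it reads the `scontrol show nodes` output, picks the nodes whose Reason is "Node unexpectedly rebooted", and resumes only those. The function name rebooted_nodes is made up for this example.

```shell
#!/bin/bash

# Read `scontrol show nodes` output on stdin and print the name of every
# node whose Reason line says it was unexpectedly rebooted.
rebooted_nodes() {
    awk '
        /^NodeName=/ { split($1, kv, "="); node = kv[2] }
        /Reason=Node unexpectedly rebooted/ { print node }
    '
}

# Only talk to Slurm when scontrol is installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show nodes | rebooted_nodes | while read -r node; do
        # Must run as a user that is allowed to update node state.
        scontrol update NodeName="$node" State=RESUME
    done
fi
```

Filtering on the Reason text means that nodes which are down for some other reason are left alone and still need a manual look.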

oicfar
Posts: 25
Joined: Tue Sep 17, 2019 7:39 am

Re: Slurm: After reboot nodes are in state = down

Sun Sep 22, 2019 7:47 pm

I haven't found a proper solution. For now, I added to the root crontab:

Code: Select all

sudo su

crontab -e
this line:

Code: Select all

@reboot /usr/local/bin/resume_node.sh
and this is the content of the resume_node.sh script:

Code: Select all

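#!/bin/bash

# Runs from the root crontab, so sudo is not needed here.
# Resume this node only if it was marked down because of the reboot.
scontrol -dd show node "$HOSTNAME" | grep -q "Node unexpectedly rebooted" && scontrol update NodeName="$HOSTNAME" State=RESUME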
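
One possible weakness of an @reboot job is timing: if it runs before slurmd has registered with the manager, the Reason may not be set yet and the check finds nothing. The sketch below is a variant that repeats the check a few times before giving up. The helper names retry and needs_resume, and the values of 12 tries with 10 seconds between them, are assumptions for illustration and not something I have tuned.

```shell
#!/bin/bash

# retry ATTEMPTS DELAY COMMAND...
# Run COMMAND until it succeeds or ATTEMPTS tries are used up.
retry() {
    retry_attempts=$1
    retry_delay=$2
    shift 2
    retry_i=0
    while [ "$retry_i" -lt "$retry_attempts" ]; do
        "$@" && return 0
        retry_i=$((retry_i + 1))
        sleep "$retry_delay"
    done
    return 1
}

# Succeeds only if this node is currently flagged with the reboot reason.
needs_resume() {
    scontrol -dd show node "$HOSTNAME" | grep -q "Node unexpectedly rebooted"
}

# Only talk to Slurm when scontrol is installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
    # Hypothetical values: up to 12 tries, 10 seconds apart.
    retry 12 10 needs_resume && scontrol update NodeName="$HOSTNAME" State=RESUME
fi
```

If the node never gets the reboot reason, the script just gives up after about two minutes and changes nothing.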
