Discussion:
[OMPI users] Node failure handling
Tim Burgess
2017-06-08 13:17:00 UTC
Hi!

So I know from searching the archive that this is a repeated topic of
discussion here (apologies for that), but since it's been a year or so
I thought I'd double-check whether anything has changed before I really
start tearing my hair out.

Is there a combination of MCA parameters or similar that will prevent
ORTE from aborting a job when it detects a node failure? This is using
the TCP BTL, under Slurm.

The application, not written by us and too complicated to re-engineer
at short notice, has a strictly master-slave communication pattern.
The master never blocks on communication from individual slaves, and
apparently can itself detect slaves that have silently disappeared and
reissue the work to those remaining. So from an application
standpoint I believe we should be able to handle this. However, in
all my testing so far the job is aborted as soon as the runtime system
figures out what is going on.
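
To give a sense of the pattern (this is only an illustrative sketch,
not the application's actual code), the master side looks roughly like
this:

```c
/* Rough sketch (illustration only, not the real application): the master
 * posts a non-blocking receive per slave and polls them, so a slave that
 * silently disappears never blocks the master, and its work can be
 * flagged for reissue after a timeout. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TAG_RESULT 1
#define TIMEOUT_S  60

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int nslaves = size - 1, pending = nslaves;
        MPI_Request *req = malloc(nslaves * sizeof(MPI_Request));
        double *res = malloc(nslaves * sizeof(double));
        time_t start = time(NULL);

        for (int s = 0; s < nslaves; s++)
            MPI_Irecv(&res[s], 1, MPI_DOUBLE, s + 1, TAG_RESULT,
                      MPI_COMM_WORLD, &req[s]);

        /* Poll instead of blocking on any one slave. */
        while (pending > 0 && time(NULL) - start < TIMEOUT_S) {
            for (int s = 0; s < nslaves; s++) {
                int done = 0;
                if (req[s] == MPI_REQUEST_NULL) continue;
                MPI_Test(&req[s], &done, MPI_STATUS_IGNORE);
                if (done) {            /* got a result; in the real code,   */
                    pending--;         /* hand out the next work unit here  */
                    printf("result from slave %d: %g\n", s + 1, res[s]);
                }
            }
        }
        /* Anything still pending belongs to slaves presumed dead:
         * cancel those receives and reissue their work to the survivors. */
        for (int s = 0; s < nslaves; s++) {
            if (req[s] != MPI_REQUEST_NULL) {
                MPI_Cancel(&req[s]);
                MPI_Wait(&req[s], MPI_STATUS_IGNORE);
            }
        }
        free(req); free(res);
    } else {
        double answer = 42.0;          /* stand-in for real work */
        MPI_Send(&answer, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```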

If not, do any users know of another MPI implementation that might
work for this use case? As far as I can tell, FT-MPI has been pretty
quiet the last couple of years?

Thanks in advance,

Tim
r***@open-mpi.org
2017-06-09 14:56:27 UTC
It has been a while since I tested it, but I believe the --enable-recovery option might do what you want.
Tim Burgess
2017-06-27 01:59:35 UTC
Hi Ralph, George,

Thanks very much for getting back to me. Alas, neither of these
options seems to accomplish the goal. Both in Open MPI v2.1.1 and on a
recent master (7002535), with Slurm's "--no-kill" and Open MPI's
"--enable-recovery", once the node reboots one gets the following
error:

```
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[58323,0],0] on node pnod0330
Remote daemon: [[58323,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
[pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
```

I haven't yet tried the hard-reboot case with ULFM (these nodes take
forever to come back up), but earlier experiments SIGKILLing the orted
on a compute node led to a message very similar to the one above, so at
this point I'm not optimistic...

I think my next step is to try several separate mpiruns and use
MPI_Comm_{connect,accept} to plumb everything together before the
application starts. I notice this is the subject of some recent work
on ompi master. Even though the mpiruns will all be associated with the
same ompi-server, do you think this could be sufficient to isolate the
failures?
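
For reference, the wiring I have in mind is roughly the standard
publish/lookup dance below (sketch only; the service name "master-port"
is a placeholder), with each mpirun pointed at the same ompi-server via
its --ompi-server option:

```c
/* Sketch of the intended plumbing (the service name "master-port" is a
 * placeholder). Each side runs under its own mpirun; the port name is
 * exchanged through the common ompi-server rendezvous point. */
#include <mpi.h>

/* master side: open a port, publish it, and accept one connection */
MPI_Comm accept_worker(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("master-port", MPI_INFO_NULL, port);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    return inter;
}

/* worker side (separate mpirun): look up the port and connect */
MPI_Comm connect_to_master(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    MPI_Lookup_name("master-port", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    return inter;
}
```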

Cheers,
Tim
r***@open-mpi.org
2017-06-27 02:19:22 UTC
Ah - you should have told us you are running under slurm. That does indeed make a difference. When we launch the daemons, we do so with "srun --kill-on-bad-exit" - this means that slurm automatically kills the job if any daemon terminates. We take that measure to avoid leaving zombies behind in the event of a failure.

Try adding "-mca plm rsh" to your mpirun cmd line. This will use the rsh launcher instead of the slurm one, which gives you more control.
Tim Burgess
2017-06-27 02:39:09 UTC
Hi Ralph,

Thanks for the quick response.

Just tried again, not under slurm, but with the same result... (though I
just did a kill -9 on the orted on the remote node this time)

Any ideas? Do you think my multiple-mpirun idea is worth trying?

Cheers,
Tim


```
[***@bud96 mpi_resilience]$
/d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
--host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
--debug-daemons $(pwd)/test
( some output from job here )
( I then do kill -9 `pgrep orted` on pnod0331 )
bash: line 1: 161312 Killed
/d/home/user/2017/openmpi-master-20170608/bin/orted -mca
orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
-mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
"bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
"581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[8878,0],0] on node bud96
Remote daemon: [[8878,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
[bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
```
r***@open-mpi.org
2017-06-27 04:14:02 UTC
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a bug if we can’t.
George Bosilca
2017-06-27 10:35:20 UTC
I would also be interested in having Slurm keep the remaining processes
around; we have been struggling with this on many of the NERSC machines.
That being said, the error message comes from the orteds, and it suggests
that they are giving up because they lost the connection to a peer. I was
not aware that this capability exists in the master version of ORTE, but
if it does, it makes our life easier.

George.
r***@open-mpi.org
2017-06-27 13:31:16 UTC
Actually, the error message is coming from mpirun to indicate that it lost connection to one (or more) of its daemons. This happens because slurm only knows about the remote daemons - mpirun was started outside of “srun”, and so slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills the daemons on the compute nodes, not mpirun. As a result, we always see that error message.

The capability should exist as an option - it used to, but probably has fallen into disrepair. I’ll see if I can bring it back.
r***@open-mpi.org
2017-06-27 16:08:35 UTC
Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771
George Bosilca
2017-06-09 15:58:42 UTC
Tim,

FT-MPI is gone, but the ideas it put forward have been refined, and the
software algorithms behind them improved, in a newer (and supported)
project: ULFM. It features a smaller API with a much more flexible
approach. You can find more information about it at
http://fault-tolerance.org/. The corresponding implementation (based on
an older version of Open MPI, 1.6) is available at
https://bitbucket.org/icldistcomp/ulfm
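
A minimal sketch of the recovery idiom looks roughly like this (the
MPIX_* symbols are ULFM extensions provided by that implementation, not
standard MPI; see the web site for the exact API):

```c
/* Hedged sketch of the ULFM recovery idiom: errors are returned instead
 * of aborting, a dead peer surfaces as MPIX_ERR_PROC_FAILED, and the
 * survivors shrink the communicator and keep going. The MPIX_* symbols
 * are ULFM extensions (declared in mpi-ext.h), not standard MPI. */
#include <mpi.h>
#include <mpi-ext.h>

static void shrink_and_continue(MPI_Comm *comm)
{
    MPI_Comm survivors;
    MPIX_Comm_failure_ack(*comm);          /* acknowledge the failed ranks */
    MPIX_Comm_shrink(*comm, &survivors);   /* communicator without them    */
    MPI_Comm_free(comm);
    *comm = survivors;
}

void master_step(MPI_Comm *comm)
{
    /* Have errors returned to us rather than aborting the job. */
    MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(*comm);           /* any collective or receive   */
    if (rc == MPIX_ERR_PROC_FAILED)
        shrink_and_continue(comm);         /* then reissue the lost work  */
}
```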

George.