Discussion:
[OMPI users] Node failure handling
Tim Burgess
2017-06-08 13:17:00 UTC
Hi!

So I know from searching the archive that this is a repeated topic of
discussion here (apologies for that), but since it's been a year or so
I thought I'd double-check whether anything has changed before I really
start tearing my hair out.

Is there a combination of MCA parameters or similar that will prevent
ORTE from aborting a job when it detects a node failure? This is using
the TCP BTL, under Slurm.

The application, not written by us and too complicated to re-engineer
at short notice, has a strictly master-slave communication pattern.
The master never blocks on communication from individual slaves, and
apparently can itself detect slaves that have silently disappeared and
reissue the work to those remaining. So from an application
standpoint I believe we should be able to handle this. However, in
all my testing so far the job is aborted as soon as the runtime system
figures out what is going on.
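
To give a sense of the pattern (this is only an illustrative sketch,
not the application's actual code), the master side looks roughly like
this:

```c
/* Rough sketch (illustration only, not the real application): the master
 * posts a non-blocking receive per slave and polls them, so a slave that
 * silently disappears never blocks the master, and its work can be
 * flagged for reissue after a timeout. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TAG_RESULT 1
#define TIMEOUT_S  60

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int nslaves = size - 1, pending = nslaves;
        MPI_Request *req = malloc(nslaves * sizeof(MPI_Request));
        double *res = malloc(nslaves * sizeof(double));
        time_t start = time(NULL);

        for (int s = 0; s < nslaves; s++)
            MPI_Irecv(&res[s], 1, MPI_DOUBLE, s + 1, TAG_RESULT,
                      MPI_COMM_WORLD, &req[s]);

        /* Poll instead of blocking on any one slave. */
        while (pending > 0 && time(NULL) - start < TIMEOUT_S) {
            for (int s = 0; s < nslaves; s++) {
                int done = 0;
                if (req[s] == MPI_REQUEST_NULL) continue;
                MPI_Test(&req[s], &done, MPI_STATUS_IGNORE);
                if (done) {            /* got a result; in the real code,   */
                    pending--;         /* hand out the next work unit here  */
                    printf("result from slave %d: %g\n", s + 1, res[s]);
                }
            }
        }
        /* Anything still pending belongs to slaves presumed dead:
         * cancel those receives and reissue their work to the survivors. */
        for (int s = 0; s < nslaves; s++) {
            if (req[s] != MPI_REQUEST_NULL) {
                MPI_Cancel(&req[s]);
                MPI_Wait(&req[s], MPI_STATUS_IGNORE);
            }
        }
        free(req); free(res);
    } else {
        double answer = 42.0;          /* stand-in for real work */
        MPI_Send(&answer, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```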

If not, do any users know of another MPI implementation that might
work for this use case? As far as I can tell, FT-MPI has been pretty
quiet the last couple of years?

Thanks in advance,

Tim
r***@open-mpi.org
2017-06-09 14:56:27 UTC
It has been a while since I tested it, but I believe the --enable-recovery option might do what you want.
Tim Burgess
2017-06-27 01:59:35 UTC
Hi Ralph, George,

Thanks very much for getting back to me. Alas, neither of these
options seems to accomplish the goal. Both in Open MPI v2.1.1 and on a
recent master (7002535), with Slurm's "--no-kill" and Open MPI's
"--enable-recovery", once the node reboots one gets the following
error:

```
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[58323,0],0] on node pnod0330
Remote daemon: [[58323,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
[pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
```

I haven't yet tried the hard-reboot case with ULFM (these nodes take
forever to come back up), but earlier experiments SIGKILLing the orted
on a compute node led to a message very similar to the one above, so at
this point I'm not optimistic...

I think my next step is to try several separate mpiruns and use
MPI_Comm_{connect,accept} to plumb everything together before the
application starts. I notice this is the subject of some recent work
on ompi master. Even though the mpiruns will all be associated with the
same ompi-server, do you think this could be sufficient to isolate the
failures?
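
For reference, the wiring I have in mind is roughly the standard
publish/lookup dance below (sketch only; the service name "master-port"
is a placeholder), with each mpirun pointed at the same ompi-server via
its --ompi-server option:

```c
/* Sketch of the intended plumbing (the service name "master-port" is a
 * placeholder). Each side runs under its own mpirun; the port name is
 * exchanged through the common ompi-server rendezvous point. */
#include <mpi.h>

/* master side: open a port, publish it, and accept one connection */
MPI_Comm accept_worker(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("master-port", MPI_INFO_NULL, port);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    return inter;
}

/* worker side (separate mpirun): look up the port and connect */
MPI_Comm connect_to_master(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    MPI_Lookup_name("master-port", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    return inter;
}
```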

Cheers,
Tim
r***@open-mpi.org
2017-06-27 02:19:22 UTC
Ah - you should have told us you are running under slurm. That does indeed make a difference. When we launch the daemons, we do so with "srun --kill-on-bad-exit" - this means that slurm automatically kills the job if any daemon terminates. We take that measure to avoid leaving zombies behind in the event of a failure.

Try adding "-mca plm rsh" to your mpirun cmd line. This will use the rsh launcher instead of the slurm one, which gives you more control.
Tim Burgess
2017-06-27 02:39:09 UTC
Hi Ralph,

Thanks for the quick response.

Just tried again, not under slurm, but with the same result... (though I
just did a kill -9 on the orted on the remote node this time)

Any ideas? Do you think my multiple-mpirun idea is worth trying?

Cheers,
Tim


```
[***@bud96 mpi_resilience]$
/d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
--host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
--debug-daemons $(pwd)/test
( some output from job here )
( I then do kill -9 `pgrep orted` on pnod0331 )
bash: line 1: 161312 Killed
/d/home/user/2017/openmpi-master-20170608/bin/orted -mca
orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
-mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
"bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
"581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[8878,0],0] on node bud96
Remote daemon: [[8878,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
[bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
```
r***@open-mpi.org
2017-06-27 04:14:02 UTC
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a bug if we can’t.
George Bosilca
2017-06-27 10:35:20 UTC
I would also be interested in having Slurm keep the remaining processes
around; we have been struggling with this on many of the NERSC machines.
That being said, the error message comes from the orteds, and it suggests
that they are giving up because they lost the connection to a peer. I was
not aware that this capability exists in the master version of ORTE, but
if it does, it makes our life easier.

George.
r***@open-mpi.org
2017-06-27 13:31:16 UTC
Actually, the error message is coming from mpirun to indicate that it lost connection to one (or more) of its daemons. This happens because slurm only knows about the remote daemons - mpirun was started outside of “srun”, and so slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills the daemons on the compute nodes, not mpirun. As a result, we always see that error message.

The capability should exist as an option - it used to, but probably has fallen into disrepair. I’ll see if I can bring it back.
r***@open-mpi.org
2017-06-27 16:08:35 UTC
Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771
George Bosilca
2017-06-09 15:58:42 UTC
Tim,

FT-MPI is gone, but the ideas it put forward have been refined, and the
software algorithms behind them improved, in a newer (and supported)
project: ULFM. It features a smaller API with a much more flexible
approach. You can find more information about it at
http://fault-tolerance.org/. The corresponding implementation (based on
an older version of Open MPI, 1.6) is available at
https://bitbucket.org/icldistcomp/ulfm
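
A minimal sketch of the recovery idiom looks roughly like this (the
MPIX_* symbols are ULFM extensions provided by that implementation, not
standard MPI; see the web site for the exact API):

```c
/* Hedged sketch of the ULFM recovery idiom: errors are returned instead
 * of aborting, a dead peer surfaces as MPIX_ERR_PROC_FAILED, and the
 * survivors shrink the communicator and keep going. The MPIX_* symbols
 * are ULFM extensions (declared in mpi-ext.h), not standard MPI. */
#include <mpi.h>
#include <mpi-ext.h>

static void shrink_and_continue(MPI_Comm *comm)
{
    MPI_Comm survivors;
    MPIX_Comm_failure_ack(*comm);          /* acknowledge the failed ranks */
    MPIX_Comm_shrink(*comm, &survivors);   /* communicator without them    */
    MPI_Comm_free(comm);
    *comm = survivors;
}

void master_step(MPI_Comm *comm)
{
    /* Have errors returned to us rather than aborting the job. */
    MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(*comm);           /* any collective or receive   */
    if (rc == MPIX_ERR_PROC_FAILED)
        shrink_and_continue(comm);         /* then reissue the lost work  */
}
```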

George.