I would also be interested in having slurm keep the remaining processes around; we have been struggling with this on many of the NERSC machines. That said, the error message comes from orted, and it suggests that the daemons are giving up because they lose the connection to a peer. I was not aware that this capability exists in the master version of ORTE, but if it does, it makes our lives easier.
George.
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It's a bug if we can't.
Post by Tim Burgess
Hi Ralph,
Thanks for the quick response.
Just tried again, this time not under slurm, but got the same result...
(though I just did kill -9 on the orted on the remote node this time)
Any ideas? Do you think my multiple-mpirun idea is worth trying?
Cheers,
Tim
```
/d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
--host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
--debug-daemons $(pwd)/test
( some output from job here )
( I then do kill -9 `pgrep orted` on pnod0331 )
bash: line 1: 161312 Killed
/d/home/user/2017/openmpi-master-20170608/bin/orted -mca
orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
-mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
"581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[8878,0],0] on node bud96
Remote daemon: [[8878,0],1] on node pnod0331
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
[bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
```
Ah - you should have told us you are running under slurm. That does indeed make a difference. When we launch the daemons, we do so with "srun --kill-on-bad-exit" - this means that slurm automatically kills the job if any daemon terminates. We take that measure to avoid leaving zombies behind in the event of a failure.
Try adding "-mca plm rsh" to your mpirun cmd line. This will use the rsh launcher instead of the slurm one, which gives you more control.
Post by Tim Burgess
Hi Ralph, George,
Thanks very much for getting back to me. Alas, neither of these
options seem to accomplish the goal. Both in OpenMPI v2.1.1 and on a
recent master (7002535), with slurm's "--no-kill" and openmpi's
"--enable-recovery", once the node reboots one gets the following:
```
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[58323,0],0] on node pnod0330
Remote daemon: [[58323,0],1] on node pnod0331
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
[pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
```
I haven't yet tried the hard reboot case with ULFM (these nodes take
forever to come back up), but earlier experiments SIGKILLing the orted
on a compute node led to a very similar message as above, so at this
point I'm not optimistic...
I think my next step is to try with several separate mpiruns and use
mpi_comm_{connect,accept} to plumb everything together before the
application starts. I notice this is the subject of some recent work
on ompi master. Even though the mpiruns will all be associated with the
same ompi-server, do you think this could be sufficient to isolate the
failures?
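For reference, the multi-mpirun idea would look roughly like the sketch below, using Open MPI's persistent ompi-server as the rendezvous point. This is only an outline of the workflow, not a tested recipe; the URI file path and the ./master and ./worker program names are placeholders.

```shell
# Start one persistent ompi-server; -r / --report-uri writes its contact
# URI to a file that the separate mpiruns can all read:
ompi-server --report-uri /tmp/ompi-server.uri &

# Launch each piece of the job as its own mpirun, all pointed at the same
# server. A daemon/node failure inside one mpirun should then not take
# down the processes launched by the others:
mpirun --ompi-server file:/tmp/ompi-server.uri -np 1 ./master &
mpirun --ompi-server file:/tmp/ompi-server.uri -np 4 ./worker &

# Inside the application, the pieces would then rendezvous before the real
# work starts, e.g. via MPI_Publish_name / MPI_Lookup_name followed by
# MPI_Comm_accept / MPI_Comm_connect, to plumb everything together.
```

Whether this actually isolates failures would still depend on how the connected communicators behave once a peer job dies, so it seems worth a small-scale test before committing to it.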
Cheers,
Tim
Post by r***@open-mpi.org
It has been a while since I tested it, but I believe the --enable-recovery option might do what you want.
Post by Tim Burgess
Hi!
So I know from searching the archive that this is a repeated topic of
discussion here, and apologies for that, but since it's been a year or
so I thought I'd double-check whether anything has changed before
really starting to tear my hair out too much.
Is there a combination of MCA parameters or similar that will prevent
ORTE from aborting a job when it detects a node failure? This is
using the tcp btl, under slurm.
The application, not written by us and too complicated to re-engineer
at short notice, has a strictly master-slave communication pattern.
The master never blocks on communication from individual slaves, and
apparently can itself detect slaves that have silently disappeared and
reissue the work to those remaining. So from an application
standpoint I believe we should be able to handle this. However, in
all my testing so far the job is aborted as soon as the runtime system
figures out what is going on.
If not, do any users know of another MPI implementation that might
work for this use case? As far as I can tell, FT-MPI has been pretty
quiet the last couple of years?
Thanks in advance,
Tim
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users