[OMPI users] mpiexec hangs instead of exiting if a worker node dies
Tobias Pfeiffer
2017-09-22 09:58:34 UTC
Hi,

I am currently trying to learn about fault tolerance in MPI, so I have been
experimenting with what happens when I kill various components of my MPI
setup. In some situations I see unexpected hangs.

I use the following MPI script:

#!/usr/bin/env python

from mpi4py import MPI
import time
import sys
import os
import signal

comm = MPI.COMM_WORLD

for i in range(100):
    print("Hello @ %d! I'm rank %d from %d running in total..." % (i,
          comm.rank, comm.size))
    time.sleep(2)
    if comm.rank == 1 and i == 2:
        os.system("pstree -p")
        # TRY VARIOUS THINGS IN THE LINE BELOW
        os.kill(os.getpid(), signal.SIGTERM)

comm.Barrier()

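I launch it with mpiexec on three nodes, roughly like this (the hostfile and
the script file name here are just placeholders for my actual ones):

mpiexec -np 3 --hostfile hostfile python3 mpi_hello.py
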
When I run the script above on three nodes, I see the following output:

Hello @ 0! I'm rank 0 from 3 running in total...
Hello @ 0! I'm rank 1 from 3 running in total...
Hello @ 0! I'm rank 2 from 3 running in total...
Hello @ 1! I'm rank 0 from 3 running in total...
Hello @ 1! I'm rank 1 from 3 running in total...
Hello @ 1! I'm rank 2 from 3 running in total...
Hello @ 2! I'm rank 0 from 3 running in total...
Hello @ 2! I'm rank 1 from 3 running in total...
Hello @ 2! I'm rank 2 from 3 running in total...
Hello @ 3! I'm rank 0 from 3 running in total...
Hello @ 3! I'm rank 2 from 3 running in total...

timeout(1)---sshd(8)---sshd(18)---orted(19)-+-python3(23)-+-sh(26)---pstree(27)
                                            |             |-{python3}(24)
                                            |             `-{python3}(25)
                                            |-{orted}(20)
                                            |-{orted}(21)
                                            `-{orted}(22)
Hello @ 4! I'm rank 2 from 3 running in total...

--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 23 on node 8f528c301215
exited on signal 15 (Terminated).

--------------------------------------------------------------------------
[program exit]

(Note that each process runs in a Docker container, so these are in fact
all the processes visible to my program.)

This is nice, but if I want to know what happens when a whole node or the
network fails, then I also need to kill the other components, so I changed
`os.kill(os.getpid(), signal.SIGTERM)` to `os.kill(1, signal.SIGTERM)` so
that all processes on that particular node die (PID 1 is the `timeout`
process at the root of the process tree above, so the whole container goes
down). I guess this is very similar to what would happen if I rebooted the
system.
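
Concretely, the modified part of the loop looks like this (everything else in
the script stays the same):

    if comm.rank == 1 and i == 2:
        os.system("pstree -p")
        # kill PID 1 of this node's container instead of only this rank
        os.kill(1, signal.SIGTERM)
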
The output is then as follows:

Hello @ 0! I'm rank 1 from 3 running in total...
Hello @ 0! I'm rank 0 from 3 running in total...
Hello @ 0! I'm rank 2 from 3 running in total...
Hello @ 1! I'm rank 1 from 3 running in total...
Hello @ 1! I'm rank 0 from 3 running in total...
Hello @ 1! I'm rank 2 from 3 running in total...
Hello @ 2! I'm rank 1 from 3 running in total...
Hello @ 2! I'm rank 0 from 3 running in total...
Hello @ 2! I'm rank 2 from 3 running in total...

timeout(1)---sshd(6)---sshd(16)---orted(17)-+-python3(21)-+-sh(24)---pstree(25)
                                            |             |-{python3}(22)
                                            |             `-{python3}(23)
                                            |-{orted}(18)
                                            |-{orted}(19)
                                            `-{orted}(20)
Hello @ 3! I'm rank 1 from 3 running in total...
Connection to 43982adfb734 closed by remote host.

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

--------------------------------------------------------------------------
Hello @ 3! I'm rank 0 from 3 running in total...
Hello @ 3! I'm rank 2 from 3 running in total...
Hello @ 4! I'm rank 2 from 3 running in total...
[program hangs]

I ran this several times, and sometimes I would see the following output
instead:

Hello @ 0! I'm rank 2 from 3 running in total...
Hello @ 0! I'm rank 0 from 3 running in total...
Hello @ 0! I'm rank 1 from 3 running in total...
Hello @ 1! I'm rank 2 from 3 running in total...
Hello @ 1! I'm rank 0 from 3 running in total...
Hello @ 1! I'm rank 1 from 3 running in total...
Hello @ 2! I'm rank 2 from 3 running in total...
Hello @ 2! I'm rank 0 from 3 running in total...
Hello @ 2! I'm rank 1 from 3 running in total...
Hello @ 3! I'm rank 2 from 3 running in total...
Hello @ 3! I'm rank 0 from 3 running in total...

timeout(1)---sshd(7)---sshd(17)---orted(18)-+-python3(22)-+-sh(25)---pstree(26)
                                            |             |-{python3}(23)
                                            |             `-{python3}(24)
                                            |-{orted}(19)
                                            |-{orted}(20)
                                            `-{orted}(21)
Hello @ 3! I'm rank 1 from 3 running in total...

--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon   : [[18620,0],0] on node c971706813c7
Remote daemon: [[18620,0],1] on node 626989823da6

Connection to 626989823da6 closed by remote host.
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------
[program hangs]

The unexpected behavior is that in both cases `mpiexec` does not terminate,
but hangs. On the node that runs the `mpiexec` command, I see that two
`ssh` processes and one `python3` process are in <defunct> state.

Can you please let me know what I can do so that the `mpiexec` process
terminates when one of the worker nodes goes down?

Thank you,
Tobias
