Discussion:
[OMPI users] Abort/ Deadlock issue in allreduce
Christof Koehler
2016-12-07 10:38:00 UTC
Hello everybody,

I am observing a deadlock in allreduce with openmpi 2.0.1 on a single
node. A stack trace (pstack) of one rank is below, showing the program (vasp
5.3.5) and the two psm2 progress threads. However:

In fact, the vasp input is not ok and the program should abort at the point
where it now hangs; it does so when using mvapich 2.2. With openmpi 2.0.1 it
just deadlocks in some allreduce operation. Originally it was started with 20
ranks; when it hangs there are only 19 left. From the PIDs I would
assume it is the master rank which is missing. So this looks like a
failure to terminate.

With 1.10 I get a clean
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 18789 on node node109
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any ideas what to try? Of course, in this situation it may well be the
program itself. Still, given the observed difference between 2.0.1 and 1.10
(and mvapich), this might be interesting to someone.

Best Regards

Christof


Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
#0 0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
#1 0x00002ad35d114f42 in epoll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002ad35d16e996 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00002ad35b155ced in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
#0 0x00002ad35b14b69d in poll () from /lib64/libc.so.6
#1 0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002ad35d0c61d1 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00002ad35b155ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
#0 0x00002ad35b14b69d in poll () from /lib64/libc.so.6
#1 0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002ad35d0c28cf in opal_progress () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002ad35adce8d8 in ompi_request_wait_completion () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#5 0x00002ad35adce838 in mca_pml_cm_recv () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#6 0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#7 0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#8 0x00002ad35ad1f0f4 in PMPI_Allreduce () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#9 0x00002ad35aa99c38 in pmpi_allreduce__ () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
#10 0x000000000045f8c6 in m_sum_i_ ()
#11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
#12 0x00000000004331ff in vamp () at main.F:2640
#13 0x000000000040ea1e in main ()
#14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
#15 0x000000000040e929 in _start ()
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Gilles Gouaillardet
2016-12-07 11:23:43 UTC
Christoph,

can you please try again with

mpirun --mca btl tcp,self --mca pml ob1 ...

that will help figure out whether pml/cm and/or mtl/psm2 is involved or not.


if that causes a crash, then can you please try

mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

that will help figure out whether coll/tuned is involved or not.

coll/tuned is known not to correctly handle collectives invoked with different
but matching signatures
(e.g. some tasks invoke the collective with one vector of N elements,
and some others invoke the same collective with N vectors of one element each)
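
For example, a rough sketch of such a call pattern (not from VASP; the variable
names are made up, and MPI_Bcast is used only to illustrate the idea of
different but matching signatures):

   ! rank 0 describes the buffer as n separate MPI_INTEGER elements
   if (rank == 0) then
      call MPI_Bcast(ivec, n, MPI_INTEGER, 0, comm, ierr)
   else
      ! the other ranks describe the same n integers as one element of a
      ! contiguous derived type -- a different but matching signature
      call MPI_Type_contiguous(n, MPI_INTEGER, vtype, ierr)
      call MPI_Type_commit(vtype, ierr)
      call MPI_Bcast(ivec, 1, vtype, 0, comm, ierr)
   end if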


if everything fails, can you describe how MPI_Allreduce is invoked?
/* number of tasks, datatype, number of elements */



Cheers,

Gilles

Christof Koehler
2016-12-07 13:07:27 UTC
Hello,

thank you for the fast answer.
Post by Gilles Gouaillardet
Christoph,
can you please try again with
mpirun --mca btl tcp,self --mca pml ob1 ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1 /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

Deadlocks/ hangs, has no effect.
Post by Gilles Gouaillardet
mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

Deadlocks/ hangs, has no effect. There is additional output.

wannier90 error: examine the output/error file for details
[node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Please note: The "wannier90 error: examine the output/error file for
details" is expected, there is in fact an error in the input file. It
is supposed to terminate.

However, with mvapich2 and openmpi 1.10.4 it terminates
completely, i.e. I get my shell prompt back. Whether a segfault is involved with
mvapich2 (as is apparently the case with openmpi 1.10.4, based on the
termination message) I do not know. I tried

export MV2_DEBUG_SHOW_BACKTRACE=1
mpirun -n 20 /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi

but did not get any indication of a problem (segfault), the last lines
are

calculate QP shifts <psi_nk| G(iteration)W_0 |psi_nk>: iteration 1
writing wavefunctions
wannier90 error: examine the output/error file for details
node109 14:00 /scratch/ckoe/gw %

The last line is my shell prompt.
Post by Gilles Gouaillardet
if everything fails, can you describe how MPI_Allreduce is invoked?
/* number of tasks, datatype, number of elements */
Difficult, this is not our code in the first place [1] and the problem
occurs when using an ("officially" supported) third party library [2].

From the stack trace of the hanging process the vasp routine which calls
allreduce is "m_sum_i_". That is in the mpi.F source file. Allreduce is
called as

CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
& MPI_SUM, COMM%MPI_COMM, ierror )

n and ivec(1) are of data type integer. It was originally run with 20 ranks; I
tried 2 ranks now as well and it hangs, too. With one (!) rank

mpirun -n 1 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

I of course get a shell prompt back.

I then started it normally in the shell with 2 ranks,
mpirun -n 2 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
and attached gdb to the rank with the lowest pid (3478). I do not get a prompt
back (it hangs), the second rank 3479 is still at 100 % CPU and mpirun is still a process
I can see with "ps", but gdb says
(gdb) continue <- that is where I attached it !
Continuing.
[Thread 0x2b8366806700 (LWP 3480) exited]
[Thread 0x2b835da1c040 (LWP 3478) exited]
[Inferior 1 (process 3478) exited normally]
(gdb) bt
No stack.

So, as far as gdb is concerned, the rank with the lowest pid (which is
gone while the other rank is still eating CPU time) terminated normally?

I hope this helps. I have only very basic experience with debuggers
(never really needed them) and even less with using them in parallel.
I can try to catch the contents of ivec, but I do not think that would
be helpful? If you need them I can try of course; I have no idea how
large the vector is.
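
If it would help, I could also put a temporary write directly before that call
in mpi.F to see n and the first entries of ivec, something like this (just a
sketch, not the actual VASP code):

   WRITE(*,*) 'm_sum_i: n =', n, ' ivec(1:min(n,4)) =', ivec(1:MIN(n,4))
   CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
  &                    MPI_SUM, COMM%MPI_COMM, ierror )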


Best Regards

Christof

[1] https://www.vasp.at/
[2] http://www.wannier.org/, Old version 1.2
Christof Koehler
2016-12-07 13:17:52 UTC
Hello again,

attaching gdb to mpirun, the back trace when it hangs is:
(gdb) bt
#0 0x00002b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x00002b039e1a9c42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039e1a2751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00000000004056ef in orterun (argc=13, argv=0x7ffef20a79f8) at orterun.c:1057
#4 0x00000000004035a0 in main (argc=13, argv=0x7ffef20a79f8) at main.c:13

Using pstack on mpirun I see several threads, below

Thread 5 (Thread 0x2b03a33b0700 (LWP 11691)):
#0 0x00002b039f743413 in select () from /usr/lib64/libc.so.6
#1 0x00002b039c599979 in listen_thread () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-rte.so.20
#2 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b03a3be9700 (LWP 11692)):
#0 0x00002b039f74c2c3 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00002b039e1a0f42 in epoll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039e1a2751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002b039e1fa996 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b03a3dea700 (LWP 11693)):
#0 0x00002b039f743413 in select () from /usr/lib64/libc.so.6
#1 0x00002b039e1f3a5f in listen_thread () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b03a3feb700 (LWP 11694)):
#0 0x00002b039f743413 in select () from /usr/lib64/libc.so.6
#1 0x00002b039c55616b in listen_thread_fn () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-rte.so.20
#2 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b039c324100 (LWP 11690)):
#0 0x00002b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x00002b039e1a9c42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039e1a2751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00000000004056ef in orterun (argc=13, argv=0x7ffef20a79f8) at orterun.c:1057
#4 0x00000000004035a0 in main (argc=13, argv=0x7ffef20a79f8) at main.c:13

Best Regards

Christof
Christof Koehler
2016-12-07 15:07:46 UTC
Hello,
Christof,
out of curiosity, can you run
dmesg
and see if you find some tasks killed by the oom-killer ?
Definitively not the oom-killer. It is a real tiny example. I checked
the machines logfile and dmesg.
the error message you see is a consequence of a task that unexpectedly died.
and there is no evidence the task crashed or was killed.
Yes, confusing, isn't it?
when you observe a hang with two tasks, you can
- retrieve the pids with ps
- run 'pstack <pid>' on both pids in order to collect the stacktrace.
When it hangs, one is already gone! The pstack traces I sent are from the
survivor(s). It is not terminating completely as it should.
assuming they both hang in MPI_Allreduce(), the relevant part to us is
- datatype (MPI_INT)
- count (n)
- communicator (COMM%MPI_COMM) (size, check this is the same communicator
used by all tasks)
- is all the buffer accessible (ivec(1:n))
As I said, the root rank terminates (according to gdb normally). The other
remains and hangs in allreduce. Possibly because its partner (the
root rank) is gone without saying goodbye properly.

This is not a real hang IMO, but a failure to terminate all ranks cleanly.

I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?

I will try to get the information you want, but I will have to figure
out how to do that first.
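
One thing I could try is to attach strace to mpirun and to the surviving rank
and trace only signals, along these lines (a sketch, with the pids filled in
by hand):

strace -f -tt -e trace=signal -p <pid of mpirun>
strace -tt -e trace=signal -p <pid of surviving rank>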

Cheers

Christof
Noam Bernstein
2016-12-07 15:19:10 UTC
Post by Christof Koehler
I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?
Do you know where in the code the termination call is? Is it actually calling mpi_abort(), or just doing something ugly like calling fortran “stop”? If the latter, would that explain a possible hang?

Presumably someone here can comment on what the standard says about the validity of terminating without mpi_abort.

Actually, if you’re willing to share enough input files to reproduce, I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was apparently addressed by some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case.

Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Christof Koehler
2016-12-07 17:37:47 UTC
Hello,
Post by Noam Bernstein
Post by Christof Koehler
I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?
Do you know where in the code the termination call is? Is it actually calling mpi_abort(), or just doing something ugly like calling fortran “stop”? If the latter, would that explain a possible hang?
Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90 input contains
an error: a restart is requested, but the wannier90.chk file with the restart
information is missing.
"
Exiting.......
Error: restart requested but wannier90.chk file not found
"
So it must terminate.

The termination happens in libwannier.a, in the source file io.F90:

write(stdout,*) 'Exiting.......'
write(stdout, '(1x,a)') trim(error_msg)
close(stdout)
stop "wannier90 error: examine the output/error file for details"

So it calls stop as you assumed.
Post by Noam Bernstein
Presumably someone here can comment on what the standard says about the validity of terminating without mpi_abort.
Well, probably stop is not a good way to terminate then.
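
If I read it correctly, an MPI-aware termination would look roughly like this
(only a sketch; it assumes mpif.h and an ierr variable are available at that
point in io.F90, which I have not checked):

      write(stdout,*) 'Exiting.......'
      write(stdout, '(1x,a)') trim(error_msg)
      close(stdout)
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)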

My main point was the change relative to 1.10 anyway :-)
Post by Noam Bernstein
Actually, if you’re willing to share enough input files to reproduce, I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was apparently addressed by some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case.
I think that is no longer necessary? In principle it is no problem, but
it is at the end of a (small) GW calculation, the Si tutorial example.
So the mail would be a bit larger due to the WAVECAR.
r***@open-mpi.org
2016-12-07 17:47:48 UTC
Hi Christof

Sorry if I missed this, but it sounds like you are saying that one of your procs abnormally terminates, and we are failing to kill the remaining job? Is that correct?

If so, I just did some work that might relate to that problem; it is pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528

Would you be able to try that?

Ralph
Christof Koehler
2016-12-08 10:40:32 UTC
Hello everybody,

I tried it with the nightly and with the 2.0.2 branch directly from git, which
according to the log should contain that patch:

commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres <***@users.noreply.github.com>
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals

Unfortunately it changes nothing. The root rank stops, and all other
ranks (and mpirun) just stay, the remaining ranks at 100 % CPU apparently
waiting in that allreduce. The stack trace looks a bit more
interesting (is a git build always a debug build?), so I include it at the
very bottom just in case.

Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful? I need a
moment to figure out how to do this, but I can definitely try.
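
As far as I understand, that would be something along these lines after
attaching gdb to one of the ranks (a sketch, I still have to try it):

gdb -p <pid of rank>
(gdb) break exit
(gdb) break _exit
(gdb) continue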

Some remark: During "make install" from the git repo I see a

WARNING! Common symbols found:
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte

I have never noticed this before.


Best Regards

Christof

Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
#0 0x00002af84e4c669d in poll () from /lib64/libc.so.6
#1 0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2 0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3 0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4 0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5 0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
#6 0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
#7 0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
#8 0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
#9 0x000000000045ecc6 in m_sum_i_ ()
#10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x00000000004325ff in vamp () at main.F:2640
#12 0x000000000040de1e in main ()
#13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x000000000040dd29 in _start ()
Gilles Gouaillardet
2016-12-08 11:05:44 UTC
Christof,


There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

in Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you are using.
pmap <pid> will show you which lib is actually used.
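For example, with the pid of one of the hanging vasp tasks:

pmap <pid> | grep -E 'libmpi|libopen-pal'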

btw, this was not started with
mpirun --mca coll ^tuned ...
right?

just to make it clear ...
a task from your program bluntly issues a Fortran STOP, and this is kind of
a feature.
the *only* issue is that mpirun does not kill the other MPI tasks and mpirun
never completes.
did I get it right?

Cheers,

Gilles

On Thursday, December 8, 2016, Christof Koehler <
Post by Christof Koehler
Hello everybody,
I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch
commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals
Unfortunately it changes nothing. The root rank stops and all other
ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
apparently in that allreduce. The stack trace looks a bit more
interesting (git is always debug build ?), so I include it at the very
bottom just in case.
Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful ? I need a
moment to figure out how to do this, but I can definitively try.
Some remark: During "make install" from the git repo I see a
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
mpi-f08-types.o: 0000000000000004 C
ompi_f08_mpi_2double_precision
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
I have never noticed this before.
Best Regards
Christof
#0 0x00002af84e4c669d in poll () from /lib64/libc.so.6
#1 0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/
intel2016/lib/libopen-pal.so.20
#2 0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3 0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4 0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5 0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0,
rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1,
module=0xdee69e0) at base/coll_base_allreduce.c:225
#6 0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
(sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0,
comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
#7 0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
#8 0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0,
datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at
pallreduce_f.c:87
#9 0x000000000045ecc6 in m_sum_i_ ()
#10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x00000000004325ff in vamp () at main.F:2640
#12 0x000000000040de1e in main ()
#13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x000000000040dd29 in _start ()
Post by r***@open-mpi.org
Hi Christof
Sorry if I missed this, but it sounds like you are saying that one of
your procs abnormally terminates, and we are failing to kill the remaining
job? Is that correct?
Post by r***@open-mpi.org
If so, I just did some work that might relate to that problem that is
pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528 <
https://github.com/open-mpi/ompi/pull/2528>
Post by r***@open-mpi.org
Would you be able to try that?
Ralph
On Dec 7, 2016, at 9:37 AM, Christof Koehler <
Hello,
Post by Noam Bernstein
On Dec 7, 2016, at 10:07 AM, Christof Koehler <
I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what
I
Post by r***@open-mpi.org
Post by Noam Bernstein
see. Would you have any suggestion to catch signals sent between
orterun
Post by r***@open-mpi.org
Post by Noam Bernstein
(mpirun) and the child tasks ?
Do you know where in the code the termination call is? Is it
actually calling mpi_abort(), or just doing something ugly like calling
fortran “stop”? If the latter, would that explain a possible hang?
Post by r***@open-mpi.org
Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
wannier90 input contains an error: a restart is requested, but the
wannier90.chk file with the restart information is missing.
"
Exiting.......
Error: restart requested but wannier90.chk file not found
"
So it must terminate.
write(stdout,*) 'Exiting.......'
write(stdout, '(1x,a)') trim(error_msg)
close(stdout)
stop "wannier90 error: examine the output/error file for details"
So it calls stop as you assumed.
Post by Noam Bernstein
Presumably someone here can comment on what the standard says about
the validity of terminating without mpi_abort.
Post by r***@open-mpi.org
Well, probably stop is not a good way to terminate then.
My main point was the change relative to 1.10 anyway :-)
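For illustration, a minimal sketch (assumed routine name, not the actual
wannier90 code) of what an MPI-aware error handler could do instead of a
bare STOP:

! Hypothetical sketch, assumed name - not the actual wannier90 routine.
subroutine wannier_error_abort(error_msg)
  use mpi
  implicit none
  character(len=*), intent(in) :: error_msg
  integer :: ierr
  write(*,'(1x,a)') trim(error_msg)
  ! MPI_Abort tears down every rank of the job, so mpirun can return
  ! instead of waiting for ranks that are stuck in MPI_Allreduce.
  call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
end subroutine wannier_error_abort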
Post by Noam Bernstein
Actually, if you’re willing to share enough input files to reproduce,
I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix
a crash that was apparently addressed by some change in the memory
allocator in a recent version of openmpi. Just e-mail me if that’s the
case.
Post by r***@open-mpi.org
I think that is no longer necessary ? In principle it is no problem, but
it is at the end of a (small) GW calculation, the Si tutorial example.
So the mail would be a bit larger due to the WAVECAR.
Noam
Post by r***@open-mpi.org
Post by Noam Bernstein
____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
--
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen
PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Christof Koehler
2016-12-08 11:39:06 UTC
Permalink
Christof Koehler
2016-12-08 13:18:15 UTC
Permalink
Hello again,

I am still not sure about breakpoints. But I did a "catch signal" in
gdb; gdb sessions were attached to the two vasp processes and to mpirun.

When the root rank exits I see in the gdb attached to it
[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior 1 (process 2455) exited normally]

In the gdb attached to the mpirun
Catchpoint 1 (signal SIGCHLD), 0x00002b16560f769d in poll () from
/lib64/libc.so.6

In the gdb attached to the second rank I see no output.

Issuing "continue" in the gdb session attached to mpirun does not lead
to anything new as far as I can tell.

The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
#0 0x00002b16560f769d in poll () from /lib64/libc.so.6
#1 0x00002b1654b3a496 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2 0x00002b1654b32fa5 in opal_libevent2022_event_base_loop () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3 0x0000000000406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at
orterun.c:1071
#4 0x00000000004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at
main.c:13

So there is a signal and mpirun does nothing with it ?

Cheers

Christof
Post by Christof Koehler
Hello,
Christof,
There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)
Yes, I assumed it was interesting :-) Note that the program is compiled
with -O2 -fp-model source, so optimization is on. I can try with -O0
or with gcc/gfortran (will take a moment) to make sure it is not a
problem caused by that.
in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you use
pmap <pid> will show you which lib is used
The pmap of the survivor is at the very end of this mail.
btw, this was not started with
mpirun --mca coll ^tuned ...
right ?
This is correct, it was not started with "mpirun --mca coll ^tuned". Using it
does not change anything.
just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of
a feature.
Yes. The library where the stack occurs is/was written for serial use as
far as I can tell. As I mentioned, it is not our code but this one
http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which should
be a working combination.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun
never completes.
did i get it right ?
Yes ! So it is not a really big problem IMO. Just a bit nasty if this
happens with a job in the queueing system.
Best Regards
Christof
Note: git branch 2.0.2 of openmpi was configured and installed (make
install) with
./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
--with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
--prefix=/cluster/mpi/openmpi/2.0.2/intel2016
The OS is Centos 7, relatively current :-) with current Omni-Path driver
package from Intel (10.2).
vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
(trunk 206) and FFTW 3.3.5. FFTW and scalapack are statically linked, and of
course the libwannier.a version 1.2 is statically linked as well.
pmap -p of the survivor
32282: /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
0000000000400000 65200K r-x-- /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000045ab000 100K r---- /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000045c4000 2244K rw--- /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000047f5000 100900K rw--- [ anon ]
000000000bfaa000 684K rw--- [ anon ]
000000000c055000 20K rw--- [ anon ]
000000000c05a000 424K rw--- [ anon ]
000000000c0c4000 68K rw--- [ anon ]
000000000c0d5000 25384K rw--- [ anon ]
00002b17e34f6000 132K r-x-- /usr/lib64/ld-2.17.so
00002b17e3517000 4K rw--- [ anon ]
00002b17e3518000 28K rw-s- /dev/infiniband/uverbs0
00002b17e3523000 88K rw--- [ anon ]
00002b17e3539000 772K rw-s- /dev/infiniband/uverbs0
00002b17e35fa000 772K rw-s- /dev/infiniband/uverbs0
00002b17e36bb000 196K rw-s- /dev/infiniband/uverbs0
00002b17e36ec000 28K rw-s- /dev/infiniband/uverbs0
00002b17e36f3000 20K rw-s- /dev/infiniband/uverbs0
00002b17e3717000 4K r---- /usr/lib64/ld-2.17.so
00002b17e3718000 4K rw--- /usr/lib64/ld-2.17.so
00002b17e3719000 4K rw--- [ anon ]
00002b17e371a000 88K r-x-- /usr/lib64/libpthread-2.17.so
00002b17e3730000 2048K ----- /usr/lib64/libpthread-2.17.so
00002b17e3930000 4K r---- /usr/lib64/libpthread-2.17.so
00002b17e3931000 4K rw--- /usr/lib64/libpthread-2.17.so
00002b17e3932000 16K rw--- [ anon ]
00002b17e3936000 1028K r-x-- /usr/lib64/libm-2.17.so
00002b17e3a37000 2044K ----- /usr/lib64/libm-2.17.so
00002b17e3c36000 4K r---- /usr/lib64/libm-2.17.so
00002b17e3c37000 4K rw--- /usr/lib64/libm-2.17.so
00002b17e3c38000 12K r-x-- /usr/lib64/libdl-2.17.so
00002b17e3c3b000 2044K ----- /usr/lib64/libdl-2.17.so
00002b17e3e3a000 4K r---- /usr/lib64/libdl-2.17.so
00002b17e3e3b000 4K rw--- /usr/lib64/libdl-2.17.so
00002b17e3e3c000 184K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e3e6a000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e4069000 4K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e406a000 4K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e406b000 36K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4074000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4273000 4K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4274000 4K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4275000 396K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e42d8000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d7000 4K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d8000 4K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d9000 1948K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e46c0000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48bf000 12K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48c2000 104K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48dc000 76K rw--- [ anon ]
00002b17e48ef000 948K r-x-- /usr/lib64/libc-2.17.so
00002b17e49dc000 4K r-x-- /usr/lib64/libc-2.17.so
00002b17e49dd000 12K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e0000 4K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e1000 20K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e6000 8K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e8000 760K r-x-- /usr/lib64/libc-2.17.so
00002b17e4aa6000 2048K ----- /usr/lib64/libc-2.17.so
00002b17e4ca6000 16K r---- /usr/lib64/libc-2.17.so
00002b17e4caa000 8K rw--- /usr/lib64/libc-2.17.so
00002b17e4cac000 20K rw--- [ anon ]
00002b17e4cb1000 84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4cc6000 2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec5000 4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec6000 4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec7000 452K r-x-- /usr/lib64/libpsm2.so.2.1
00002b17e4f38000 2044K ----- /usr/lib64/libpsm2.so.2.1
00002b17e5137000 4K r---- /usr/lib64/libpsm2.so.2.1
00002b17e5138000 8K rw--- /usr/lib64/libpsm2.so.2.1
00002b17e513a000 4K rw--- [ anon ]
00002b17e513b000 1344K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e528b000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e548a000 8K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e548c000 44K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e5497000 12K rw--- [ anon ]
00002b17e549a000 480K r-x-- /usr/lib64/libtorque.so.2.0.0
00002b17e5512000 2044K ----- /usr/lib64/libtorque.so.2.0.0
00002b17e5711000 8K r---- /usr/lib64/libtorque.so.2.0.0
00002b17e5713000 8K rw--- /usr/lib64/libtorque.so.2.0.0
00002b17e5715000 6704K rw--- [ anon ]
00002b17e5da1000 1404K r-x-- /usr/lib64/libxml2.so.2.9.1
00002b17e5f00000 2044K ----- /usr/lib64/libxml2.so.2.9.1
00002b17e60ff000 32K r---- /usr/lib64/libxml2.so.2.9.1
00002b17e6107000 8K rw--- /usr/lib64/libxml2.so.2.9.1
00002b17e6109000 8K rw--- [ anon ]
00002b17e610b000 84K r-x-- /usr/lib64/libz.so.1.2.7
00002b17e6120000 2044K ----- /usr/lib64/libz.so.1.2.7
00002b17e631f000 4K r---- /usr/lib64/libz.so.1.2.7
00002b17e6320000 4K rw--- /usr/lib64/libz.so.1.2.7
00002b17e6321000 1784K r-x-- /usr/lib64/libcrypto.so.1.0.1e
00002b17e64df000 2048K ----- /usr/lib64/libcrypto.so.1.0.1e
00002b17e66df000 104K r---- /usr/lib64/libcrypto.so.1.0.1e
00002b17e66f9000 48K rw--- /usr/lib64/libcrypto.so.1.0.1e
00002b17e6705000 16K rw--- [ anon ]
00002b17e6709000 396K r-x-- /usr/lib64/libssl.so.1.0.1e
00002b17e676c000 2044K ----- /usr/lib64/libssl.so.1.0.1e
00002b17e696b000 16K r---- /usr/lib64/libssl.so.1.0.1e
00002b17e696f000 28K rw--- /usr/lib64/libssl.so.1.0.1e
00002b17e6976000 1572K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6aff000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6cfe000 20K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6d03000 56K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6d11000 552K rw--- [ anon ]
00002b17e6d9b000 84K r-x-- /usr/lib64/librdmacm.so.1.0.0
00002b17e6db0000 2044K ----- /usr/lib64/librdmacm.so.1.0.0
00002b17e6faf000 4K r---- /usr/lib64/librdmacm.so.1.0.0
00002b17e6fb0000 4K rw--- /usr/lib64/librdmacm.so.1.0.0
00002b17e6fb1000 4K rw--- [ anon ]
00002b17e6fb2000 68K r-x-- /usr/lib64/libibverbs.so.1.0.0
00002b17e6fc3000 2044K ----- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c2000 4K r---- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c3000 4K rw--- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c4000 40K r-x-- /usr/lib64/libnuma.so.1
00002b17e71ce000 2048K ----- /usr/lib64/libnuma.so.1
00002b17e73ce000 4K r---- /usr/lib64/libnuma.so.1
00002b17e73cf000 4K rw--- /usr/lib64/libnuma.so.1
00002b17e73d0000 32K r-x-- /usr/lib64/libpciaccess.so.0.11.1
00002b17e73d8000 2048K ----- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75d8000 4K r---- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75d9000 4K rw--- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75da000 28K r-x-- /usr/lib64/librt-2.17.so
00002b17e75e1000 2044K ----- /usr/lib64/librt-2.17.so
00002b17e77e0000 4K r---- /usr/lib64/librt-2.17.so
00002b17e77e1000 4K rw--- /usr/lib64/librt-2.17.so
00002b17e77e2000 8K r-x-- /usr/lib64/libutil-2.17.so
00002b17e77e4000 2044K ----- /usr/lib64/libutil-2.17.so
00002b17e79e3000 4K r---- /usr/lib64/libutil-2.17.so
00002b17e79e4000 4K rw--- /usr/lib64/libutil-2.17.so
00002b17e79e5000 152K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7a0b000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0a000 4K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0b000 8K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0d000 24K rw--- [ anon ]
00002b17e7c13000 1288K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7d55000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f54000 12K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f57000 12K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f5a000 116K rw--- [ anon ]
00002b17e7f77000 2696K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8219000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8418000 24K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e841e000 340K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8473000 420K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e84dc000 2048K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86dc000 4K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86dd000 4K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86de000 4K rw--- [ anon ]
00002b17e86df000 13124K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e93b0000 2048K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95b0000 220K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95e7000 20K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95ec000 1304K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9732000 2048K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9932000 12K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9935000 12K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9938000 296K rw--- [ anon ]
00002b17e9982000 1464K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9af0000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cef000 4K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cf0000 16K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cf4000 16K r-x-- /usr/lib64/libuuid.so.1.3.0
00002b17e9cf8000 2044K ----- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef7000 4K r---- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef8000 4K rw--- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef9000 932K r-x-- /usr/lib64/libstdc++.so.6.0.19
00002b17e9fe2000 2048K ----- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1e2000 32K r---- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1ea000 8K rw--- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1ec000 84K rw--- [ anon ]
00002b17ea201000 144K r-x-- /usr/lib64/liblzma.so.5.0.99
00002b17ea225000 2044K ----- /usr/lib64/liblzma.so.5.0.99
00002b17ea424000 4K r---- /usr/lib64/liblzma.so.5.0.99
00002b17ea425000 4K rw--- /usr/lib64/liblzma.so.5.0.99
00002b17ea426000 292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea46f000 2048K ----- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea66f000 4K r---- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea670000 8K rw--- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea672000 852K r-x-- /usr/lib64/libkrb5.so.3.3
00002b17ea747000 2048K ----- /usr/lib64/libkrb5.so.3.3
00002b17ea947000 52K r---- /usr/lib64/libkrb5.so.3.3
00002b17ea954000 12K rw--- /usr/lib64/libkrb5.so.3.3
00002b17ea957000 12K r-x-- /usr/lib64/libcom_err.so.2.1
00002b17ea95a000 2044K ----- /usr/lib64/libcom_err.so.2.1
00002b17eab59000 4K r---- /usr/lib64/libcom_err.so.2.1
00002b17eab5a000 4K rw--- /usr/lib64/libcom_err.so.2.1
00002b17eab5b000 188K r-x-- /usr/lib64/libk5crypto.so.3.1
00002b17eab8a000 2044K ----- /usr/lib64/libk5crypto.so.3.1
00002b17ead89000 8K r---- /usr/lib64/libk5crypto.so.3.1
00002b17ead8b000 4K rw--- /usr/lib64/libk5crypto.so.3.1
00002b17ead8c000 4K rw--- [ anon ]
00002b17ead8d000 284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eadd4000 2044K ----- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafd3000 12K r---- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafd6000 16K rw--- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafda000 8K rw--- [ anon ]
00002b17eafdc000 104K r-x-- /usr/lib64/libnl-3.so.200.16.1
00002b17eaff6000 2044K ----- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f5000 8K r---- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f7000 4K rw--- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f8000 52K r-x-- /usr/lib64/libkrb5support.so.0.1
00002b17eb205000 2048K ----- /usr/lib64/libkrb5support.so.0.1
00002b17eb405000 4K r---- /usr/lib64/libkrb5support.so.0.1
00002b17eb406000 4K rw--- /usr/lib64/libkrb5support.so.0.1
00002b17eb407000 12K r-x-- /usr/lib64/libkeyutils.so.1.5
00002b17eb40a000 2044K ----- /usr/lib64/libkeyutils.so.1.5
00002b17eb609000 4K r---- /usr/lib64/libkeyutils.so.1.5
00002b17eb60a000 4K rw--- /usr/lib64/libkeyutils.so.1.5
00002b17eb60b000 88K r-x-- /usr/lib64/libresolv-2.17.so
00002b17eb621000 2048K ----- /usr/lib64/libresolv-2.17.so
00002b17eb821000 4K r---- /usr/lib64/libresolv-2.17.so
00002b17eb822000 4K rw--- /usr/lib64/libresolv-2.17.so
00002b17eb823000 8K rw--- [ anon ]
00002b17eb825000 132K r-x-- /usr/lib64/libselinux.so.1
00002b17eb846000 2048K ----- /usr/lib64/libselinux.so.1
00002b17eba46000 4K r---- /usr/lib64/libselinux.so.1
00002b17eba47000 4K rw--- /usr/lib64/libselinux.so.1
00002b17eba48000 8K rw--- [ anon ]
00002b17eba4a000 384K r-x-- /usr/lib64/libpcre.so.1.2.0
00002b17ebaaa000 2044K ----- /usr/lib64/libpcre.so.1.2.0
00002b17ebca9000 4K r---- /usr/lib64/libpcre.so.1.2.0
00002b17ebcaa000 4K rw--- /usr/lib64/libpcre.so.1.2.0
00002b17ebcab000 4K ----- [ anon ]
00002b17ebcac000 3352K rw--- [ anon ]
00002b17ec000000 132K rw--- [ anon ]
00002b17ec021000 65404K ----- [ anon ]
00002b17f0000000 4K ----- [ anon ]
00002b17f0001000 2048K rw--- [ anon ]
00002b17f0201000 16K r-x-- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0205000 2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0404000 4K r---- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0405000 4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0406000 4K rw--- [ anon ]
00002b17f0407000 4096K rw--- [ anon ]
00002b17f0807000 1032K rw--- [ anon ]
00002b17f0d0a000 4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200
00002b17f112d000 132K rw--- [ anon ]
00002b17f114e000 4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 (deleted)
00002b17f1571000 8628K rw--- [ anon ]
00002b17f4000000 132K rw--- [ anon ]
00002b17f4021000 65404K ----- [ anon ]
00002b17f9e85000 9164K rw--- [ anon ]
00007ffd8b021000 31316K rw--- [ stack ]
00007ffd8cfa4000 8K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 539352K
Cheers,
Gilles
On Thursday, December 8, 2016, Christof Koehler <
Post by Christof Koehler
Hello everybody,
I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch
commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals
Unfortunately it changes nothing. The root rank stops and all other
ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
apparently in that allreduce. The stack trace looks a bit more
interesting (git is always debug build ?), so I include it at the very
bottom just in case.
Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful ? I need a
moment to figure out how to do this, but I can definitively try.
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
r***@open-mpi.org
2016-12-08 15:48:17 UTC
Permalink
To the best I can determine, mpirun catches SIGTERM just fine and will hit the procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait to see the remote daemons complete after they hit their procs with the same sequence.
Noam Bernstein
2016-12-08 20:15:47 UTC
Permalink
Post by Gilles Gouaillardet
Christof,
There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)
in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not using the library you think you use
pmap <pid> will show you which lib is used
btw, this was not started with
mpirun --mca coll ^tuned ...
right ?
just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of a feature.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun never completes.
did i get it right ?
I just ran across very similar behavior in VASP (which we just switched over to openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call one, others call the other), and I discovered several interesting things.

The most important is that when MPI is active, the preprocessor converts (via a #define in symbol.inc) fortran STOP into calls to m_exit() (defined in mpi.F), which is a wrapper around mpi_finalize. So in my case some processes in the communicator call mpi_finalize, others call mpi_allreduce. I’m not really surprised this hangs, because I think the correct thing to replace STOP with is mpi_abort, not mpi_finalize. If you know where the STOP is called, you can check the preprocessed equivalent file (.f90 instead of .F), and see if it’s actually been replaced with a call to m_exit. I’m planning to test whether replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e. program termination when the original source file executes a STOP.
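To make that concrete, here is a rough illustration (assumed spellings, not the literal symbol.inc or mpi.F contents) of the substitution described above:

! Illustration only, with assumed spellings - not VASP's actual code.
! symbol.inc-style preprocessing rewrites a Fortran STOP into something like
!   #define STOP call m_exit(); stop
! so the "stopping" rank ends up in MPI_Finalize:
subroutine m_exit_like()
  use mpi
  implicit none
  integer :: ierr
  call MPI_Finalize(ierr)   ! peers still inside MPI_Allreduce never return
  stop
end subroutine m_exit_like
! An m_stop-style replacement would instead call
! MPI_Abort(MPI_COMM_WORLD, 1, ierr), terminating every rank at once.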

I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to hang, but just in case that’s surprising, here are my stack traces:


hung in collective:

(gdb) where
#0 0x00002b8d5a095ec6 in opal_progress () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1 0x00002b8d59b3a36d in ompi_request_default_wait_all () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2 0x00002b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3 0x00002b8d59b495ac in PMPI_Allreduce () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4 0x00002b8d598e4027 in pmpi_allreduce__ () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5 0x0000000000414077 in m_sum_i (comm=..., ivec=warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
..., n=2) at mpi.F:989
#6 0x0000000000daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., kpoints_f=...) at mkpoints_full.F:1099
#7 0x0000000001441654 in set_indpw_fock (t_info=..., p=warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1669
#8 fock::setup_fock (t_info=..., p=warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1413
#9 0x0000000002976478 in vamp () at main.F:2093
#10 0x0000000000412f9e in main ()
#11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000412ea9 in _start ()

hung in mpi_finalize:

#0 0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
#1 0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
#2 0x00002b11db1e0ae7 in ompi_mpi_finalize () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3 0x00002b11daf8b399 in pmpi_finalize__ () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#4 0x00000000004199c5 in m_exit () at mpi.F:375
#5 0x0000000000dab17f in full_kpoints::set_indpw_full (grid=..., wdes=Cannot resolve DW_OP_push_object_address for a missing object
) at mkpoints_full.F:1065
#6 0x0000000001441654 in set_indpw_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address for a missing object
) at fock.F:1669
#7 fock::setup_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address for a missing object
) at fock.F:1413
#8 0x0000000002976478 in vamp () at main.F:2093
#9 0x0000000000412f9e in main ()
#10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000412ea9 in _start ()

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Gilles Gouaillardet
2016-12-09 08:38:31 UTC
Permalink
Folks,

the problem is indeed pretty trivial to reproduce:
I opened https://github.com/open-mpi/ompi/issues/2550 (and included a reproducer).
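
In spirit it is just the mismatch described earlier in this thread: one rank leaves while the others sit in a collective. A minimal sketch of that pattern (a sketch only, not the exact program attached to the issue):

program mismatch
  use mpi
  implicit none
  integer :: rank, ierr, val
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
     ! rank 0 leaves early, like the m_exit()/STOP path discussed above
     call MPI_Finalize(ierr)
     stop
  end if
  ! the remaining ranks enter the collective and wait forever
  val = rank
  call MPI_Allreduce(MPI_IN_PLACE, val, 1, MPI_INTEGER, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program mismatch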

Cheers,

Gilles
Christof Koehler
2016-12-09 08:39:53 UTC
Permalink
Hello,

In our case the libwannier.a is a "third party" library which is built separately and then just linked in, so the vasp preprocessor never touches it. As far as I can see, no preprocessing of the f90 source is involved in the libwannier build process.

I finally managed to set a breakpoint at the program exit of the root rank:

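Roughly like this (the exact gdb commands below are reconstructed, not copied from the session):

  gdb -p <pid of root rank>
  (gdb) break _exit
  (gdb) continue
  ... and once the breakpoint in _exit is hit: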
(gdb) bt
#0 0x00002b7ccd2e4220 in _exit () from /lib64/libc.so.6
#1 0x00002b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
#2 0x00002b7ccd25eeb5 in exit () from /lib64/libc.so.6
#3 0x000000000407298d in for_stop_core ()
#4 0x00000000012fad41 in w90_io_mp_io_error_ ()
#5 0x0000000001302147 in w90_parameters_mp_param_read_ ()
#6 0x00000000012f49c6 in wannier_setup_ ()
#7 0x0000000000e166a8 in mlwf_mp_mlwf_wannier90_ ()
#8 0x00000000004319ff in vamp () at main.F:2640
#9 0x000000000040d21e in main ()
#10 0x00002b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
#11 0x000000000040d129 in _start ()

So for_stop_core is apparently called ? Of course it is below the main()
process of vasp, so additional things might happen which are not
visible. Is SIGCHLD (as observed when catching signals in mpirun) the
signal expected after a for_stop_core ?

Thank you very much for investigating this !

Cheers
Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Noam Bernstein
2016-12-09 13:19:27 UTC
Permalink
Looks like my case really was just VASP's fault, and I'd call it a VASP bug (you shouldn't call mpi_finalize from only a subset of the tasks). Yours is similar, but not actually the same, since it is actually trying to stop the task, and one would at least hope that OpenMPI could detect that and exit.

Noam

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Gilles Gouaillardet
2016-12-12 00:32:25 UTC
Permalink
Christof,

Ralph fixed the issue; meanwhile, the patch can be manually downloaded at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
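
Applying it follows the usual routine, roughly (adjust the path to your own Open MPI 2.0.1 source tree; this is only a sketch of the standard procedure, untested here):

  cd /path/to/openmpi-2.0.1-source
  wget https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
  patch -p1 < 2552.patch
  make && make install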

Cheers,

Gilles
Noam Bernstein
2016-12-07 22:28:49 UTC
Permalink
Post by Christof Koehler
Post by Noam Bernstein
Presumably someone here can comment on what the standard says about the validity of terminating without mpi_abort.
Well, probably stop is not a good way to terminate then.
My main point was the change relative to 1.10 anyway :-)
It’s definitely not the clean way to terminate, but I think everyone agrees that it shouldn’t hang if it can be avoided.
Post by Christof Koehler
Post by Noam Bernstein
Actually, if you’re willing to share enough input files to reproduce, I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was apparently addressed by some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case.
I think that is no longer necessary ? In principle it is no problem, but
it is at the end of a (small) GW calculation, the Si tutorial example.
So the mail would be a bit larger due to the WAVECAR.
I agree. It sounds like it's clearly a failure to exit from a collective communication when a process dies (from the point of view of mpi, since mpi_abort is not being called, it's just a process dying). Maybe the patch in Ralph's e-mail fixes it.

Noam

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil