Discussion:
[OMPI users] Abort/ Deadlock issue in allreduce
Christof Koehler
2016-12-07 10:38:00 UTC
Hello everybody,

I am observing a deadlock in allreduce with openmpi 2.0.1 on a single
node. A stack trace (pstack) of one rank is below, showing the program (vasp
5.3.5) and the two psm2 progress threads. However:

In fact, the vasp input is not ok and the program should abort at the point
where it now hangs; it does so when using mvapich 2.2. With openmpi 2.0.1 it
just deadlocks in some allreduce operation. Originally it was started with 20
ranks; when it hangs there are only 19 left. From the PIDs I would
assume it is the master rank which is missing. So this looks like a
failure to terminate.

With 1.10 I get a clean
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 18789 on node node109
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any ideas what to try? Of course, in this situation it may well be the
program itself. Still, given the observed difference between 2.0.1 and 1.10
(and mvapich), this might be interesting to someone.

Best Regards

Christof


Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
#0 0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
#1 0x00002ad35d114f42 in epoll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002ad35d16e996 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00002ad35b155ced in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
#0 0x00002ad35b14b69d in poll () from /lib64/libc.so.6
#1 0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002ad35d0c61d1 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00002ad35b155ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
#0 0x00002ad35b14b69d in poll () from /lib64/libc.so.6
#1 0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002ad35d0c28cf in opal_progress () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002ad35adce8d8 in ompi_request_wait_completion () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#5 0x00002ad35adce838 in mca_pml_cm_recv () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#6 0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#7 0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#8 0x00002ad35ad1f0f4 in PMPI_Allreduce () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#9 0x00002ad35aa99c38 in pmpi_allreduce__ () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
#10 0x000000000045f8c6 in m_sum_i_ ()
#11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
#12 0x00000000004331ff in vamp () at main.F:2640
#13 0x000000000040ea1e in main ()
#14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
#15 0x000000000040e929 in _start ()
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Gilles Gouaillardet
2016-12-07 11:23:43 UTC
Christoph,

can you please try again with

mpirun --mca btl tcp,self --mca pml ob1 ...

that will help figure out whether pml/cm and/or mtl/psm2 is involved or not.


if that causes a crash, then can you please try

mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

that will help figure out whether coll/tuned is involved or not.

coll/tuned is known not to correctly handle collectives invoked with different
but matching signatures
(e.g. some tasks invoke the collective with one vector of N elements,
and some others invoke the same collective with N vectors of one element each)
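
For example, a rough sketch of such a call pattern (not from VASP; the variable
names are made up, and MPI_Bcast is used only to illustrate the idea of
different but matching signatures):

   ! rank 0 describes the buffer as n separate MPI_INTEGER elements
   if (rank == 0) then
      call MPI_Bcast(ivec, n, MPI_INTEGER, 0, comm, ierr)
   else
      ! the other ranks describe the same n integers as one element of a
      ! contiguous derived type -- a different but matching signature
      call MPI_Type_contiguous(n, MPI_INTEGER, vtype, ierr)
      call MPI_Type_commit(vtype, ierr)
      call MPI_Bcast(ivec, 1, vtype, 0, comm, ierr)
   end if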


if everything fails, can you describe how MPI_Allreduce is invoked?
/* number of tasks, datatype, number of elements */



Cheers,

Gilles

Christof Koehler
2016-12-07 13:07:27 UTC
Hello,

thank you for the fast answer.
Post by Gilles Gouaillardet
Christoph,
can you please try again with
mpirun --mca btl tcp,self --mca pml ob1 ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1 /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

Deadlocks/ hangs, has no effect.
Post by Gilles Gouaillardet
mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

Deadlocks/ hangs, has no effect. There is additional output.

wannier90 error: examine the output/error file for details
[node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Please note: The "wannier90 error: examine the output/error file for
details" is expected, there is in fact an error in the input file. It
is supposed to terminate.

However, with mvapich2 and openmpi 1.10.4 it terminates
completely, i.e. I get my shell prompt back. Whether a segfault is involved with
mvapich2 (as is apparently the case with openmpi 1.10.4, based on the
termination message) I do not know. I tried

export MV2_DEBUG_SHOW_BACKTRACE=1
mpirun -n 20 /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi

but did not get any indication of a problem (segfault), the last lines
are

calculate QP shifts <psi_nk| G(iteration)W_0 |psi_nk>: iteration 1
writing wavefunctions
wannier90 error: examine the output/error file for details
node109 14:00 /scratch/ckoe/gw %

The last line is my shell prompt.
Post by Gilles Gouaillardet
if everything fails, can you describe how MPI_Allreduce is invoked?
/* number of tasks, datatype, number of elements */
Difficult, this is not our code in the first place [1] and the problem
occurs when using an ("officially" supported) third party library [2].

From the stack trace of the hanging process the vasp routine which calls
allreduce is "m_sum_i_". That is in the mpi.F source file. Allreduce is
called as

CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
& MPI_SUM, COMM%MPI_COMM, ierror )

n and ivec(1) are of data type integer. It was originally run with 20 ranks; I
tried 2 ranks now as well and it hangs, too. With one (!) rank

mpirun -n 1 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

I of course get a shell prompt back.

I then started it normally in the shell with 2 ranks,
mpirun -n 2 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
and attached gdb to the rank with the lowest pid (3478). I do not get a prompt
back (it hangs), the second rank 3479 is still at 100 % CPU and mpirun is still a process
I can see with "ps", but gdb says
(gdb) continue <- that is where I attached it !
Continuing.
[Thread 0x2b8366806700 (LWP 3480) exited]
[Thread 0x2b835da1c040 (LWP 3478) exited]
[Inferior 1 (process 3478) exited normally]
(gdb) bt
No stack.

So, as far as gdb is concerned, the rank with the lowest pid (which is
gone while the other rank is still eating CPU time) terminated normally?

I hope this helps. I have only very basic experience with debuggers
(never really needed them) and even less with using them in parallel.
I can try to catch the contents of ivec, but I do not think that would
be helpful? If you need them I can try of course; I have no idea how
large the vector is.
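
If it would help, I could also put a temporary write directly before that call
in mpi.F to see n and the first entries of ivec, something like this (just a
sketch, not the actual VASP code):

   WRITE(*,*) 'm_sum_i: n =', n, ' ivec(1:min(n,4)) =', ivec(1:MIN(n,4))
   CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
  &                    MPI_SUM, COMM%MPI_COMM, ierror )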


Best Regards

Christof

[1] https://www.vasp.at/
[2] http://www.wannier.org/, Old version 1.2
Christof Koehler
2016-12-07 13:17:52 UTC
Hello again,

attaching gdb to mpirun, the back trace when it hangs is:
(gdb) bt
#0 0x00002b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x00002b039e1a9c42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039e1a2751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00000000004056ef in orterun (argc=13, argv=0x7ffef20a79f8) at orterun.c:1057
#4 0x00000000004035a0 in main (argc=13, argv=0x7ffef20a79f8) at main.c:13

Using pstack on mpirun I see several threads, below

Thread 5 (Thread 0x2b03a33b0700 (LWP 11691)):
#0 0x00002b039f743413 in select () from /usr/lib64/libc.so.6
#1 0x00002b039c599979 in listen_thread () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-rte.so.20
#2 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b03a3be9700 (LWP 11692)):
#0 0x00002b039f74c2c3 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00002b039e1a0f42 in epoll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039e1a2751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00002b039e1fa996 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b03a3dea700 (LWP 11693)):
#0 0x00002b039f743413 in select () from /usr/lib64/libc.so.6
#1 0x00002b039e1f3a5f in listen_thread () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b03a3feb700 (LWP 11694)):
#0 0x00002b039f743413 in select () from /usr/lib64/libc.so.6
#1 0x00002b039c55616b in listen_thread_fn () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-rte.so.20
#2 0x00002b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b039c324100 (LWP 11690)):
#0 0x00002b039f74169d in poll () from /usr/lib64/libc.so.6
#1 0x00002b039e1a9c42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2 0x00002b039e1a2751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3 0x00000000004056ef in orterun (argc=13, argv=0x7ffef20a79f8) at orterun.c:1057
#4 0x00000000004035a0 in main (argc=13, argv=0x7ffef20a79f8) at main.c:13

Best Regards

Christof
Christof Koehler
2016-12-07 15:07:46 UTC
Hello,
Christof,
out of curiosity, can you run
dmesg
and see if you find some tasks killed by the oom-killer ?
Definitively not the oom-killer. It is a real tiny example. I checked
the machines logfile and dmesg.
the error message you see is a consequence of a task that unexpectedly died.
and there is no evidence the task crashed or was killed.
Yes, confusing, isn't it?
when you observe a hang with two tasks, you can
- retrieve the pids with ps
- run 'pstack <pid>' on both pids in order to collect the stacktrace.
When it hangs, one is already gone! The pstack traces I sent are from the
survivor(s). It is not terminating completely as it should.
assuming they both hang in MPI_Allreduce(), the relevant part to us is
- datatype (MPI_INT)
- count (n)
- communicator (COMM%MPI_COMM) (size, check this is the same communicator
used by all tasks)
- is all the buffer accessible (ivec(1:n))
As I said, the root rank terminates (according to gdb normally). The other
remains and hangs in allreduce. Possibly because its partner (the
root rank) is gone without saying goodbye properly.

This is not a real hang IMO, but a failure to terminate all ranks cleanly.

I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?

I will try to get the information you want, but I will have to figure
out how to do that first.
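
One thing I could try is to attach strace to mpirun and to the surviving rank
and trace only signals, along these lines (a sketch, with the pids filled in
by hand):

strace -f -tt -e trace=signal -p <pid of mpirun>
strace -tt -e trace=signal -p <pid of surviving rank>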

Cheers

Christof
Noam Bernstein
2016-12-07 15:19:10 UTC
Post by Christof Koehler
I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?
Do you know where in the code the termination call is? Is it actually calling mpi_abort(), or just doing something ugly like calling fortran “stop”? If the latter, would that explain a possible hang?

Presumably someone here can comment on what the standard says about the validity of terminating without mpi_abort.

Actually, if you’re willing to share enough input files to reproduce, I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was apparently addressed by some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case.

Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Christof Koehler
2016-12-07 17:37:47 UTC
Hello,
Post by Noam Bernstein
Post by Christof Koehler
I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?
Do you know where in the code the termination call is? Is it actually calling mpi_abort(), or just doing something ugly like calling fortran “stop”? If the latter, would that explain a possible hang?
Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90 input contains
an error: a restart is requested, but the wannier90.chk file with the restart
information is missing.
"
Exiting.......
Error: restart requested but wannier90.chk file not found
"
So it must terminate.

The termination happens in libwannier.a, in the source file io.F90:

write(stdout,*) 'Exiting.......'
write(stdout, '(1x,a)') trim(error_msg)
close(stdout)
stop "wannier90 error: examine the output/error file for details"

So it calls stop as you assumed.
Post by Noam Bernstein
Presumably someone here can comment on what the standard says about the validity of terminating without mpi_abort.
Well, probably stop is not a good way to terminate then.
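
If I read it correctly, an MPI-aware termination would look roughly like this
(only a sketch; it assumes mpif.h and an ierr variable are available at that
point in io.F90, which I have not checked):

      write(stdout,*) 'Exiting.......'
      write(stdout, '(1x,a)') trim(error_msg)
      close(stdout)
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)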

My main point was the change relative to 1.10 anyway :-)
Post by Noam Bernstein
Actually, if you’re willing to share enough input files to reproduce, I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was apparently addressed by some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case.
I think that is no longer necessary? In principle it is no problem, but
it is at the end of a (small) GW calculation, the Si tutorial example.
So the mail would be a bit larger due to the WAVECAR.
r***@open-mpi.org
2016-12-07 17:47:48 UTC
Hi Christof

Sorry if I missed this, but it sounds like you are saying that one of your procs abnormally terminates, and we are failing to kill the remaining job? Is that correct?

If so, I just did some work that might relate to that problem; it is pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528

Would you be able to try that?

Ralph
Christof Koehler
2016-12-08 10:40:32 UTC
Hello everybody,

I tried it with the nightly and with the 2.0.2 branch directly from git, which
according to the log should contain that patch:

commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres <***@users.noreply.github.com>
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals

Unfortunately it changes nothing. The root rank stops, and all other
ranks (and mpirun) just stay, the remaining ranks at 100 % CPU apparently
waiting in that allreduce. The stack trace looks a bit more
interesting (is a git build always a debug build?), so I include it at the
very bottom just in case.

Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful? I need a
moment to figure out how to do this, but I can definitely try.
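
As far as I understand, that would be something along these lines after
attaching gdb to one of the ranks (a sketch, I still have to try it):

gdb -p <pid of rank>
(gdb) break exit
(gdb) break _exit
(gdb) continue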

Some remark: During "make install" from the git repo I see a

WARNING! Common symbols found:
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte

I have never noticed this before.


Best Regards

Christof

Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
#0 0x00002af84e4c669d in poll () from /lib64/libc.so.6
#1 0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2 0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3 0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4 0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5 0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
#6 0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
#7 0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
#8 0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
#9 0x000000000045ecc6 in m_sum_i_ ()
#10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x00000000004325ff in vamp () at main.F:2640
#12 0x000000000040de1e in main ()
#13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x000000000040dd29 in _start ()
Gilles Gouaillardet
2016-12-08 11:05:44 UTC
Christof,


There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

in Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you are using.
pmap <pid> will show you which lib is actually used.
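For example, with the pid of one of the hanging vasp tasks:

pmap <pid> | grep -E 'libmpi|libopen-pal'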

btw, this was not started with
mpirun --mca coll ^tuned ...
right?

just to make it clear ...
a task from your program bluntly issues a Fortran STOP, and this is kind of
a feature.
the *only* issue is that mpirun does not kill the other MPI tasks and mpirun
never completes.
did I get it right?

Cheers,

Gilles

On Thursday, December 8, 2016, Christof Koehler <
Post by Christof Koehler
Hello everybody,
I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch
commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals
Unfortunately it changes nothing. The root rank stops and all other
ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
apparently in that allreduce. The stack trace looks a bit more
interesting (git is always debug build ?), so I include it at the very
bottom just in case.
Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful ? I need a
moment to figure out how to do this, but I can definitively try.
Some remark: During "make install" from the git repo I see a
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
mpi-f08-types.o: 0000000000000004 C
ompi_f08_mpi_2double_precision
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
I have never noticed this before.
Best Regards
Christof
#0 0x00002af84e4c669d in poll () from /lib64/libc.so.6
#1 0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/
intel2016/lib/libopen-pal.so.20
#2 0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3 0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4 0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5 0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xdecbae0,
rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1,
module=0xdee69e0) at base/coll_base_allreduce.c:225
#6 0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
(sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0,
comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
#7 0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
#8 0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0,
datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at
pallreduce_f.c:87
#9 0x000000000045ecc6 in m_sum_i_ ()
#10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x00000000004325ff in vamp () at main.F:2640
#12 0x000000000040de1e in main ()
#13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x000000000040dd29 in _start ()
Post by r***@open-mpi.org
Hi Christof
Sorry if I missed this, but it sounds like you are saying that one of
your procs abnormally terminates, and we are failing to kill the remaining
job? Is that correct?
Post by r***@open-mpi.org
If so, I just did some work that might relate to that problem that is
pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528 <
https://github.com/open-mpi/ompi/pull/2528>
Post by r***@open-mpi.org
Would you be able to try that?
Ralph
On Dec 7, 2016, at 9:37 AM, Christof Koehler <
Hello,
Post by Noam Bernstein
On Dec 7, 2016, at 10:07 AM, Christof Koehler <
I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what
I
Post by r***@open-mpi.org
Post by Noam Bernstein
see. Would you have any suggestion to catch signals sent between
orterun
Post by r***@open-mpi.org
Post by Noam Bernstein
(mpirun) and the child tasks ?
Do you know where in the code the termination call is? Is it
actually calling mpi_abort(), or just doing something ugly like calling
fortran “stop”? If the latter, would that explain a possible hang?
Post by r***@open-mpi.org
Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
wannier90 input contains an error: a restart is requested, but the
wannier90.chk file with the restart information is missing.
"
Exiting.......
Error: restart requested but wannier90.chk file not found
"
So it must terminate.
write(stdout,*) 'Exiting.......'
write(stdout, '(1x,a)') trim(error_msg)
close(stdout)
stop "wannier90 error: examine the output/error file for details"
So it calls stop as you assumed.
Post by Noam Bernstein
Presumably someone here can comment on what the standard says about
the validity of terminating without mpi_abort.
Post by r***@open-mpi.org
Well, probably stop is not a good way to terminate then.
My main point was the change relative to 1.10 anyway :-)
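For illustration, a minimal sketch (assumed routine name, not the actual
wannier90 code) of what an MPI-aware error handler could do instead of a
bare STOP:

! Hypothetical sketch, assumed name - not the actual wannier90 routine.
subroutine wannier_error_abort(error_msg)
  use mpi
  implicit none
  character(len=*), intent(in) :: error_msg
  integer :: ierr
  write(*,'(1x,a)') trim(error_msg)
  ! MPI_Abort tears down every rank of the job, so mpirun can return
  ! instead of waiting for ranks that are stuck in MPI_Allreduce.
  call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
end subroutine wannier_error_abort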
Post by Noam Bernstein
Actually, if you’re willing to share enough input files to reproduce,
I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix
a crash that was apparently addressed by some change in the memory
allocator in a recent version of openmpi. Just e-mail me if that’s the
case.
Post by r***@open-mpi.org
I think that is no longer necessary ? In principle it is no problem, but
it is at the end of a (small) GW calculation, the Si tutorial example.
So the mail would be a bit larger due to the WAVECAR.
Noam
Post by r***@open-mpi.org
Post by Noam Bernstein
____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
--
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen
PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Christof Koehler
2016-12-08 11:39:06 UTC
Permalink
Christof Koehler
2016-12-08 13:18:15 UTC
Permalink
Hello again,

I am still not sure about breakpoints. But I did a "catch signal" in
gdb; gdb sessions were attached to the two vasp processes and to mpirun.

When the root rank exits I see in the gdb attached to it
[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior 1 (process 2455) exited normally]

In the gdb attached to the mpirun
Catchpoint 1 (signal SIGCHLD), 0x00002b16560f769d in poll () from
/lib64/libc.so.6

In the gdb attached to the second rank I see no output.

Issuing "continue" in the gdb session attached to mpirun does not lead
to anything new as far as I can tell.

The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
#0 0x00002b16560f769d in poll () from /lib64/libc.so.6
#1 0x00002b1654b3a496 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2 0x00002b1654b32fa5 in opal_libevent2022_event_base_loop () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3 0x0000000000406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at
orterun.c:1071
#4 0x00000000004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at
main.c:13

So there is a signal and mpirun does nothing with it ?

Cheers

Christof
Post by Christof Koehler
Hello,
Christof,
There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)
Yes, I assumed it was interesting :-) Note that the program is compiled
with -O2 -fp-model source, so optimization is on. I can try with -O0
or with gcc/gfortran (will take a moment) to make sure it is not a
problem caused by that.
in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you use
pmap <pid> will show you which lib is used
The pmap of the survivor is at the very end of this mail.
btw, this was not started with
mpirun --mca coll ^tuned ...
right ?
This is correct, it was not started with "mpirun --mca coll ^tuned". Using it
does not change anything.
just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of
a feature.
Yes. The library where the stack occurs is/was written for serial use as
far as I can tell. As I mentioned, it is not our code but this one
http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which should
be a working combination.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun
never completes.
did i get it right ?
Yes ! So it is not a really big problem IMO. Just a bit nasty if this
happens with a job in the queueing system.
Best Regards
Christof
Note: git branch 2.0.2 of openmpi was configured and installed (make
install) with
./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
--with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
--prefix=/cluster/mpi/openmpi/2.0.2/intel2016
The OS is Centos 7, relatively current :-) with current Omni-Path driver
package from Intel (10.2).
vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
(trunk 206) and FFTW 3.3.5. FFTW and scalapack are statically linked, and of
course the libwannier.a version 1.2 is statically linked as well.
pmap -p of the survivor
32282: /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
0000000000400000 65200K r-x-- /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000045ab000 100K r---- /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000045c4000 2244K rw--- /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
00000000047f5000 100900K rw--- [ anon ]
000000000bfaa000 684K rw--- [ anon ]
000000000c055000 20K rw--- [ anon ]
000000000c05a000 424K rw--- [ anon ]
000000000c0c4000 68K rw--- [ anon ]
000000000c0d5000 25384K rw--- [ anon ]
00002b17e34f6000 132K r-x-- /usr/lib64/ld-2.17.so
00002b17e3517000 4K rw--- [ anon ]
00002b17e3518000 28K rw-s- /dev/infiniband/uverbs0
00002b17e3523000 88K rw--- [ anon ]
00002b17e3539000 772K rw-s- /dev/infiniband/uverbs0
00002b17e35fa000 772K rw-s- /dev/infiniband/uverbs0
00002b17e36bb000 196K rw-s- /dev/infiniband/uverbs0
00002b17e36ec000 28K rw-s- /dev/infiniband/uverbs0
00002b17e36f3000 20K rw-s- /dev/infiniband/uverbs0
00002b17e3717000 4K r---- /usr/lib64/ld-2.17.so
00002b17e3718000 4K rw--- /usr/lib64/ld-2.17.so
00002b17e3719000 4K rw--- [ anon ]
00002b17e371a000 88K r-x-- /usr/lib64/libpthread-2.17.so
00002b17e3730000 2048K ----- /usr/lib64/libpthread-2.17.so
00002b17e3930000 4K r---- /usr/lib64/libpthread-2.17.so
00002b17e3931000 4K rw--- /usr/lib64/libpthread-2.17.so
00002b17e3932000 16K rw--- [ anon ]
00002b17e3936000 1028K r-x-- /usr/lib64/libm-2.17.so
00002b17e3a37000 2044K ----- /usr/lib64/libm-2.17.so
00002b17e3c36000 4K r---- /usr/lib64/libm-2.17.so
00002b17e3c37000 4K rw--- /usr/lib64/libm-2.17.so
00002b17e3c38000 12K r-x-- /usr/lib64/libdl-2.17.so
00002b17e3c3b000 2044K ----- /usr/lib64/libdl-2.17.so
00002b17e3e3a000 4K r---- /usr/lib64/libdl-2.17.so
00002b17e3e3b000 4K rw--- /usr/lib64/libdl-2.17.so
00002b17e3e3c000 184K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e3e6a000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e4069000 4K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e406a000 4K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
00002b17e406b000 36K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4074000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4273000 4K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4274000 4K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
00002b17e4275000 396K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e42d8000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d7000 4K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d8000 4K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
00002b17e44d9000 1948K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e46c0000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48bf000 12K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48c2000 104K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
00002b17e48dc000 76K rw--- [ anon ]
00002b17e48ef000 948K r-x-- /usr/lib64/libc-2.17.so
00002b17e49dc000 4K r-x-- /usr/lib64/libc-2.17.so
00002b17e49dd000 12K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e0000 4K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e1000 20K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e6000 8K r-x-- /usr/lib64/libc-2.17.so
00002b17e49e8000 760K r-x-- /usr/lib64/libc-2.17.so
00002b17e4aa6000 2048K ----- /usr/lib64/libc-2.17.so
00002b17e4ca6000 16K r---- /usr/lib64/libc-2.17.so
00002b17e4caa000 8K rw--- /usr/lib64/libc-2.17.so
00002b17e4cac000 20K rw--- [ anon ]
00002b17e4cb1000 84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4cc6000 2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec5000 4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec6000 4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
00002b17e4ec7000 452K r-x-- /usr/lib64/libpsm2.so.2.1
00002b17e4f38000 2044K ----- /usr/lib64/libpsm2.so.2.1
00002b17e5137000 4K r---- /usr/lib64/libpsm2.so.2.1
00002b17e5138000 8K rw--- /usr/lib64/libpsm2.so.2.1
00002b17e513a000 4K rw--- [ anon ]
00002b17e513b000 1344K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e528b000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e548a000 8K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e548c000 44K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
00002b17e5497000 12K rw--- [ anon ]
00002b17e549a000 480K r-x-- /usr/lib64/libtorque.so.2.0.0
00002b17e5512000 2044K ----- /usr/lib64/libtorque.so.2.0.0
00002b17e5711000 8K r---- /usr/lib64/libtorque.so.2.0.0
00002b17e5713000 8K rw--- /usr/lib64/libtorque.so.2.0.0
00002b17e5715000 6704K rw--- [ anon ]
00002b17e5da1000 1404K r-x-- /usr/lib64/libxml2.so.2.9.1
00002b17e5f00000 2044K ----- /usr/lib64/libxml2.so.2.9.1
00002b17e60ff000 32K r---- /usr/lib64/libxml2.so.2.9.1
00002b17e6107000 8K rw--- /usr/lib64/libxml2.so.2.9.1
00002b17e6109000 8K rw--- [ anon ]
00002b17e610b000 84K r-x-- /usr/lib64/libz.so.1.2.7
00002b17e6120000 2044K ----- /usr/lib64/libz.so.1.2.7
00002b17e631f000 4K r---- /usr/lib64/libz.so.1.2.7
00002b17e6320000 4K rw--- /usr/lib64/libz.so.1.2.7
00002b17e6321000 1784K r-x-- /usr/lib64/libcrypto.so.1.0.1e
00002b17e64df000 2048K ----- /usr/lib64/libcrypto.so.1.0.1e
00002b17e66df000 104K r---- /usr/lib64/libcrypto.so.1.0.1e
00002b17e66f9000 48K rw--- /usr/lib64/libcrypto.so.1.0.1e
00002b17e6705000 16K rw--- [ anon ]
00002b17e6709000 396K r-x-- /usr/lib64/libssl.so.1.0.1e
00002b17e676c000 2044K ----- /usr/lib64/libssl.so.1.0.1e
00002b17e696b000 16K r---- /usr/lib64/libssl.so.1.0.1e
00002b17e696f000 28K rw--- /usr/lib64/libssl.so.1.0.1e
00002b17e6976000 1572K r-x-- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6aff000 2044K ----- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6cfe000 20K r---- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6d03000 56K rw--- /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
00002b17e6d11000 552K rw--- [ anon ]
00002b17e6d9b000 84K r-x-- /usr/lib64/librdmacm.so.1.0.0
00002b17e6db0000 2044K ----- /usr/lib64/librdmacm.so.1.0.0
00002b17e6faf000 4K r---- /usr/lib64/librdmacm.so.1.0.0
00002b17e6fb0000 4K rw--- /usr/lib64/librdmacm.so.1.0.0
00002b17e6fb1000 4K rw--- [ anon ]
00002b17e6fb2000 68K r-x-- /usr/lib64/libibverbs.so.1.0.0
00002b17e6fc3000 2044K ----- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c2000 4K r---- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c3000 4K rw--- /usr/lib64/libibverbs.so.1.0.0
00002b17e71c4000 40K r-x-- /usr/lib64/libnuma.so.1
00002b17e71ce000 2048K ----- /usr/lib64/libnuma.so.1
00002b17e73ce000 4K r---- /usr/lib64/libnuma.so.1
00002b17e73cf000 4K rw--- /usr/lib64/libnuma.so.1
00002b17e73d0000 32K r-x-- /usr/lib64/libpciaccess.so.0.11.1
00002b17e73d8000 2048K ----- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75d8000 4K r---- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75d9000 4K rw--- /usr/lib64/libpciaccess.so.0.11.1
00002b17e75da000 28K r-x-- /usr/lib64/librt-2.17.so
00002b17e75e1000 2044K ----- /usr/lib64/librt-2.17.so
00002b17e77e0000 4K r---- /usr/lib64/librt-2.17.so
00002b17e77e1000 4K rw--- /usr/lib64/librt-2.17.so
00002b17e77e2000 8K r-x-- /usr/lib64/libutil-2.17.so
00002b17e77e4000 2044K ----- /usr/lib64/libutil-2.17.so
00002b17e79e3000 4K r---- /usr/lib64/libutil-2.17.so
00002b17e79e4000 4K rw--- /usr/lib64/libutil-2.17.so
00002b17e79e5000 152K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7a0b000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0a000 4K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0b000 8K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
00002b17e7c0d000 24K rw--- [ anon ]
00002b17e7c13000 1288K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7d55000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f54000 12K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f57000 12K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
00002b17e7f5a000 116K rw--- [ anon ]
00002b17e7f77000 2696K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8219000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8418000 24K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e841e000 340K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
00002b17e8473000 420K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e84dc000 2048K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86dc000 4K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86dd000 4K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
00002b17e86de000 4K rw--- [ anon ]
00002b17e86df000 13124K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e93b0000 2048K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95b0000 220K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95e7000 20K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
00002b17e95ec000 1304K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9732000 2048K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9932000 12K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9935000 12K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
00002b17e9938000 296K rw--- [ anon ]
00002b17e9982000 1464K r-x-- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9af0000 2044K ----- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cef000 4K r---- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cf0000 16K rw--- /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
00002b17e9cf4000 16K r-x-- /usr/lib64/libuuid.so.1.3.0
00002b17e9cf8000 2044K ----- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef7000 4K r---- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef8000 4K rw--- /usr/lib64/libuuid.so.1.3.0
00002b17e9ef9000 932K r-x-- /usr/lib64/libstdc++.so.6.0.19
00002b17e9fe2000 2048K ----- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1e2000 32K r---- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1ea000 8K rw--- /usr/lib64/libstdc++.so.6.0.19
00002b17ea1ec000 84K rw--- [ anon ]
00002b17ea201000 144K r-x-- /usr/lib64/liblzma.so.5.0.99
00002b17ea225000 2044K ----- /usr/lib64/liblzma.so.5.0.99
00002b17ea424000 4K r---- /usr/lib64/liblzma.so.5.0.99
00002b17ea425000 4K rw--- /usr/lib64/liblzma.so.5.0.99
00002b17ea426000 292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea46f000 2048K ----- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea66f000 4K r---- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea670000 8K rw--- /usr/lib64/libgssapi_krb5.so.2.2
00002b17ea672000 852K r-x-- /usr/lib64/libkrb5.so.3.3
00002b17ea747000 2048K ----- /usr/lib64/libkrb5.so.3.3
00002b17ea947000 52K r---- /usr/lib64/libkrb5.so.3.3
00002b17ea954000 12K rw--- /usr/lib64/libkrb5.so.3.3
00002b17ea957000 12K r-x-- /usr/lib64/libcom_err.so.2.1
00002b17ea95a000 2044K ----- /usr/lib64/libcom_err.so.2.1
00002b17eab59000 4K r---- /usr/lib64/libcom_err.so.2.1
00002b17eab5a000 4K rw--- /usr/lib64/libcom_err.so.2.1
00002b17eab5b000 188K r-x-- /usr/lib64/libk5crypto.so.3.1
00002b17eab8a000 2044K ----- /usr/lib64/libk5crypto.so.3.1
00002b17ead89000 8K r---- /usr/lib64/libk5crypto.so.3.1
00002b17ead8b000 4K rw--- /usr/lib64/libk5crypto.so.3.1
00002b17ead8c000 4K rw--- [ anon ]
00002b17ead8d000 284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eadd4000 2044K ----- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafd3000 12K r---- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafd6000 16K rw--- /usr/lib64/libnl-route-3.so.200.16.1
00002b17eafda000 8K rw--- [ anon ]
00002b17eafdc000 104K r-x-- /usr/lib64/libnl-3.so.200.16.1
00002b17eaff6000 2044K ----- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f5000 8K r---- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f7000 4K rw--- /usr/lib64/libnl-3.so.200.16.1
00002b17eb1f8000 52K r-x-- /usr/lib64/libkrb5support.so.0.1
00002b17eb205000 2048K ----- /usr/lib64/libkrb5support.so.0.1
00002b17eb405000 4K r---- /usr/lib64/libkrb5support.so.0.1
00002b17eb406000 4K rw--- /usr/lib64/libkrb5support.so.0.1
00002b17eb407000 12K r-x-- /usr/lib64/libkeyutils.so.1.5
00002b17eb40a000 2044K ----- /usr/lib64/libkeyutils.so.1.5
00002b17eb609000 4K r---- /usr/lib64/libkeyutils.so.1.5
00002b17eb60a000 4K rw--- /usr/lib64/libkeyutils.so.1.5
00002b17eb60b000 88K r-x-- /usr/lib64/libresolv-2.17.so
00002b17eb621000 2048K ----- /usr/lib64/libresolv-2.17.so
00002b17eb821000 4K r---- /usr/lib64/libresolv-2.17.so
00002b17eb822000 4K rw--- /usr/lib64/libresolv-2.17.so
00002b17eb823000 8K rw--- [ anon ]
00002b17eb825000 132K r-x-- /usr/lib64/libselinux.so.1
00002b17eb846000 2048K ----- /usr/lib64/libselinux.so.1
00002b17eba46000 4K r---- /usr/lib64/libselinux.so.1
00002b17eba47000 4K rw--- /usr/lib64/libselinux.so.1
00002b17eba48000 8K rw--- [ anon ]
00002b17eba4a000 384K r-x-- /usr/lib64/libpcre.so.1.2.0
00002b17ebaaa000 2044K ----- /usr/lib64/libpcre.so.1.2.0
00002b17ebca9000 4K r---- /usr/lib64/libpcre.so.1.2.0
00002b17ebcaa000 4K rw--- /usr/lib64/libpcre.so.1.2.0
00002b17ebcab000 4K ----- [ anon ]
00002b17ebcac000 3352K rw--- [ anon ]
00002b17ec000000 132K rw--- [ anon ]
00002b17ec021000 65404K ----- [ anon ]
00002b17f0000000 4K ----- [ anon ]
00002b17f0001000 2048K rw--- [ anon ]
00002b17f0201000 16K r-x-- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0205000 2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0404000 4K r---- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0405000 4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so
00002b17f0406000 4K rw--- [ anon ]
00002b17f0407000 4096K rw--- [ anon ]
00002b17f0807000 1032K rw--- [ anon ]
00002b17f0d0a000 4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200
00002b17f112d000 132K rw--- [ anon ]
00002b17f114e000 4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 (deleted)
00002b17f1571000 8628K rw--- [ anon ]
00002b17f4000000 132K rw--- [ anon ]
00002b17f4021000 65404K ----- [ anon ]
00002b17f9e85000 9164K rw--- [ anon ]
00007ffd8b021000 31316K rw--- [ stack ]
00007ffd8cfa4000 8K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 539352K
Cheers,
Gilles
On Thursday, December 8, 2016, Christof Koehler <
Post by Christof Koehler
Hello everybody,
I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch
commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Date: Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals
Unfortunately it changes nothing. The root rank stops and all other
ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
apparently in that allreduce. The stack trace looks a bit more
interesting (git is always debug build ?), so I include it at the very
bottom just in case.
Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful ? I need a
moment to figure out how to do this, but I can definitively try.
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
r***@open-mpi.org
2016-12-08 15:48:17 UTC
Permalink
To the best I can determine, mpirun catches SIGTERM just fine and will hit the procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait to see the remote daemons complete after they hit their procs with the same sequence.
Noam Bernstein
2016-12-08 20:15:47 UTC
Permalink
Post by Gilles Gouaillardet
Christof,
There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)
in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not using the library you think you use
pmap <pid> will show you which lib is used
btw, this was not started with
mpirun --mca coll ^tuned ...
right ?
just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of a feature.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun never completes.
did i get it right ?
I just ran across very similar behavior in VASP (which we just switched over to openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call one, others call the other), and I discovered several interesting things.

The most important is that when MPI is active, the preprocessor converts (via a #define in symbol.inc) fortran STOP into calls to m_exit() (defined in mpi.F), which is a wrapper around mpi_finalize. So in my case some processes in the communicator call mpi_finalize, others call mpi_allreduce. I’m not really surprised this hangs, because I think the correct thing to replace STOP with is mpi_abort, not mpi_finalize. If you know where the STOP is called, you can check the preprocessed equivalent file (.f90 instead of .F), and see if it’s actually been replaced with a call to m_exit. I’m planning to test whether replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e. program termination when the original source file executes a STOP.
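To make that concrete, here is a rough illustration (assumed spellings, not the literal symbol.inc or mpi.F contents) of the substitution described above:

! Illustration only, with assumed spellings - not VASP's actual code.
! symbol.inc-style preprocessing rewrites a Fortran STOP into something like
!   #define STOP call m_exit(); stop
! so the "stopping" rank ends up in MPI_Finalize:
subroutine m_exit_like()
  use mpi
  implicit none
  integer :: ierr
  call MPI_Finalize(ierr)   ! peers still inside MPI_Allreduce never return
  stop
end subroutine m_exit_like
! An m_stop-style replacement would instead call
! MPI_Abort(MPI_COMM_WORLD, 1, ierr), terminating every rank at once.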

I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to hang, but just in case that’s surprising, here are my stack traces:


hung in collective:

(gdb) where
#0 0x00002b8d5a095ec6 in opal_progress () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1 0x00002b8d59b3a36d in ompi_request_default_wait_all () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2 0x00002b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3 0x00002b8d59b495ac in PMPI_Allreduce () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4 0x00002b8d598e4027 in pmpi_allreduce__ () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5 0x0000000000414077 in m_sum_i (comm=..., ivec=warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
..., n=2) at mpi.F:989
#6 0x0000000000daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., kpoints_f=...) at mkpoints_full.F:1099
#7 0x0000000001441654 in set_indpw_fock (t_info=..., p=warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1669
#8 fock::setup_fock (t_info=..., p=warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1413
#9 0x0000000002976478 in vamp () at main.F:2093
#10 0x0000000000412f9e in main ()
#11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000412ea9 in _start ()

hung in mpi_finalize:

#0 0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
#1 0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
#2 0x00002b11db1e0ae7 in ompi_mpi_finalize () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3 0x00002b11daf8b399 in pmpi_finalize__ () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#4 0x00000000004199c5 in m_exit () at mpi.F:375
#5 0x0000000000dab17f in full_kpoints::set_indpw_full (grid=..., wdes=Cannot resolve DW_OP_push_object_address for a missing object
) at mkpoints_full.F:1065
#6 0x0000000001441654 in set_indpw_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address for a missing object
) at fock.F:1669
#7 fock::setup_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address for a missing object
) at fock.F:1413
#8 0x0000000002976478 in vamp () at main.F:2093
#9 0x0000000000412f9e in main ()
#10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000412ea9 in _start ()

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Gilles Gouaillardet
2016-12-09 08:38:31 UTC
Permalink
Folks,

the problem is indeed pretty trivial to reproduce:
I opened https://github.com/open-mpi/ompi/issues/2550 (and included a reproducer).
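
In spirit it is just the mismatch described earlier in this thread: one rank leaves while the others sit in a collective. A minimal sketch of that pattern (a sketch only, not the exact program attached to the issue):

program mismatch
  use mpi
  implicit none
  integer :: rank, ierr, val
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
     ! rank 0 leaves early, like the m_exit()/STOP path discussed above
     call MPI_Finalize(ierr)
     stop
  end if
  ! the remaining ranks enter the collective and wait forever
  val = rank
  call MPI_Allreduce(MPI_IN_PLACE, val, 1, MPI_INTEGER, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program mismatch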

Cheers,

Gilles
Christof Koehler
2016-12-09 08:39:53 UTC
Permalink
Hello,

In our case the libwannier.a is a "third party" library which is built separately and then just linked in, so the vasp preprocessor never touches it. As far as I can see, no preprocessing of the f90 source is involved in the libwannier build process.

I finally managed to set a breakpoint at the program exit of the root rank:

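Roughly like this (the exact gdb commands below are reconstructed, not copied from the session):

  gdb -p <pid of root rank>
  (gdb) break _exit
  (gdb) continue
  ... and once the breakpoint in _exit is hit: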
(gdb) bt
#0 0x00002b7ccd2e4220 in _exit () from /lib64/libc.so.6
#1 0x00002b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
#2 0x00002b7ccd25eeb5 in exit () from /lib64/libc.so.6
#3 0x000000000407298d in for_stop_core ()
#4 0x00000000012fad41 in w90_io_mp_io_error_ ()
#5 0x0000000001302147 in w90_parameters_mp_param_read_ ()
#6 0x00000000012f49c6 in wannier_setup_ ()
#7 0x0000000000e166a8 in mlwf_mp_mlwf_wannier90_ ()
#8 0x00000000004319ff in vamp () at main.F:2640
#9 0x000000000040d21e in main ()
#10 0x00002b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
#11 0x000000000040d129 in _start ()

So for_stop_core is apparently called ? Of course it is below the main()
process of vasp, so additional things might happen which are not
visible. Is SIGCHLD (as observed when catching signals in mpirun) the
signal expected after a for_stop_core ?

Thank you very much for investigating this !

Cheers
Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Noam Bernstein
2016-12-09 13:19:27 UTC
Permalink
Looks like my case really was just VASP's fault, and I'd call it a VASP bug (you shouldn't call mpi_finalize from only a subset of the tasks). Yours is similar, but not actually the same, since it is actually trying to stop the task, and one would at least hope that OpenMPI could detect that and exit.

Noam

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Gilles Gouaillardet
2016-12-12 00:32:25 UTC
Permalink
Christof,

Ralph fixed the issue; meanwhile, the patch can be manually downloaded at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
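
Applying it follows the usual routine, roughly (adjust the path to your own Open MPI 2.0.1 source tree; this is only a sketch of the standard procedure, untested here):

  cd /path/to/openmpi-2.0.1-source
  wget https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
  patch -p1 < 2552.patch
  make && make install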

Cheers,

Gilles
Noam Bernstein
2016-12-07 22:28:49 UTC
Permalink
Post by Christof Koehler
Post by Noam Bernstein
Presumably someone here can comment on what the standard says about the validity of terminating without mpi_abort.
Well, probably stop is not a good way to terminate then.
My main point was the change relative to 1.10 anyway :-)
It’s definitely not the clean way to terminate, but I think everyone agrees that it shouldn’t hang if it can be avoided.
Post by Christof Koehler
Post by Noam Bernstein
Actually, if you’re willing to share enough input files to reproduce, I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix a crash that was apparently addressed by some change in the memory allocator in a recent version of openmpi. Just e-mail me if that’s the case.
I think that is no longer necessary ? In principle it is no problem, but
it is at the end of a (small) GW calculation, the Si tutorial example.
So the mail would be a bit larger due to the WAVECAR.
I agree. It sounds like it's clearly a failure to exit from a collective communication when a process dies (from the point of view of mpi, since mpi_abort is not being called, it's just a process dying). Maybe the patch in Ralph's e-mail fixes it.

Noam

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil