Discussion:
[OMPI users] mpi send/recv pair hanging
Noam Bernstein
2018-04-05 14:16:21 UTC
Hi all - I have a code that uses MPI (VASP), and it’s hanging in a strange way. There’s a Cartesian communicator, 4x16 (64 processes total), and even though the communication pattern is quite regular, one particular send/recv pair hangs consistently. Across each row of 4, task 0 receives from tasks 1, 2, and 3, and tasks 1, 2, and 3 send to 0. On most of the 16 such sets all those send/recv pairs complete; however, on 2 of them both the send and the recv hang. I have stack traces (taken with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
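
For concreteness, a minimal standalone sketch of that communication pattern (this is not VASP's code; the communicator split, buffer size, and tag are placeholders, the 200 simply matching the tag that shows up in the debug traces later in this thread):

/* Sketch of the pattern described above: 64 ranks split into rows of 4;
 * within each row, ranks 1..3 send one message to the row leader (row
 * rank 0), which receives them in turn. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* One communicator per row of 4, analogous to one dimension of the 4x16 grid. */
    MPI_Comm row;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / 4, world_rank % 4, &row);

    int row_rank, row_size;
    MPI_Comm_rank(row, &row_rank);
    MPI_Comm_size(row, &row_size);

    double buf[1024] = {0};
    const int tag = 200;   /* placeholder tag */

    if (row_rank == 0) {
        /* Row leader receives from each other rank in the row, in turn. */
        for (int src = 1; src < row_size; src++)
            MPI_Recv(buf, 1024, MPI_DOUBLE, src, tag, row, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf, 1024, MPI_DOUBLE, 0, tag, row);
    }

    MPI_Comm_free(&row);
    MPI_Finalize();
    return 0;
}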

receiving:
0x00002b06eeed0eb2 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#0 0x00002b06eeed0eb2 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1 0x00002b06f0a5d2de in poll_device () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#2 0x00002b06f0a5e0af in btl_openib_component_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#3 0x00002b06dd3c00b0 in opal_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40
#4 0x00002b06f1c9232d in mca_pml_ob1_recv () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_pml_ob1.so
#5 0x00002b06dce56bb7 in PMPI_Recv () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
#6 0x00002b06dcbd1e0b in pmpi_recv__ () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40
#7 0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=) at mpi.F:680
#8 0x000000000123e0b7 in fileio::outwav (io=..., wdes=..., w=) at fileio.F:952
#9 0x0000000002abfccf in vamp () at main.F:4204
#10 0x00000000004139de in main ()
#11 0x000000314561ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000004138e9 in _start ()
sending:
0x00002abc32ed0ea1 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#0 0x00002abc32ed0ea1 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1 0x00002abc34a5d2de in poll_device () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#2 0x00002abc34a5e0af in btl_openib_component_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#3 0x00002abc238800b0 in opal_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40
#4 0x00002abc35c95955 in mca_pml_ob1_send () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_pml_ob1.so
#5 0x00002abc2331c412 in PMPI_Send () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
#6 0x00002abc230927e0 in pmpi_send__ () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40
#7 0x0000000000428798 in m_send_z (comm=..., node=) at mpi.F:655
#8 0x000000000123d0a9 in fileio::outwav (io=..., wdes=) at fileio.F:942
#9 0x0000000002abfccf in vamp () at main.F:4204
#10 0x00000000004139de in main ()
#11 0x0000003cec81ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000004138e9 in _start ()

This is with OpenMPI 3.0.1 (same for 3.0.0; I haven’t checked older versions) and Intel compilers (17.2.174). It seems to be independent of which nodes are used, it always happens on this pair of calls, it happens only after the code has been running for a while, and the same code works fine for the other 14 sets of 4, all of which suggests an MPI issue rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?

thanks,
Noam
Reuti
2018-04-05 15:03:56 UTC
Hi,

> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>
> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs complete. However, on 2 of them, it hangs (both the send and recv). I have stack traces (with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
>
> <snip>
>
> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), Intel compilers (17.2.174). It seems to be independent of which nodes, always happens on this pair of calls and happens after the code has been running for a while, and the same code for the other 14 sets of 4 work fine, suggesting that it’s an MPI issue, rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?

Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found that a self-compiled scaLAPACK works fine in combination with Open MPI. Intel scaLAPACK with Intel MPI also works fine. What I never got working was the combination of Intel scaLAPACK and Open MPI – at one point one process got a message from a wrong rank, IIRC. I tried both the Intel-supplied Open MPI version of scaLAPACK and compiling the necessary interface for Open MPI myself in $MKLROOT/interfaces/mklmpi, with identical results.

-- Reuti
Noam Bernstein
2018-04-05 15:15:24 UTC
> On Apr 5, 2018, at 11:03 AM, Reuti <***@staff.uni-marburg.de> wrote:
>
> Hi,
>
>> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>>
>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs complete. However, on 2 of them, it hangs (both the send and recv). I have stack traces (with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
>>
>> <snip>
>>
>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), Intel compilers (17.2.174). It seems to be independent of which nodes, always happens on this pair of calls and happens after the code has been running for a while, and the same code for the other 14 sets of 4 work fine, suggesting that it’s an MPI issue, rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?
>
> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found, that using in addition a self-compiled scaLAPACK is working fine in combination with Open MPI. Using Intel scaLAPACK and Intel MPI is also working fine. What I never got working was the combination Intel scaLAPACK and Open MPI – at one point one process got a message from a wrong rank IIRC. I tried both: the Intel supplied Open MPI version of scaLAPACK and also compiling the necessary interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with identical results.

MKL BLAS/LAPACK, with my own self-compiled scaLAPACK, but in this run I set LSCALAPACK=.FALSE. I suppose I could try compiling without it just to test. In any case, this happens when it’s writing out the wavefunctions, which I would assume to be unrelated to scaLAPACK operations (unless they’re corrupting some low-level MPI thing, I guess).

Noam
Noam Bernstein
2018-04-05 15:39:23 UTC
> On Apr 5, 2018, at 11:32 AM, Edgar Gabriel <***@Central.UH.EDU> wrote:
>
> is the file I/O that you mentioned using MPI I/O for that? If yes, what file system are you writing to?

No MPI I/O. Just MPI calls to gather the data, and plain Fortran I/O on the head node only.

I should also say that in lots of other circumstances (different node numbers, computational systems, etc) it works fine. But the hang is completely repeatable for this particular set of parameters (MPI and physical simulation). I haven’t explored to see what variations do/don’t lead to this kind of hanging.

Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Edgar Gabriel
2018-04-05 16:01:24 UTC
is the file I/O that you mentioned using MPI I/O for that? If yes, what
file system are you writing to?

Edgar


On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>> On Apr 5, 2018, at 11:03 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>
>> Hi,
>>
>>> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>
>>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs complete. However, on 2 of them, it hangs (both the send and recv). I have stack traces (with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
>>>
>>> <snip>
>>>
>>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), Intel compilers (17.2.174). It seems to be independent of which nodes, always happens on this pair of calls and happens after the code has been running for a while, and the same code for the other 14 sets of 4 work fine, suggesting that it’s an MPI issue, rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?
>> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found, that using in addition a self-compiled scaLAPACK is working fine in combination with Open MPI. Using Intel scaLAPACK and Intel MPI is also working fine. What I never got working was the combination Intel scaLAPACK and Open MPI – at one point one process got a message from a wrong rank IIRC. I tried both: the Intel supplied Open MPI version of scaLAPACK and also compiling the necessary interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with identical results.
> MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I set LSCALAPCK=.FALSE. I suppose I could try compiling without it just to test. In any case, this is when it’s writing out the wavefunctions, which I would assume be unrelated to scalapack operations (unless they’re corrupting some low level MPI thing, I guess).
>
> Noam
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
George Bosilca
2018-04-05 19:55:52 UTC
Noam,

The OB1 PML provides a mechanism to dump all pending communications in a
particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
idea how you can find the pointer to the communicator from your code, but
if you compile OMPI in debug mode you will see it as an argument to the
mca_pml_ob1_send and mca_pml_ob1_recv functions.

This information will give us a better idea of what happened to the
message, whether it has been sent (or not), and what source and tag were
used for the matching.

George.



On Thu, Apr 5, 2018 at 12:01 PM, Edgar Gabriel <***@central.uh.edu>
wrote:

> is the file I/O that you mentioned using MPI I/O for that? If yes, what
> file system are you writing to?
>
> Edgar
>
>
>
> On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>
>> On Apr 5, 2018, at 11:03 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>>
>>> Hi,
>>>
>>>> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>>
>>>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a
>>>> strange way. Basically, there’s a Cartesian communicator, 4x16 (64
>>>> processes total), and despite the fact that the communication pattern is
>>>> rather regular, one particular send/recv pair hangs consistently.
>>>> Basically, across each row of 4, task 0 receives from 1,2,3, and tasks
>>>> 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs
>>>> complete. However, on 2 of them, it hangs (both the send and recv). I
>>>> have stack traces (with gdb -p on the running processes) from what I
>>>> believe are corresponding send/recv pairs.
>>>>
>>>> <snip>
>>>>
>>>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older
>>>> versions), Intel compilers (17.2.174). It seems to be independent of which
>>>> nodes, always happens on this pair of calls and happens after the code has
>>>> been running for a while, and the same code for the other 14 sets of 4 work
>>>> fine, suggesting that it’s an MPI issue, rather than an obvious bug in this
>>>> code or a hardware problem. Does anyone have any ideas, either about
>>>> possible causes or how to debug things further?
>>>>
>>> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL
>>> with the Intel compilers for VASP and found, that using in addition a
>>> self-compiled scaLAPACK is working fine in combination with Open MPI. Using
>>> Intel scaLAPACK and Intel MPI is also working fine. What I never got
>>> working was the combination Intel scaLAPACK and Open MPI – at one point one
>>> process got a message from a wrong rank IIRC. I tried both: the Intel
>>> supplied Open MPI version of scaLAPACK and also compiling the necessary
>>> interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with
>>> identical results.
>>>
>> MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I
>> set LSCALAPCK=.FALSE. I suppose I could try compiling without it just to
>> test. In any case, this is when it’s writing out the wavefunctions, which
>> I would assume be unrelated to scalapack operations (unless they’re
>> corrupting some low level MPI thing, I guess).
>>
>>
>> Noam
>>
>> _______________________________________________
>> users mailing list
>> ***@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-05 20:03:45 UTC
> On Apr 5, 2018, at 3:55 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> The OB1 provide a mechanism to dump all pending communications in a particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how you can find the pointer to the communicator out of your code, but if you compile OMPI in debug mode you will see it as an argument to the mca_pml_ob1_send and mca_pml_ob1_recv function.
>
> This information will give us a better idea on what happened to the message, where is has been sent (or not), and what were the source and tag used for the matching.

Interesting. How would you do this in a hung program? Call it before you call the things that you expect will hang? And any ideas on how to get a communicator pointer from Fortran?

Noam
George Bosilca
2018-04-05 20:11:52 UTC
I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)".
This allows the debugger to call our function and output internal
information about the library status.
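
In rough outline, something like this (the PID and frame number are placeholders; use whichever frame shows the comm argument):

gdb -p <pid of a hung rank>
(gdb) bt                                # find the mca_pml_ob1_recv / mca_pml_ob1_send frame
(gdb) frame 6                           # placeholder frame number
(gdb) call mca_pml_ob1_dump(comm, 1)    # or paste the comm pointer value directly
(gdb) detach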

George.



On Thu, Apr 5, 2018 at 4:03 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

> On Apr 5, 2018, at 3:55 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> The OB1 provide a mechanism to dump all pending communications in a
> particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
> 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
> idea how you can find the pointer to the communicator out of your code, but
> if you compile OMPI in debug mode you will see it as an argument to the mca_pml_ob1_send
> and mca_pml_ob1_recv function.
>
> This information will give us a better idea on what happened to the
> message, where is has been sent (or not), and what were the source and tag
> used for the matching.
>
>
> Interesting. How would you do this in a hung program? Call it before you
> call the things that you expect will hang? And any ideas how to get a
> communicator pointer from fortran?
>
> Noam
>
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-05 20:20:54 UTC
> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". This allows the debugger to make a call our function, and output internal information about the library status.

Great. But I guess I need to recompile ompi in debug mode? Is that just a flag to configure?

thanks,
Noam



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
George Bosilca
2018-04-05 20:31:34 UTC
Yes, you can do this by adding --enable-debug to the OMPI configure (and make
sure you don't have the configure flag --with-platform=optimize).
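
Something along these lines (the install prefix and any other options are placeholders for whatever you normally use):

./configure --prefix=<install-dir> --enable-debug <your usual options>
make -j 8 all
make install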

George.


On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

>
> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to make a call our function, and output
> internal information about the library status.
>
>
> Great. But I guess I need to recompile ompi in debug mode? Is that just
> a flag to configure?
>
> thanks,
> Noam
>
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Gilles Gouaillardet
2018-04-05 23:54:09 UTC
Noam,

you might also want to try

mpirun --mca btl tcp,self ...

to rule out btl (shared memory and/or infiniband) related issues.


Once you rebuild Open MPI with --enable-debug, I recommend you first
check the arguments of the MPI_Send() and MPI_Recv() functions and
make sure that:
- the same communicator is used (in C, check comm->c_contextid)
- the same tag is used
- the MPI tasks really do wait for each other (in C, check
comm->c_my_rank, source and dest)
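
With the debug build, a quick way to make those checks is from the PMPI frame of each hung process, e.g. (the frame number is a placeholder; pick whichever frame shows the arguments):

(gdb) frame 7                  # the PMPI_Recv frame (PMPI_Send on the sending side)
(gdb) print comm->c_contextid
(gdb) print comm->c_my_rank
(gdb) print source             # "dest" on the sending side
(gdb) print tag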


Cheers,

Gilles

On Fri, Apr 6, 2018 at 5:31 AM, George Bosilca <***@icl.utk.edu> wrote:
> Yes, you can do this by adding --enable-debug to OMPI configure (and make
> sure your don't have the configure flag --with-platform=optimize).
>
> George.
>
>
> On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein <***@nrl.navy.mil>
> wrote:
>>
>>
>> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>>
>> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
>> 1)". This allows the debugger to make a call our function, and output
>> internal information about the library status.
>>
>>
>> Great. But I guess I need to recompile ompi in debug mode? Is that just
>> a flag to configure?
>>
>> thanks,
>> Noam
>>
>>
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628 F +1 202 404 7546
>> https://www.nrl.navy.mil
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
Noam Bernstein
2018-04-06 16:53:07 UTC
> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". This allows the debugger to make a call our function, and output internal information about the library status.

OK - after a number of missteps, I recompiled openmpi with debugging mode active, reran the executable (didn’t recompile it, just used the new library), and got the comm pointer by attaching to the process and looking at the stack trace:

#0 0x00002b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256, wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
#1 0x00002b8a759a8194 in poll_device (device=0xebc5300, count=0) at btl_openib_component.c:3608
#2 0x00002b8a759a871f in progress_one_device (device=0xebc5300) at btl_openib_component.c:3741
#3 0x00002b8a759a87be in btl_openib_component_progress () at btl_openib_component.c:3765
#4 0x00002b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
#5 0x00002b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at ../../../../ompi/request/request.h:392
#6 0x00002b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20, count=5423600, datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0, status=0x385dd90) at pml_ob1_irecv.c:135
#7 0x00002b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600, type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90) at precv.c:79
#8 0x00002b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20 "DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?", count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c) at precv_f.c:85
#9 0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot access memory at address 0x2d
) at mpi.F:680
#10 0x000000000123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot access memory at address 0x2d
) at fileio.F:952
#11 0x0000000002abfd8f in vamp () at main.F:4204
#12 0x00000000004139de in main ()
#13 0x0000003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
#14 0x00000000004138e9 in _start ()

The comm value is different in ompi_recv_f and the things below it, so I tried both. With the value from the lower-level functions I get nothing useful:
(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0
and with the value from ompi_recv_f I get a seg fault:
(gdb) call mca_pml_ob1_dump(0x5d30a68, 1)

Program received signal SIGSEGV, Segmentation fault.
0x00002b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at pml_ob1.c:577
577 opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d num_procs %lu last_probed %lu\n",
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mca_pml_ob1_dump) will be abandoned.
When the function is done executing, GDB will silently stop.

Should this have worked, or am I doing something wrong?

thanks,
Noam
George Bosilca
2018-04-06 17:41:56 UTC
Noam,

According to your stack trace, the correct way to call mca_pml_ob1_dump
is with the communicator from the PMPI call. Thus, this call was successful:

(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0


I should have been clearer: the output does not appear in gdb but on the
output stream of your application. If you run your application by hand with
mpirun, the output should be on the terminal where you started mpirun. If
you start your job with a batch scheduler, the output should be in the
output file associated with your job.

George.



On Fri, Apr 6, 2018 at 12:53 PM, Noam Bernstein <***@nrl.navy.mil
> wrote:

> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to make a call our function, and output
> internal information about the library status.
>
>
> OK - after a number of missteps, I recompiled openmpi with debugging mode
> active, reran the executable (didn’t recompile, just using the new
> library), and got the comm pointer by attaching to the process and looking
> at the stack trace:
>
> #0 0x00002b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256,
> wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
> #1 0x00002b8a759a8194 in poll_device (device=0xebc5300, count=0) at
> btl_openib_component.c:3608
> #2 0x00002b8a759a871f in progress_one_device (device=0xebc5300) at
> btl_openib_component.c:3741
> #3 0x00002b8a759a87be in btl_openib_component_progress () at
> btl_openib_component.c:3765
> #4 0x00002b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
> #5 0x00002b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at
> ../../../../ompi/request/request.h:392
> #6 0x00002b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20,
> count=5423600, datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0,
> status=0x385dd90) at pml_ob1_irecv.c:135
> #7 0x00002b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600,
> type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90)
> at precv.c:79
> #8 0x00002b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20
> "DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\
> 225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?",
> count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
> tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c)
> at precv_f.c:85
> #9 0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot
> access memory at address 0x2d
> ) at mpi.F:680
> #10 0x000000000123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot
> access memory at address 0x2d
> ) at fileio.F:952
> #11 0x0000000002abfd8f in vamp () at main.F:4204
> #12 0x00000000004139de in main ()
> #13 0x0000003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
> #14 0x00000000004138e9 in _start ()
>
>
> The comm value is different in omp_recv_f and things below, so I tried
> both. With the value of the lower level functions I get nothing useful
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> and the value from omp_recv_f I get a seg fault:
>
> (gdb) call mca_pml_ob1_dump(0x5d30a68, 1)
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at
> pml_ob1.c:577
> 577 opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d
> num_procs %lu last_probed %lu\n",
> The program being debugged was signaled while in a function called from
> GDB.
> GDB remains in the frame where the signal was received.
> To change this behavior use "set unwindonsignal on".
> Evaluation of the expression containing the function
> (mca_pml_ob1_dump) will be abandoned.
> When the function is done executing, GDB will silently stop.
>
> Should this have worked, or am I doing something wrong?
>
> thanks,
> Noam
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-06 18:42:29 UTC
> On Apr 6, 2018, at 1:41 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace the correct way to call the mca_pml_ob1_dump is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> I should have been more clear, the output is not on gdb but on the output stream of your application. If you run your application by hand with mpirun, the output should be on the terminal where you started mpirun. If you start your job with a batch schedule, the output should be in the output file associated with your job.
>

OK, that makes sense. Here’s what I get from the two relevant processes. compute-1-9 should be receiving, and 1-10 sending, I believe. Is it possible that the fact that all the send/recv pairs (nodes 1-3 on each set of 4 sending to 0, which receives from each one in turn) are using the same tag (200) is confusing things?

[compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
[compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
[compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [ ] ctx 5 src 2 tag 200 seq 126 msg_length 86777600
[compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [ ] ctx 5 src 3 tag 200 seq 8557 msg_length 86777600

[compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
[compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174
[compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561
[compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
George Bosilca
2018-04-08 19:58:13 UTC
Noam,

Thanks for your output, it highlights an unusual outcome. It shows that a
process (29662) has pending messages from other processes that are tagged
with a past sequence number, something that should not have happened. The
only way to get that is if somehow we screwed up the sending part and pushed
the same sequence number twice ...

More digging is required.

George.



On Fri, Apr 6, 2018 at 2:42 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

>
> On Apr 6, 2018, at 1:41 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace the correct way to call the mca_pml_ob1_dump
> is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
>
> I should have been more clear, the output is not on gdb but on the output
> stream of your application. If you run your application by hand with
> mpirun, the output should be on the terminal where you started mpirun. If
> you start your job with a batch schedule, the output should be in the
> output file associated with your job.
>
>
> OK, that makes sense. Here’s what I get from the two relevant processes.
> compute-1-9 should be receiving, and 1-10 sending, I believe. Is it
> possible that the fact that all send send/recv pairs (nodes 1-3 on each set
> of 4 sending to 0, which is receiving from each one in turn) are using the
> same tag (200) is confusing things?
>
> [compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
> [compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq
> 8941
> [compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq
> 385
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [ ] ctx 5 src 2 tag 200 seq 126
> msg_length 86777600
> [compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90
> send_seq 5
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [ ] ctx 5 src 3 tag 200 seq 8557
> msg_length 86777600
>
> [compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
> [compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0
> send_seq 174
> [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq
> 8561
> [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0
> send_seq 385
>
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-08 22:00:27 UTC
> On Apr 8, 2018, at 3:58 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> Thanks for your output, it highlight an usual outcome. It shows that a process (29662) has pending messages from other processes that are tagged with a past sequence number, something that should have not happened. The only way to get that is if somehow we screwed-up the sending part and push the same sequence number twice ...
>
> More digging is required.

OK - these sequence numbers are unrelated to the send/recv tags, right? I’m happy to do any further debugging. I can’t share code, since we do have access but it’s not open source, but I’d be happy to test out anything you can suggest.

thanks,
Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
George Bosilca
2018-04-09 00:30:51 UTC
Right, it has nothing to do with the tag. The sequence number is an
internal counter that helps OMPI deliver the messages in the MPI-required
order (FIFO ordering per communicator per peer).
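
A simplified sketch of that bookkeeping (these are made-up structures for illustration, not the actual OB1 code): every sender keeps a per-peer counter that it stamps on outgoing messages, and the receiver only matches a message from a peer when the stamp equals the sequence it currently expects from that peer. These correspond to the send_seq / expected_seq fields in the dump output above.

/* Illustrative sketch only, not the actual OB1 data structures. */
#include <stdint.h>

#define MAX_PEERS 64

typedef struct {
    uint16_t send_seq[MAX_PEERS];      /* next stamp for a send to peer i  */
    uint16_t expected_seq[MAX_PEERS];  /* next stamp accepted from peer i  */
} seq_state_t;

/* Sender side: stamp the message and advance the counter. */
static uint16_t stamp_send(seq_state_t *s, int dst)
{
    return s->send_seq[dst]++;
}

/* Receiver side: only the in-order message is matched now; anything else
 * has to wait until the earlier stamps from that peer have been consumed. */
static int matches_now(seq_state_t *s, int src, uint16_t seq)
{
    if (seq != s->expected_seq[src])
        return 0;
    s->expected_seq[src]++;
    return 1;
}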

Thanks for offering your help to debug this issue. We'll need to figure out
how this can happen, and we will get back to you for further debugging.

George.



On Sun, Apr 8, 2018 at 6:00 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

> On Apr 8, 2018, at 3:58 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> Thanks for your output, it highlight an usual outcome. It shows that a
> process (29662) has pending messages from other processes that are tagged
> with a past sequence number, something that should have not happened. The
> only way to get that is if somehow we screwed-up the sending part and push
> the same sequence number twice ...
>
> More digging is required.
>
>
> OK - these sequence numbers are unrelated to the send/recv tags, right?
> I’m happy to do any further debugging. I can’t share code, since we do
> have access but it’s not open source, but I’d be happy to test out anything
> you can suggest.
>
> thanks,
> Noam
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
George Bosilca
2018-04-09 22:36:52 UTC
Noam,

I have a few questions for you. According to your original email you are
using OMPI 3.0.1 (but the hang can also be reproduced with 3.0.0). Also,
according to your stack trace I assume it is x86_64, compiled with icc.

Is your application multithreaded? How did you initialize MPI (which
level of threading)? Can you send us the opal_config.h file, please?

Thanks,
George.




On Sun, Apr 8, 2018 at 8:30 PM, George Bosilca <***@icl.utk.edu> wrote:

> Right, it has nothing to do with the tag. The sequence number is an
> internal counter that help OMPI to deliver the messages in the MPI required
> order (FIFO ordering per communicator per peer).
>
> Thanks for offering your help to debug this issue. We'll need to figure
> out how this can happen, and we will get back to you for further debugging.
>
> George.
>
>
>
> On Sun, Apr 8, 2018 at 6:00 PM, Noam Bernstein <
> ***@nrl.navy.mil> wrote:
>
>> On Apr 8, 2018, at 3:58 PM, George Bosilca <***@icl.utk.edu> wrote:
>>
>> Noam,
>>
>> Thanks for your output, it highlight an usual outcome. It shows that a
>> process (29662) has pending messages from other processes that are
>> tagged with a past sequence number, something that should have not
>> happened. The only way to get that is if somehow we screwed-up the sending
>> part and push the same sequence number twice ...
>>
>> More digging is required.
>>
>>
>> OK - these sequence numbers are unrelated to the send/recv tags, right?
>> I’m happy to do any further debugging. I can’t share code, since we do
>> have access but it’s not open source, but I’d be happy to test out anything
>> you can suggest.
>>
>> thanks,
>> Noam
>>
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628 F +1 202 404 7546
>> https://www.nrl.navy.mil
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
Noam Bernstein
2018-04-09 23:04:16 UTC
> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).

Correct.

> Also according to your stacktrace I assume it is an x86_64, compiled with icc.

x86_64, yes, but gcc + ifort. I can test with gcc + gfortran if that’s helpful.

>
> Is your application multithreaded ? How did you initialized MPI (which level of threading) ? Can you send us the opal_config.h file please.

No, no multithreading, at least not intentionally. I can run with OMP_NUM_THREADS explicitly set to 1 if you’d like to exclude that as a possibility. opal_config.h is attached, from ./opal/include/opal_config.h in the build directory.

Noam



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Reuti
2018-04-10 08:20:42 UTC
> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>
>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>
>> Noam,
>>
>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>
> Correct.
>
>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>
> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.

Was there any reason not to choose icc + ifort?

-- Reuti


>
>> Is your application multithreaded ? How did you initialized MPI (which level of threading) ? Can you send us the opal_config.h file please.
>
> No, no multithreading, at least not intentionally. I can run with OMP_NUM_THREADS explicitly 1 if you’d like to exclude that as a possibility. opal_config.h is attached, from ./opal/include/opal_config.h in the build directory.
>
> Noam
>
>
>
> ____________
> ||
> |U.S. NAVAL|
> |_RESEARCH_|
> LABORATORY
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
> <opal_config.h>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
Noam Bernstein
2018-04-10 11:37:21 UTC
> On Apr 10, 2018, at 4:20 AM, Reuti <***@staff.uni-marburg.de> wrote:
>
>>
>> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>>
>>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>>
>>> Noam,
>>>
>>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>>
>> Correct.
>>
>>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>>
>> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.
>
> Was there any reason not to choose icc + ifort?

For historical reasons, we only bought ifort, not the complete compiler suite. But VASP is 99% fortran, so I doubt it makes a difference in this case.

Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Reuti
2018-04-10 11:56:55 UTC
> On 10.04.2018 at 13:37, Noam Bernstein <***@nrl.navy.mil> wrote:
>
>> On Apr 10, 2018, at 4:20 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>
>>>
>>> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>
>>>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>>>
>>>> Noam,
>>>>
>>>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>>>
>>> Correct.
>>>
>>>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>>>
>>> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.
>>
>> Was there any reason not to choose icc + ifort?
>
> For historical reasons, we only bought ifort, not the complete compiler suite. But VASP is 99% fortran, so I doubt it makes a difference in this case.

I see. Sure, it's nothing that would change the behavior of VASP itself, but maybe the interplay with an Open MPI compiled with gcc. In my compilations I try to stay with one vendor, be it GCC, PGI or Intel.

It looks like icc/icpc is freely available now: https://software.intel.com/en-us/system-studio/choose-download#technical Choosing Linux + Linux as the platform to develop and execute on seems to give the full icc/icpc incl. the MKL (except the Fortran libs and scaLAPACK – but both are freely available in another package). The only point to take care of is the location intel/system_studio_2018, where the usual compiler directories are located, not one level above.

-- Reuti
Nathan Hjelm
2018-04-10 12:46:15 UTC
Using icc will not change anything unless there is a bug in the gcc version. I personally never build Open MPI with icc as it is slow and provides no benefit over gcc these days. I do, however, use ifort for the Fortran bindings.
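
For what it's worth, a hypothetical configure line for that mix (everything besides the compiler choices is a placeholder):

./configure CC=gcc CXX=g++ FC=ifort --prefix=<install-dir> <your other options>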

-Nathan

> On Apr 10, 2018, at 5:56 AM, Reuti <***@staff.uni-marburg.de> wrote:
>
>
>>> On 10.04.2018 at 13:37, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>
>>>> On Apr 10, 2018, at 4:20 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>>>
>>>>
>>>>> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>>>
>>>>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>>>>
>>>>> Noam,
>>>>>
>>>>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>>>>
>>>> Correct.
>>>>
>>>>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>>>>
>>>> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.
>>>
>>> Was there any reason not to choose icc + ifort?
>>
>> For historical reasons, we only bought ifort, not the complete compiler suite. But VASP is 99% fortran, so I doubt it makes a difference in this case.
>
> I see. Sure it's nothing which would change the behavior of VASP, but maybe the interplay with Open MPI compiled with gcc. I try in my compilations to stay with one vendor, being it GCC, PGI or Intel.
>
> Looks like icc/icpc is freely available now: https://software.intel.com/en-us/system-studio/choose-download#technical Choosing Linux + Linux as platform to develop and execute seems to be the full icc/icpc incl. the MKL (except the Fortran libs and scaLAPACK – but both are freely available in another package). Only point to take care of, is the location intel/system_studio_2018 where the usual compiler directories are located and not one level above.
>
> -- Reuti
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users