Ben Menadue
2018-05-16 23:59:43 UTC
Hi,
I'm trying to debug a user's program that uses dynamic process management through Rmpi + doMPI. We're seeing a hang in MPI_Comm_disconnect. Each of the processes is stuck with the following backtrace:
#0 0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
#2 0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
#3 0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
#4 0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
#5 0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
This is using Open MPI 3.1.0 against an external install of PMIx 2.1.1, but I see exactly the same issue with 3.0.1 using its internal PMIx. It looks similar to issue #4542, but the corresponding patch in PR #4549 doesn't seem to help (it just hangs in PMIx_Fence instead of PMIx_Disconnect).
Attached is the offending R script; it hangs in the "closeCluster" call. Has anyone seen this issue? I'm not sure what approach to take to debug it, but I have builds of the MPI libraries with --enable-debug available if needed.
Cheers,
Ben