Discussion:
[OMPI users] 3.x - hang in MPI_Comm_disconnect
Ben Menadue
2018-05-16 23:59:43 UTC
Hi,

I’m trying to debug a user’s program that uses dynamic process management through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of the processes is stuck in:

#0 0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
#2 0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
#3 0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
#4 0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
#5 0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
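
For context, frame #5 is Rmpi’s own wrapper around MPI_Comm_disconnect. The same code path can be reached directly from Rmpi without doMPI; here is a minimal sketch of the spawn/teardown pattern involved (an illustration of the call path, not the user’s script):

library(Rmpi)

# Spawn two R workers; Rmpi merges the spawned intercommunicator
# into communicator 1 by default.
mpi.spawn.Rslaves(nslaves = 2)

# Trivial round trip to confirm the workers are alive.
mpi.remote.exec(mpi.comm.rank())

# Tear the workers down; the disconnect underneath this call is
# where the reported hang occurs (MPI_Comm_disconnect -> PMIx_Disconnect).
mpi.close.Rslaves()
mpi.quit()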

This is with Open MPI 3.1.0 built against an external install of PMIx 2.1.1, but I see exactly the same issue with 3.0.1 using its internal PMIx. It looks similar to issue #4542, but the corresponding patch in PR #4549 doesn’t seem to help (it just hangs in PMIx_Fence instead of PMIx_Disconnect).

Attached is the offending R script; it hangs in the “closeCluster” call. Has anyone seen this issue? I’m not sure what approach to take to debug it, but I have builds of the MPI libraries with --enable-debug available if needed.
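
In case it helps anyone trying to reproduce this without the attachment, the standard doMPI pattern below should exercise the same path (a hypothetical sketch, not the actual attached script; it assumes the workers are spawned dynamically, e.g. when launched as plain "Rscript repro.R"):

# repro.R -- hypothetical minimal doMPI session, not the user's attachment
library(doMPI)

cl <- startMPIcluster(count = 2)   # spawn two workers dynamically
registerDoMPI(cl)

# any trivial parallel work
x <- foreach(i = 1:4, .combine = c) %dopar% i^2
print(x)

closeCluster(cl)   # the hang is here, inside MPI_Comm_disconnect
mpi.quit()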

Cheers,
Ben
