[OMPI users] OpenMPI + gadget/gizmo/arepo
Charles A Taylor
2018-05-23 16:30:27 UTC
I feel a little funny posting this, but I have now observed this problem across three different versions of OpenMPI (1.10.2, 2.0.3, 3.0.0) and have refrained from asking about it until now because we always had a workaround. That may no longer be the case, and I feel like I’m missing something obvious.

I’ve tried to summarize our system configuration as succinctly as possible below, but it is a pretty standard Linux cluster with an IB interconnect (Mellanox).

In short, we run many MPI applications (LAMMPS, VASP, NAMD, AMBER, ENZO, etc) successfully. However, the astrophysical galaxy modeling codes Arepo and Gizmo (both Gadget derivatives) seem to give us fits - deadlocking randomly after running for hours or days. I’ve tracked this down to a deadlock with some processes in MPI_Waitall() and others in MPI_Sendrecv(). I’ve looked at the code where the processes deadlock and can’t see any obvious issue. I also know that the same versions of the same codes are run on other, similar platforms at other sites (TACC, NASA, for example).
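For anyone trying to reproduce this kind of diagnosis: the post does not say how the stuck calls were identified, but one common approach is to attach gdb to each hung rank on a compute node and dump its stacks. The sketch below assumes gdb is installed on the node; dump_stacks is a hypothetical helper name and the gizmo process name in the usage comment is a placeholder.

```shell
# Hypothetical helper: print a backtrace of every thread in each PID given.
# Assumes gdb is available and you own the target processes.
dump_stacks() {
    for pid in "$@"; do
        echo "== PID $pid =="
        # -batch runs the given commands and exits without an interactive prompt.
        gdb -p "$pid" -batch -ex "thread apply all bt" 2>&1 \
            || echo "gdb failed for PID $pid"
    done
}

# Example use on a node with hung ranks (process name is a placeholder):
#   dump_stacks $(pgrep -u "$USER" gizmo) > stacks.$(hostname).txt
```

Comparing the resulting files across nodes shows which ranks sit in MPI_Waitall() and which in MPI_Sendrecv().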

While trying various things over the last few days I have learned that setting

export OMPI_MCA_btl_openib_flags="send,fetching-atomics,need-ack,need-csum,hetero-rdma"

seems to avoid the deadlocks. In other words, disabling RDMA read/write seems to avoid the deadlocks. Perhaps some RDMA read/write tuning is in order but I’ve had no success with that so far.
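As a sketch of how the same setting can be applied per job rather than globally: Open MPI accepts MCA parameters either through OMPI_MCA_<name> environment variables or through mpirun's --mca option. The flag list is the one from the post, which leaves out the put and get entries. The rank count and the ./gizmo binary name are placeholders, not taken from the post.

```shell
# Flag list from the post: the put/get (RDMA write/read) entries are omitted.
flags="send,fetching-atomics,need-ack,need-csum,hetero-rdma"

# Option 1: environment variable, picked up by every job started from this shell.
export OMPI_MCA_btl_openib_flags="$flags"

# Option 2: the same parameter on the mpirun command line, for a single job.
# (-np 64 and ./gizmo are placeholders.)
cmd="mpirun --mca btl_openib_flags $flags -np 64 ./gizmo"

echo "env:     OMPI_MCA_btl_openib_flags=$OMPI_MCA_btl_openib_flags"
echo "per-job: $cmd"

# To see the parameter's description and default on a given install:
#   ompi_info --param btl openib --level 9 | grep btl_openib_flags
```

The per-job form makes it easier to limit the workaround to Arepo and Gizmo while other applications keep the default settings.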

There are a couple of MPI related ifdefs in the code with regard to MPI_IN_PLACE and async sendrecv(). I’ve experimented with both. Prior to OpenMPI 3.0.0 the gizmo code would run without deadlocking if -DNO_ISEND_IRECV_IN_DOMAIN was used at build time. Under OpenMPI 3.0.0 that is no longer the case.

FWIW, I also know that gizmo runs (on our system) using Intel MPI (5.1.1), but I’m trying to avoid making that generally available since every other app we have works just fine with OpenMPI.

Anyone else have experience with these codes using OpenMPI (or otherwise)? Any comments or suggestions would be appreciated.


Charlie Taylor
UF Research Computing

Applications: Gadget derivatives gizmo and arepo
Problem: Random deadlocks in MPI_Waitall, MPI_Sendrecv
Platform: RedHat EL7 (and RedHat EL6 previously)
Systems: Dell SOS6320, Haswell (2 x CPU E5-2698 v3 @ 2.30GHz)
Interconnect: Mellanox ConnectX-3 FDR (OpenSM fabric manager)
IB Stack: RedHat EL7.4 native
OpenMPI: 3.0.0 (currently - see configure options below, but the problem has been persistent across versions)
Compilers: Intel Suite (various versions - 2016, 2017, 2018)

Build time configure options.
CFG_OPTS="$CFG_OPTS CC=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
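The post does not show how CFG_OPTS is consumed. Assuming it is eventually handed to ./configure, the escaped quotes around the FFLAGS value only take effect if the string is re-parsed, for example with eval. The sketch below illustrates the difference using a hypothetical count_args function as a stand-in for ./configure and a shortened option list.

```shell
# Stand-in for ./configure (hypothetical): report how many arguments arrived.
count_args() { echo "$#"; }

CFG_OPTS=""
CFG_OPTS="$CFG_OPTS FC=ifort FFLAGS=\"-O2 -g -warn -m64\" "
CFG_OPTS="$CFG_OPTS --enable-static"

# Plain expansion splits on every space; the embedded quotes stay literal,
# so the FFLAGS value is broken into four separate words.
plain=$(count_args $CFG_OPTS)

# eval re-parses the string, so the quoted FFLAGS value stays one argument.
evald=$(eval count_args $CFG_OPTS)

echo "plain: $plain arguments, eval: $evald arguments"
# prints "plain: 6 arguments, eval: 3 arguments"
```

With the real build this would correspond to something like eval ./configure $CFG_OPTS, though the actual invocation used at the site is not given.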
