[OMPI users] alltoallv
Michael Di Domenico
2017-10-10 15:57:51 UTC
i'm getting stuck trying to run some fairly large IMB-MPI alltoall
tests under openmpi 2.0.2 on rhel 7.4

i have two different clusters, one running mellanox fdr10 and one
running qlogic qdr

if i issue

mpirun -n 1024 ./IMB-MPI1 -npmin 1024 -iter 1 -mem 2.001 alltoallv

the job just stalls after IMB-MPI prints the
"List of Benchmarks to run: Alltoallv" line

if i switch it to alltoall the test does progress

often when running alltoalls of various sizes i'll get

"too many retries sending message to <>:<>, giving up"

i'm able to use infiniband just fine (our lustre filesystem mounts
over it) and i have other mpi programs running

the problem only seems to appear when i run alltoall-type primitives

any thoughts on how to debug where the failures are? i might just need to
turn up the debugging output, but i'm not sure where
Peter Kjellström
2017-10-11 16:04:12 UTC
On Tue, 10 Oct 2017 11:57:51 -0400
Post by Michael Di Domenico
i'm getting stuck trying to run some fairly large IMB-MPI alltoall
tests under openmpi 2.0.2 on rhel 7.4
What is the IB stack used, just RHEL inbox?

Do you run Open MPI with the psm mtl on the QLogic cluster and the openib
btl on the Mellanox cluster, or something different?
Post by Michael Di Domenico
i have two different clusters, one running mellanox fdr10 and one
running qlogic qdr
if i issue
mpirun -n 1024 ./IMB-MPI1 -npmin 1024 -iter 1 -mem 2.001 alltoallv
Does it work if you run with a limit that more obviously fits in RAM,
such as "-mem 0.2"?
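
To make the RAM question concrete, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not confirmed anywhere in this thread: that each rank in an all-to-all holds one send and one receive buffer of nranks * message-size bytes, that the -mem value is counted in units of 2^30 bytes, and that the benchmark doubles message sizes up to 4 MiB. The function names are made up for illustration and are not part of IMB or Open MPI.

```python
GIB = 1024 ** 3  # assumption: -mem is counted in units of 2**30 bytes


def alltoall_buffer_bytes(nranks, msg_bytes):
    """Send plus receive buffer one rank needs for an all-to-all.

    Each rank sends msg_bytes to every rank and receives msg_bytes
    from every rank, so both buffers are nranks * msg_bytes long.
    """
    return 2 * nranks * msg_bytes


def largest_message_within(nranks, mem_limit_bytes,
                           max_msg_bytes=4 * 1024 * 1024):
    """Largest power-of-two message size whose buffers fit the limit.

    max_msg_bytes is an assumed upper bound of 4 MiB; halve from
    there until the per-rank buffers fit under mem_limit_bytes.
    """
    msg = max_msg_bytes
    while msg > 1 and alltoall_buffer_bytes(nranks, msg) > mem_limit_bytes:
        msg //= 2
    return msg


if __name__ == "__main__":
    nranks = 1024
    for mem in (2.001, 0.2):
        limit = int(mem * GIB)
        msg = largest_message_within(nranks, limit)
        per_rank = alltoall_buffer_bytes(nranks, msg)
        print(f"-mem {mem}: largest message {msg} bytes, "
              f"{per_rank / GIB:.3f} GiB of buffers per rank")
```

Under these assumptions, 1024 ranks with a 2.001 limit admit messages up to 1 MiB (about 2 GiB of buffers per rank), while 0.2 caps them at 64 KiB. Multiply the per-rank figure by the number of ranks per node to see whether a node could plausibly hold it.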

/Peter K