Discussion:
[OMPI users] Why might MPI_Recv trip PSM_MQ_RECVREQS_MAX?
Jonathan Wesley Stone
2010-03-07 21:17:33 UTC
Hi,

My supercomputer has Open MPI 1.4. I am running into a frustrating
problem with my MPI program. I am using only the following calls,
which I expect to be blocking:
MPI_Wtime
MPI_Error_string
MPI_Abort
MPI_Send
MPI_Get_count
MPI_Recv
MPI_Probe
MPI_Init
MPI_Comm_rank
MPI_Comm_size
MPI_Finalize

Somehow I am getting this error when I do a large number of sequential
communications: "c002:2.0.Exhausted 1048576 MQ irecv request
descriptors, which usually indicates a user program error or
insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)"

This seems counter-intuitive to me, because I should not be using
irecvs at all: I am relying specifically on the documented blocking
behavior of MPI_Recv (not MPI_Irecv, which I am not using).

My main program is quite large; however, I have managed to replicate
the irritating behavior in a much smaller program, which executes a
number of MPI_Send or MPI_Recv calls in a loop. By default the program
runs 2,000,000 iterations. When I turn that up to 20,000,000, it
generates the PSM_MQ_RECVREQS_MAX error after a short time.
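
In outline, the loop boils down to something like this (a
stripped-down sketch of the pattern, not the full program):

    /* Stripped-down sketch of the test (run with: mpirun -np 2 ./testprog).
       Rank 0 posts many blocking MPI_Sends; rank 1 matches them with
       blocking MPI_Recvs. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        long i, iters = (argc > 1) ? atol(argv[1]) : 2000000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < iters; i++) {
            int value = (int)i;
            if (rank == 0) {
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Status status;
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            }
        }

        MPI_Finalize();
        return 0;
    }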

I would appreciate it if anyone could advise me on why this might be
happening in this test case -- basically, what is going on that causes
my presumably blocking MPI_Recv calls to "accumulate" such a large
number of "irecv request descriptors"? I would expect each call to
block, to be resolved as soon as the matching MPI_Send is posted, and
for the descriptor count to go down again at that point.

I appreciate your assistance. Thank you!

Jonathan Stone
Research Assistant, U. Oklahoma
Rainer Keller
2010-03-08 14:22:10 UTC
Hello Jonathan,
You are using InfiniPath's PSM library, and therefore the corresponding
MTL (mtl/psm) and the matching upper-layer PML (pml/cm).
In this stack, MPI_Recv _is_ in fact implemented by calling into PSM's
irecv() function, which explains the error triggered inside the PSM
library.

Not knowing the degree of parallelism of your application otherwise:
apart from trying to increase the maximum number of receive requests via
the PSM_MQ_RECVREQS_MAX environment variable, you might want to change
some of the master's sends to the synchronous MPI_Ssend(), so the sender
cannot run arbitrarily far ahead of the receivers. Two sketches follow.
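
For example, to raise the limit for one run (the new value is an
arbitrary choice, and ./testprog stands for your test binary; mpirun's
-x option exports the variable to all ranks):

    mpirun -np 2 -x PSM_MQ_RECVREQS_MAX=4194304 ./testprog

And to throttle the sender, something like this in the send loop of
your test (the interval 1000 is arbitrary):

    /* Every 1000th send is synchronous, so the sender waits until
       the receiver has caught up to that message before continuing. */
    if (i % 1000 == 0)
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);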

On the other hand, the example code you posted could be written
differently: e.g. collect multiple random numbers into one
communication, or use collective communication (here, with
sub-communicators containing the master and the sources, and the master
and the targets). All of these would reduce the pressure on the master;
a sketch of the batching variant follows.
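
A sketch of the batching idea, based on the loop of the test program
above (BATCH is an arbitrary size, and the sketch assumes iters is a
multiple of BATCH):

    /* One message now carries BATCH values, so PSM sees BATCH times
       fewer receive requests for the same amount of data. */
    #define BATCH 1024
    int buf[BATCH];
    int k;
    long j;

    for (j = 0; j < iters; j += BATCH) {
        if (rank == 0) {
            for (k = 0; k < BATCH; k++)
                buf[k] = rand();        /* collect the values first */
            MPI_Send(buf, BATCH, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, BATCH, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

Batching alone already divides the request count by BATCH; the
collective variant would reduce the master's bookkeeping further.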

Hope this helps.

Best regards,
Rainer
--
------------------------------------------------------------------------
Rainer Keller, PhD           Tel: +1 (865) 241-6293
Oak Ridge National Lab       Fax: +1 (865) 241-4811
PO Box 2008 MS 6164          Email: ***@ornl.gov
Oak Ridge, TN 37831-2008     AIM/Skype: rusraink