Discussion:
[OMPI users] Error using hpcc benchmark
wodel youchi
2017-01-31 13:25:37 UTC
Hi,

I am a newbie in the HPC world.

I am trying to execute the hpcc benchmark on our cluster, but every time I
start the job I get this error and then the job exits:

compute017.22840Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute024.22840Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
compute019.22847Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[19601,1],272]
  Exit code: 255
--------------------------------------------------------------------------

Platform: IBM PHPC
OS: RHEL 6.5
one management node
32 compute nodes: 16 cores, 32GB RAM, Intel QLogic QLE7340 one-port QDR
InfiniBand 40Gb/s

I compiled hpcc against: IBM MPI, Open MPI 2.0.1 (compiled with gcc 4.4.7),
and Open MPI 1.8.1 (compiled with gcc 4.4.7).

I get the errors every time, but each time on different compute nodes.

This is the command I used to start the job:

mpirun -np 512 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

Any help will be appreciated, and if you need more details, let me know.
Thanks in advance.


Regards.
Howard Pritchard
2017-01-31 14:38:04 UTC
Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set the PSM env. variable:

export PSM_MQ_RECVREQS_MAX=10000000

or something like that.

Alternatively launch the job using

mpirun --mca pml ob1 --host ....

to avoid use of psm. Performance will probably suffer with this option
however.
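
In practice that would be something like the following (just a sketch; -x forwards the variable to the remote ranks, since a plain export on the head node may not reach them, and the value is only a starting point):

mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

or, to bypass PSM entirely:

mpirun -np 512 --mca pml ob1 --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt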

Howard
Cabral, Matias A
2017-01-31 16:55:05 UTC
Hi Wodel,

As Howard mentioned, this is probably because many ranks are sending to a single one and exhausting the receive request MQ. You can individually enlarge the receive/send request queues with the specific variables (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with PSM_MEMORY=max. Note that the PSM library will allocate more system memory for the queues.
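
For example, something along these lines (the values are only illustrative, and again, larger queues consume more host memory):

mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=4194304 -x PSM_MQ_SENDREQS_MAX=4194304 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

or, to grow both at once:

mpirun -np 512 -x PSM_MEMORY=max --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt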

Thanks,

_MAC

wodel youchi
2017-02-01 11:35:34 UTC
Hi,

Thank you for your replies, but :-) it didn't work for me.

Using hpcc compiled with Open MPI 2.0.1:
I tried to use export PSM_MQ_RECVREQS_MAX=10000000 as mentioned by Howard,
but the job didn't take the export into account (I am starting the job from
the home directory of a user; the home directory is shared by NFS with all
compute nodes).
I also tried using .bash_profile to export the variable, but the job didn't
take it into account either; I got the same error:


Exhausted 1048576 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=1048576)
And as I mentioned before, each time on different node(s).


From the help of the mpirun command, I read that to pass an environment
variable we have to use -x with the command, i.e.:

mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=10000000 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt

But when I tested it, I got these errors:

PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: Ran out of memory

I tested with lower values; the only one that worked for me is 2097152,
which is 2 times the default value of PSM_MQ_RECVREQS_MAX. But even with
this value I get the same error (now reporting the new value), and the job
exits:

Exhausted 2097152 MQ irecv request descriptors, which usually indicates a user program error or insufficient request descriptors (PSM_MQ_RECVREQS_MAX=2097152)

PS: for Cabral, I didn't find any way to know the default value of PSM_MEMORY
in order to modify it.

Any idea? Could this be a problem with the InfiniBand configuration?

Does the MTU have anything to do with this problem?

ibv_devinfo
hca_id: qib0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 0011:7500:0070:59a6
sys_image_guid: 0011:7500:0070:59a6
vendor_id: 0x1175
vendor_part_id: 29474
hw_ver: 0x2
board_id: InfiniPath_QLE7340
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)

max_mtu: 4096 (5)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand



Regards.
Cabral, Matias A
2017-02-01 21:12:24 UTC
Hi Wodel,

As you already figured out, mpirun -x <ENV_VAR=value> ... is the right way to do it, so the PSM library will read the values when initializing on every node.
The default value for "PSM_MEMORY" is "normal" and you may change it to "large". If you want to look inside the code, it is at https://github.com/01org/psm . One useful variable to play with is PSM_TRACEMASK (only set it on the head node) to see what values are being used. I think 0xffff will dump lots of info.
As I mentioned before, playing with the size of the MQ is tricky since it will be using system memory. I think this will be a combination of a) the total number of ranks and the number of ranks per node, b) the memory on the hosts, and c) the HPCC parameters. The larger the number of ranks, the more ranks may be transmitting simultaneously to a single node (I would assume a reduction), and a node could be posting receives at a faster rate than it is completing them, so it will need a bigger MQ, and therefore more memory. Would you share the number of ranks per node, the number of nodes, and the memory per node so we have an idea? A quick test could be to start with a very small number of ranks and see if it runs.
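
A sketch of what that quick test might look like (one rank per node, with P, Q and Ns in hpccinf.txt scaled down to match; PSM_TRACEMASK exported only in the shell on the head node, per the note above):

export PSM_TRACEMASK=0xffff
mpirun -np 32 --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt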

Thanks,
Regards,

_MAC

wodel youchi
2017-02-02 16:24:40 UTC
Hi Cabral, and thank you.

I started the hpcc benchmark using -x PSM_MEMORY=large without any error.
The test hasn't finished yet, but I waited about 10 minutes and this time
there were no errors. I even increased the Ns variable in hpccinf.txt and
started the test again without a problem.
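
For reference, the command line that now runs without the error is essentially the same as before, with the variable forwarded:

mpirun -np 512 -x PSM_MEMORY=large --mca mtl psm --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt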

The cluster is composed of:
- one management node
- 32 compute nodes, each with 16 cores (2 sockets x 8 cores), 32GB of RAM,
and an Intel QLE7340 one-port InfiniBand 40Gb/s card

I used this site to generate the input file for hpcc:
http://www.advancedclustering.com/act-kb/tune-hpl-dat-file/
with some modifications:

1 # of problems sizes (N)
331520 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
16 Ps
32 Qs

The Ns here represents almost 90% of the total memory of the cluster. The
total number of processes is 512; each node will start 16 processes, 1 per
core.
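
(For reference, the rule of thumb behind that generator is N ≈ sqrt(memory_fraction x total_RAM_in_bytes / 8), since HPL factors an N x N matrix of 8-byte doubles. With 32 nodes x 32 GiB = 1 TiB of RAM and a memory fraction of roughly 0.8, which appears to be what the generator used here, sqrt(0.8 x 1024 x 2^30 / 8) ≈ 331,589; rounded down to a multiple of NB = 128 that is 331,520, so this Ns actually works out to roughly 80% of total memory.)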

Before modifying the PSM_MEMORY value, the test exited with the mentioned
error, even with lower values of Ns.

I find it weird that there is no mention of this variable anywhere on the
net, not even in the Intel True Scale OFED+ documentation!

Thanks again.