A M
2017-08-09 19:41:00 UTC
Hello,
I have just run into a strange issue with "mpirun". Here is what happened:
I successfully installed Torque 6.1.1.1 with the plain pbs_sched on a
minimal set of 2 IB nodes. Then I added openmpi 2.1.1 compiled with verbs
and tm, and verified that mpirun works as it should with a small
"pingpong" program.
Here is my Torque minimal jobscript which I used to check the IB message
passing:
#!/bin/sh
#PBS -o Out
#PBS -e Err
#PBS -l nodes=2:ppn=1
cd $PBS_O_WORKDIR
mpirun -np 2 -pernode ./pingpong 4000000
The job correctly used IB as the default message passing interface and
reached 3.6 GB/sec "pingpong" bandwidth, which is correct in my case, since
the two batch nodes have QDR HCAs.
I then stopped "pbs_sched" and started the Maui 3.3.1 scheduler
instead. Serial jobs work without any problem, but the same jobscript
now fails with the following message:
--------
Your job has requested more processes than the ppr for this topology can
support:
App: /lustre/work/user/testus/pingpong
Number of procs: 2
PPR: 1:node
Please revise the conflict and try again.
--------
I then tried playing with the "--nooversubscribe" and "--pernode 2"
options, but the error persisted. It looks like the newest "mpirun" is
getting some information from the latest available Maui scheduler. Going
back to "pbs_sched" is enough to make everything work like a charm. I used
a preexisting "maui.cfg" file which still works well on an older CentOS 6
system with an old 1.8.5 version of openmpi.
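To narrow this down, one thing that could be compared between the two
schedulers is the node list the job actually receives. Below is a minimal
sketch of such a check, meant to be placed in the jobscript before the
"mpirun" line. The helper name count_nodes is hypothetical; PBS_NODEFILE is
the file Torque sets for each job. Note that an mpirun built with tm reads
the allocation through the TM interface rather than from this file, so the
nodefile is only a proxy for what mpirun sees.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct hosts listed in a Torque nodefile.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a job Torque sets PBS_NODEFILE; outside a job this part is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "distinct nodes allocated: $(count_nodes "$PBS_NODEFILE")"
    # Show how many slots were granted on each host.
    sort "$PBS_NODEFILE" | uniq -c
fi
```

If this reports a single node under Maui but two under pbs_sched, that
would be consistent with the "PPR: 1:node" error above, although this is
only a guess. Adding "--display-allocation" to the mpirun command line
should also show the allocation as mpirun itself detects it.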
Thanks in advance for any hint or comment on how to address this. Are there
any other mpirun options to try? Should I try to downgrade openmpi to the
latest 1.X series?
Andy.
mpirun -np 2 -pernode --mca btl ^tcp ./pingpong 4000000