Discussion:
[OMPI users] Bug with Open-MPI Processor Count
Adam LeBlanc
2018-11-01 16:05:39 UTC
Hello, I am an employee of the UNH InterOperability Lab, and we are in the
process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
purchased some new hardware that has one processor, and we noticed an issue
when running MPI jobs across nodes that do not have similar processor counts.
If we launch the MPI job from a node that has 2 processors, it fails,
stating that there are not enough resources, and will not start the run, like
so:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 14 slots
that were requested by the application:

  IMB-MPI1

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
If we launch the MPI job from the node with one processor, without changing
the mpirun command at all, it runs as expected. Here is the command being run:

mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

Here is the hostfile being used:

farbauti-ce.ofa.iol.unh.edu slots=1
hyperion-ce.ofa.iol.unh.edu slots=1
io-ce.ofa.iol.unh.edu slots=1
jarnsaxa-ce.ofa.iol.unh.edu slots=1
rhea-ce.ofa.iol.unh.edu slots=1
tarqeq-ce.ofa.iol.unh.edu slots=1
tarvos-ce.ofa.iol.unh.edu slots=1

This seems like a bug, and we would like some help explaining and fixing what
is happening. The IBTA plugfest saw similar behaviours, so this should be
reproducible.

Thanks,
Adam LeBlanc
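As a quick sanity check on the slot math: the hostfile above advertises 7 slots in total (one per node), while the failing launch reports a request for 14. A minimal sketch of that tally (a hypothetical helper for illustration, not part of Open MPI):

```python
# Sum the slots advertised by an Open MPI-style hostfile.
# Hypothetical helper for illustration; not part of Open MPI itself.

def total_slots(hostfile_text: str) -> int:
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        for token in line.split()[1:]:  # first token is the hostname
            if token.startswith("slots="):
                total += int(token.split("=", 1)[1])
                break
        else:
            total += 1  # a bare hostname counts as one slot
    return total

hosts = """\
farbauti-ce.ofa.iol.unh.edu slots=1
hyperion-ce.ofa.iol.unh.edu slots=1
io-ce.ofa.iol.unh.edu slots=1
jarnsaxa-ce.ofa.iol.unh.edu slots=1
rhea-ce.ofa.iol.unh.edu slots=1
tarqeq-ce.ofa.iol.unh.edu slots=1
tarvos-ce.ofa.iol.unh.edu slots=1
"""

print(total_slots(hosts))  # 7, so a request for 14 slots cannot be satisfied
```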
Adam LeBlanc
2018-11-01 16:31:20 UTC
By the way, the Open MPI version is 3.1.2.

-Adam LeBlanc
Ralph H Castain
2018-11-01 17:00:12 UTC
Set rmaps_base_verbose=10 for debugging output

Sent from my iPhone
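This setting can be passed either as an mpirun MCA flag or as an environment variable; both forms below are sketches that reuse the hostfile path and binary from the original report (the full set of btl/pml flags from the thread is omitted for brevity):

```shell
# Option 1: pass the verbosity knob as an MCA parameter on the command line
mpirun --mca rmaps_base_verbose 10 \
       -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

# Option 2: export it as an environment variable before launching
export OMPI_MCA_rmaps_base_verbose=10
mpirun -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1
```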
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Adam LeBlanc
2018-11-01 17:55:19 UTC
Hello Ralph,

Attached below is the verbose output for a failing machine and a passing
machine.

Thanks,
Adam LeBlanc
Ralph H Castain
2018-11-01 18:26:21 UTC
I’m a little under the weather and so will only be able to help a bit at a time. However, a couple of things to check:

* add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought the allocation was

* is the hostfile available on every node?

Ralph
Post by Adam LeBlanc
Hello Ralph,
Attached below is the verbose output for a failing machine and a passing machine.
Thanks,
Adam LeBlanc
---------- Forwarded message ---------
Date: Thu, Nov 1, 2018 at 1:07 PM
Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
Set rmaps_base_verbose=10 for debugging output
Sent from my iPhone
Post by Adam LeBlanc
The version by the way for Open-MPI is 3.1.2.
-Adam LeBlanc
Hello,
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 14 slots
IMB-MPI1
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
If we launch the MPI job from the node with one processor, without changing the mpirun command at all, it runs as expected.
mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1
farbauti-ce.ofa.iol.unh.edu <http://farbauti-ce.ofa.iol.unh.edu/> slots=1
hyperion-ce.ofa.iol.unh.edu <http://hyperion-ce.ofa.iol.unh.edu/> slots=1
io-ce.ofa.iol.unh.edu <http://io-ce.ofa.iol.unh.edu/> slots=1
jarnsaxa-ce.ofa.iol.unh.edu <http://jarnsaxa-ce.ofa.iol.unh.edu/> slots=1
rhea-ce.ofa.iol.unh.edu <http://rhea-ce.ofa.iol.unh.edu/> slots=1
tarqeq-ce.ofa.iol.unh.edu <http://tarqeq-ce.ofa.iol.unh.edu/> slots=1
tarvos-ce.ofa.iol.unh.edu <http://tarvos-ce.ofa.iol.unh.edu/> slots=1
This seems like a bug and we would like some help to explain and fix what is happening. The IBTA plugfest saw similar behaviours, so this should be reproduceable.
Thanks,
Adam LeBlanc
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users <https://lists.open-mpi.org/mailman/listinfo/users>_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users <https://lists.open-mpi.org/mailman/listinfo/users><passing_verbose_output.txt><failing_verbose_output.txt>_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Adam LeBlanc
2018-11-01 18:56:09 UTC
Hello Ralph,

Here is the output for a failing machine:

[130_02:44:***@farbauti]{~}$ > mpirun --mca
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
--mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
ras_base_verbose 5 IMB-MPI1

====================== ALLOCATED NODES ======================
farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 7 slots
that were requested by the application:
10

Either request fewer slots for your application, or make more slots
available
for use.
--------------------------------------------------------------------------


Here is an output of a passing machine:

[1_02:54:***@hyperion]{~}$ > mpirun --mca
btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0
--mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues
P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca
ras_base_verbose 5 IMB-MPI1

====================== ALLOCATED NODES ======================
hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================


Yes, the hostfile is available on all nodes through an NFS mount of all of
our home directories.
Ralph H Castain
2018-11-01 19:52:33 UTC
Hmmm - try adding a value for nprocs instead of leaving it blank. Say “-np 7”

Sent from my iPhone
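A concrete form of this suggestion, grafted onto the command from the original report (a sketch only; the receive-queue and warning flags from the thread are elided):

```shell
# Explicitly request one process per hostfile slot (7 nodes, slots=1 each)
mpirun -np 7 --mca btl openib,vader,self --mca pml ob1 \
       -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1
```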
Adam LeBlanc
2018-11-02 15:06:01 UTC
Hello Ralph,

When I do -np 7 it still fails with "There are not enough slots available
in the system to satisfy the 7 slots that were requested by the
application". When I do -np 2, however, it will actually run from a machine
that was previously failing, but it will only run on one other machine; in
this case it ran from a machine with 2 processors to a machine with only 1
processor. If I make -np higher than 2 it also fails.

-Adam LeBlanc
Adam LeBlanc
2018-11-08 15:40:30 UTC
Hello Ralph,

Is there any update on this?

Thanks,
Adam LeBlanc