Adam LeBlanc
2018-11-01 16:05:39 UTC
Hello, I am an employee of the UNH InterOperability Lab, and we are in the
process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have
purchased some new hardware that has one processor, and noticed an issue
when running MPI jobs across nodes that do not have similar processor
counts. If we launch the MPI job from a node that has 2 processors, it
fails, stating that there are not enough slots, and will not start the run:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 14 slots
that were requested by the application:

  IMB-MPI1

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------
If we launch the MPI job from the node with one processor, without changing
the mpirun command at all, it runs as expected. Here is the command being
run:

    mpirun --mca btl_openib_warn_no_device_params_found 0 \
           --mca orte_base_help_aggregate 0 \
           --mca btl openib,vader,self --mca pml ob1 \
           --mca btl_openib_receive_queues P,65536,120,64,32 \
           -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

Here is the hostfile being used:
    farbauti-ce.ofa.iol.unh.edu slots=1
    hyperion-ce.ofa.iol.unh.edu slots=1
    io-ce.ofa.iol.unh.edu slots=1
    jarnsaxa-ce.ofa.iol.unh.edu slots=1
    rhea-ce.ofa.iol.unh.edu slots=1
    tarqeq-ce.ofa.iol.unh.edu slots=1
    tarvos-ce.ofa.iol.unh.edu slots=1
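In case it helps with triage: the hostfile above provides exactly 7 slots,
so pinning the rank count with -np should take the automatic slot
calculation out of the picture. The following is only a sketch of what we
could try, not the original failing command (the openib MCA parameters are
omitted here for brevity):

    # Request exactly the 7 slots the hostfile defines, bypassing the
    # automatic rank-count calculation:
    mpirun -np 7 --mca btl openib,vader,self --mca pml ob1 \
           -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

    # Or, to deliberately run more ranks than slots, allow oversubscription
    # explicitly (supported by recent Open MPI releases):
    mpirun -np 14 --oversubscribe --mca btl openib,vader,self --mca pml ob1 \
           -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

If the -np 7 form still failed only when launched from the 2-processor
node, that would point at the launch node's slot accounting rather than at
the hostfile itself.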
This seems like a bug, and we would like some help explaining and fixing
what is happening. The IBTA plugfest saw similar behaviours, so this should
be reproducible.

Thanks,
Adam LeBlanc