Discussion:
[OMPI users] In preparation for 3.x
Eric Chamberland
2017-04-25 18:39:39 UTC
Hi,

just testing the 3.x branch... I launch:

mpirun -n 8 echo "hello"

and I get:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8 slots
that were requested by the application:
echo

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

I have to oversubscribe, so what do I have to do to bypass this
"limitation"?

Thanks,

Eric

configure log and ompi_info output:

http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt


here is the complete message:

[zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
[zorg:30036] plm:base:set_hnp_name: final jobfam 49136
[zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[zorg:30036] [[49136,0],0] plm:base:receive start comm
[zorg:30036] [[49136,0],0] plm:base:setup_job
[zorg:30036] [[49136,0],0] plm:base:setup_vm
[zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
[zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
[zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
[zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
[zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
[zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
[zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8 slots
that were requested by the application:
echo

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
[zorg:30036] [[49136,0],0] plm:base:receive stop comm
r***@open-mpi.org
2017-04-25 19:52:50 UTC
What is in your hostfile?
Eric Chamberland
2017-04-25 19:56:02 UTC
Hi,

the hostfile was constructed automatically by the configuration + installation
process and seems to contain only comments and a blank line:

(15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the default hostfile for Open MPI. Notice that it does not
# contain any hosts (not even localhost). This file should only
# contain hosts if a system administrator wants users to always have
# the same set of default hosts, and is not using a batch scheduler
# (such as SLURM, PBS, etc.).
#
# Note that this file is *not* used when running in "managed"
# environments (e.g., running in a job under a job scheduler, such as
# SLURM or PBS / Torque).
#
# If you are primarily interested in running Open MPI on one node, you
# should *not* simply list "localhost" in here (contrary to prior MPI
# implementations, such as LAM/MPI). A localhost-only node list is
# created by the RAS component named "localhost" if no other RAS
# components were able to find any hosts to run on (this behavior can
# be disabled by excluding the localhost RAS component by specifying
# the value "^localhost" [without the quotes] to the "ras" MCA
# parameter).

(15:53:52) [zorg]:~>

Thanks!

Eric
Post by r***@open-mpi.org
What is in your hostfile?
r***@open-mpi.org
2017-04-25 20:00:19 UTC
Okay - so effectively you have no hostfile, and no allocation. So this is running just on the one node where mpirun exists?

Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and let’s see what it finds.
Eric Chamberland
2017-04-25 20:31:59 UTC
Ok, here it is:

===================
first, with -n 8:
===================

mpirun -mca ras_base_verbose 10 --display-allocation -n 8 echo "Hello"

[zorg:22429] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[zorg:22429] plm:base:set_hnp_name: initial bias 22429 nodename hash 810220270
[zorg:22429] plm:base:set_hnp_name: final jobfam 40249
[zorg:22429] [[40249,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[zorg:22429] [[40249,0],0] plm:base:receive start comm
[zorg:22429] mca: base: components_register: registering framework ras components
[zorg:22429] mca: base: components_register: found loaded component loadleveler
[zorg:22429] mca: base: components_register: component loadleveler register function successful
[zorg:22429] mca: base: components_register: found loaded component slurm
[zorg:22429] mca: base: components_register: component slurm register function successful
[zorg:22429] mca: base: components_register: found loaded component simulator
[zorg:22429] mca: base: components_register: component simulator register function successful
[zorg:22429] mca: base: components_open: opening ras components
[zorg:22429] mca: base: components_open: found loaded component loadleveler
[zorg:22429] mca: base: components_open: component loadleveler open function successful
[zorg:22429] mca: base: components_open: found loaded component slurm
[zorg:22429] mca: base: components_open: component slurm open function successful
[zorg:22429] mca: base: components_open: found loaded component simulator
[zorg:22429] mca:base:select: Auto-selecting ras components
[zorg:22429] mca:base:select:( ras) Querying component [loadleveler]
[zorg:22429] [[40249,0],0] ras:loadleveler: NOT available for selection
[zorg:22429] mca:base:select:( ras) Querying component [slurm]
[zorg:22429] mca:base:select:( ras) Querying component [simulator]
[zorg:22429] mca:base:select:( ras) No component selected!
[zorg:22429] [[40249,0],0] plm:base:setup_job
[zorg:22429] [[40249,0],0] ras:base:allocate
[zorg:22429] [[40249,0],0] ras:base:allocate nothing found in module - proceeding to hostfile
[zorg:22429] [[40249,0],0] ras:base:allocate parsing default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
[zorg:22429] [[40249,0],0] hostfile: checking hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile for nodes
[zorg:22429] [[40249,0],0] ras:base:allocate nothing found in hostfiles - checking for rankfile
[zorg:22429] [[40249,0],0] ras:base:allocate nothing found in rankfile - inserting current node
[zorg:22429] [[40249,0],0] ras:base:node_insert inserting 1 nodes
[zorg:22429] [[40249,0],0] ras:base:node_insert updating HNP [zorg] info to 1 slots

====================== ALLOCATED NODES ======================
zorg: flags=0x01 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
[zorg:22429] [[40249,0],0] plm:base:setup_vm
[zorg:22429] [[40249,0],0] plm:base:setup_vm creating map
[zorg:22429] [[40249,0],0] setup:vm: working unmanaged allocation
[zorg:22429] [[40249,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
[zorg:22429] [[40249,0],0] hostfile: checking hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile for nodes
[zorg:22429] [[40249,0],0] plm:base:setup_vm only HNP in allocation
[zorg:22429] [[40249,0],0] plm:base:setting slots for node zorg by cores

====================== ALLOCATED NODES ======================
zorg: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
[zorg:22429] [[40249,0],0] complete_setup on job [40249,1]
[zorg:22429] [[40249,0],0] plm:base:launch_apps for job [40249,1]
[zorg:22429] [[40249,0],0] hostfile: checking hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile for nodes
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8 slots
that were requested by the application:
echo

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[zorg:22429] [[40249,0],0] plm:base:orted_cmd sending orted_exit commands
[zorg:22429] [[40249,0],0] plm:base:receive stop comm

===================
second, with -n 4:
===================
(16:31:23) [zorg]:~> mpirun -mca ras_base_verbose 10 --display-allocation -n 4 echo "Hello"

[zorg:22463] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[zorg:22463] plm:base:set_hnp_name: initial bias 22463 nodename hash 810220270
[zorg:22463] plm:base:set_hnp_name: final jobfam 40219
[zorg:22463] [[40219,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[zorg:22463] [[40219,0],0] plm:base:receive start comm
[zorg:22463] mca: base: components_register: registering framework ras components
[zorg:22463] mca: base: components_register: found loaded component loadleveler
[zorg:22463] mca: base: components_register: component loadleveler register function successful
[zorg:22463] mca: base: components_register: found loaded component slurm
[zorg:22463] mca: base: components_register: component slurm register function successful
[zorg:22463] mca: base: components_register: found loaded component simulator
[zorg:22463] mca: base: components_register: component simulator register function successful
[zorg:22463] mca: base: components_open: opening ras components
[zorg:22463] mca: base: components_open: found loaded component loadleveler
[zorg:22463] mca: base: components_open: component loadleveler open function successful
[zorg:22463] mca: base: components_open: found loaded component slurm
[zorg:22463] mca: base: components_open: component slurm open function successful
[zorg:22463] mca: base: components_open: found loaded component simulator
[zorg:22463] mca:base:select: Auto-selecting ras components
[zorg:22463] mca:base:select:( ras) Querying component [loadleveler]
[zorg:22463] [[40219,0],0] ras:loadleveler: NOT available for selection
[zorg:22463] mca:base:select:( ras) Querying component [slurm]
[zorg:22463] mca:base:select:( ras) Querying component [simulator]
[zorg:22463] mca:base:select:( ras) No component selected!
[zorg:22463] [[40219,0],0] plm:base:setup_job
[zorg:22463] [[40219,0],0] ras:base:allocate
[zorg:22463] [[40219,0],0] ras:base:allocate nothing found in module - proceeding to hostfile
[zorg:22463] [[40219,0],0] ras:base:allocate parsing default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
[zorg:22463] [[40219,0],0] hostfile: checking hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile for nodes
[zorg:22463] [[40219,0],0] ras:base:allocate nothing found in hostfiles - checking for rankfile
[zorg:22463] [[40219,0],0] ras:base:allocate nothing found in rankfile - inserting current node
[zorg:22463] [[40219,0],0] ras:base:node_insert inserting 1 nodes
[zorg:22463] [[40219,0],0] ras:base:node_insert updating HNP [zorg] info to 1 slots

====================== ALLOCATED NODES ======================
zorg: flags=0x01 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
[zorg:22463] [[40219,0],0] plm:base:setup_vm
[zorg:22463] [[40219,0],0] plm:base:setup_vm creating map
[zorg:22463] [[40219,0],0] setup:vm: working unmanaged allocation
[zorg:22463] [[40219,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
[zorg:22463] [[40219,0],0] hostfile: checking hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile for nodes
[zorg:22463] [[40219,0],0] plm:base:setup_vm only HNP in allocation
[zorg:22463] [[40219,0],0] plm:base:setting slots for node zorg by cores

====================== ALLOCATED NODES ======================
zorg: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
[zorg:22463] [[40219,0],0] complete_setup on job [40219,1]
[zorg:22463] [[40219,0],0] plm:base:launch_apps for job [40219,1]
[zorg:22463] [[40219,0],0] hostfile: checking hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile for nodes
[zorg:22463] [[40219,0],0] plm:base:launch wiring up iof for job [40219,1]
[zorg:22463] [[40219,0],0] plm:base:launch job [40219,1] is not a dynamic spawn
Hello
Hello
Hello
Hello
[zorg:22463] [[40219,0],0] plm:base:orted_cmd sending orted_exit commands
[zorg:22463] [[40219,0],0] plm:base:receive stop comm


Thanks!

Eric
Post by r***@open-mpi.org
-mca ras_base_verbose 10 --display-allocation
r***@open-mpi.org
2017-04-25 20:36:36 UTC
Okay, so what’s happening is that we are auto-detecting only 4 cores on that box, and since you didn’t provide any further info, we set the #slots = #cores. If you want to run more than that, you can either tell us a number of slots to use (e.g., -host mybox:32) or add --oversubscribe to the cmd line
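For example, a quick sketch using the node name from the logs above (either form should allow the 8-process run):

mpirun -host zorg:8 -n 8 echo "hello"
mpirun --oversubscribe -n 8 echo "hello"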
Eric Chamberland
2017-04-25 21:10:12 UTC
Post by r***@open-mpi.org
add --oversubscribe to the cmd line
good, it works! :)

Is there an environment variable equivalent to the --oversubscribe argument?

I can't find this option in the closely related FAQ entries; should it be added
here:

https://www.open-mpi.org/faq/?category=running#oversubscribing

or here:

https://www.open-mpi.org/faq/?category=running#force-aggressive-degraded

I was using:

export OMPI_MCA_mpi_yield_when_idle=1
export OMPI_MCA_hwloc_base_binding_policy=none

and it was ok before...

Thanks,

Eric
r***@open-mpi.org
2017-04-25 21:30:44 UTC
Sure - there is always an MCA param for everything: OMPI_MCA_rmaps_base_oversubscribe=1
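For example, a minimal sketch using the command from earlier in the thread:

export OMPI_MCA_rmaps_base_oversubscribe=1
mpirun -n 8 echo "hello"

The same parameter can also be given directly on the command line, e.g. "mpirun -mca rmaps_base_oversubscribe 1 -n 8 echo hello".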
Eric Chamberland
2017-04-25 20:36:58 UTC
Oh, I forgot something important:

since Open MPI 1.8.x I have been using:

export OMPI_MCA_hwloc_base_binding_policy=none

Also, I have been exporting this since 1.6.x, I think:

export OMPI_MCA_mpi_yield_when_idle=1

Eric
George Bosilca
2017-04-25 20:32:47 UTC
I can confirm a similar issue in a more managed environment. I have a hostfile
that has worked for the last few years and that spans a small cluster
(30 nodes of 8 cores each).

Trying to spawn any number of processes across P nodes fails if the number
of processes is larger than P (despite the fact that there are more than
enough resources, and that this information is provided via the hostfile).

George.


$ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by

[dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
[dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
[dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
[dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
[dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
[dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
[dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
[dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
[dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
[dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
[dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
[dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [simulator]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [slurm]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [loadleveler]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [tm]
[dancer.icl.utk.edu:13457] mca:base:select:( ras) No component selected!

====================== ALLOCATED NODES ======================
dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================

====================== ALLOCATED NODES ======================
dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
startup

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
George Bosilca
2017-04-25 20:35:20 UTC
Just to be clear, the hostfile contains the correct info:

dancer00 slots=8
dancer01 slots=8
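(Hypothetically, with that hostfile passed via --hostfile, one would expect something like "mpirun --hostfile <the hostfile> -np 16 ./a.out" to see 8 slots on each of dancer00 and dancer01.)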

The output regarding the 2 nodes (dancer00 and dancer01) is clearly wrong.

George.
dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
startup
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
Post by r***@open-mpi.org
Okay - so effectively you have no hostfile, and no allocation. So this is
running just on the one node where mpirun exists?
Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and
let’s see what it found
On Apr 25, 2017, at 12:56 PM, Eric Chamberland <
Hi,
the host file has been constructed automatically by the
configuration+installation process and seems to contain only comments and a
(15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the default hostfile for Open MPI. Notice that it does not
# contain any hosts (not even localhost). This file should only
# contain hosts if a system administrator wants users to always have
# the same set of default hosts, and is not using a batch scheduler
# (such as SLURM, PBS, etc.).
#
# Note that this file is *not* used when running in "managed"
# environments (e.g., running in a job under a job scheduler, such as
# SLURM or PBS / Torque).
#
# If you are primarily interested in running Open MPI on one node, you
# should *not* simply list "localhost" in here (contrary to prior MPI
# implementations, such as LAM/MPI). A localhost-only node list is
# created by the RAS component named "localhost" if no other RAS
# components were able to find any hosts to run on (this behavior can
# be disabled by excluding the localhost RAS component by specifying
# the value "^localhost" [without the quotes] to the "ras" MCA
# parameter).
(15:53:52) [zorg]:~>
Thanks!
Eric
Post by r***@open-mpi.org
What is in your hostfile?
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
r***@open-mpi.org
2017-04-25 20:53:49 UTC
Permalink
I suspect it read the file just fine - what you are seeing in the output is a reflection of the community’s design decision that only one slot would be allocated for each time a node is listed in -host. This is why they added the :N modifier, so you can specify the number of slots to use in lieu of writing the host name N times.
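
As a quick illustration (hypothetical host names, not a command from this thread), the following two invocations request the same layout:

  mpirun -np 8 --host nodeA:4,nodeB:4 ./a.out
  mpirun -np 8 --host nodeA,nodeA,nodeA,nodeA,nodeB,nodeB,nodeB,nodeB ./a.out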

If this isn’t what you feel it should do, then please look at the files in orte/util/dash_host and feel free to propose a modification to the behavior. I personally am not bound to any particular answer, but I really don’t have time to address it again.
Post by George Bosilca
dancer00 slots=8
dancer01 slots=8
The output regarding the 2 nodes (dancer00 and dancer01) is clearly wrong.
George.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
George Bosilca
2017-04-25 21:13:24 UTC
Permalink
Thanks Ralph,

Indeed, if I add :8 I get back the expected behavior. I can cope with this
(I don't usually restrict my runs to a subset of the nodes).
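
Concretely, the command from earlier in the thread goes through once the slot counts are spelled out; a sketch (the application name is a placeholder):

  mpirun -np 4 --host dancer00:8,dancer01:8 ./a.out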

George.
Post by r***@open-mpi.org
I suspect it read the file just fine - what you are seeing in the output
is a reflection of the community’s design decision that only one slot would
be allocated for each time a node is listed in -host. This is why they
added the :N modifier so you can specify the #slots to use in lieu of
writing the host name N times
If this isn’t what you feel it should do, then please look at the files in
orte/util/dash_host and feel free to propose a modification to the
behavior. I personally am not bound to any particular answer, but I really
don’t have time to address it again.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
r***@open-mpi.org
2017-04-25 21:29:39 UTC
Permalink
If it helps, I believe I added the ability to just use ‘:*’ to indicate “take them all” so you don’t have to remember the number.
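
For example, assuming a build where that modifier is available (the quotes keep the shell from trying to glob the asterisks; ./a.out is a placeholder):

  mpirun -np 16 --host 'dancer00:*,dancer01:*' ./a.out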
Post by George Bosilca
Thanks Ralph,
Indeed, if I add :8 I get back the expected behavior. I can cope with this (I don't usually restrict my runs to a subset of the nodes).
George.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
r***@open-mpi.org
2017-04-25 20:40:04 UTC
Permalink
Sigh - that _is_ the requested behavior. The -host option defaults to indicating only one slot should be used on that node.
Post by George Bosilca
I confirm a similar issue in a more managed environment. I have a hostfile that has worked for the last few years and that spans a small cluster (30 nodes of 8 cores each).
Trying to spawn any number of processes across P nodes fails if the number of processes is larger than P (despite the fact that there are more than enough resources, and that this information is provided via the hostfile).
George.
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
startup
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
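
(For readers hitting the same message: the 3.x series appears not to oversubscribe by default, so running more ranks than the detected slots has to be requested explicitly. A minimal sketch for the single-node case from the original report, assuming the standard mpirun oversubscription flag is available in this build:

mpirun --oversubscribe -n 8 echo "hello"

The slot count can also be raised in a hostfile instead; see the illustrative hostfile line after the default-hostfile listing below.)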
Okay - so effectively you have no hostfile, and no allocation. So this is running just on the one node where mpirun exists?
Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and let’s see what it found
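
Applied to the original single-node test, the suggested diagnostic invocation would look something like this (the rank count and the echo command are carried over from the first report):

mpirun -mca ras_base_verbose 10 --display-allocation -n 8 echo "hello"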
Post by Eric Chamberland
Hi,
(15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the default hostfile for Open MPI. Notice that it does not
# contain any hosts (not even localhost). This file should only
# contain hosts if a system administrator wants users to always have
# the same set of default hosts, and is not using a batch scheduler
# (such as SLURM, PBS, etc.).
#
# Note that this file is *not* used when running in "managed"
# environments (e.g., running in a job under a job scheduler, such as
# SLURM or PBS / Torque).
#
# If you are primarily interested in running Open MPI on one node, you
# should *not* simply list "localhost" in here (contrary to prior MPI
# implementations, such as LAM/MPI). A localhost-only node list is
# created by the RAS component named "localhost" if no other RAS
# components were able to find any hosts to run on (this behavior can
# be disabled by excluding the localhost RAS component by specifying
# the value "^localhost" [without the quotes] to the "ras" MCA
# parameter).
(15:53:52) [zorg]:~>
Thanks!
Eric
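
(The default hostfile above contains only comments, so it contributes no hosts or slots. For illustration only, a one-line hostfile that would provide 8 slots on the reporting node could look like the following; the hostname and slot count are taken from the report, not a recommendation:

zorg slots=8

With such a file passed via --hostfile, the 8-rank run would no longer need to oversubscribe.)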
Post by r***@open-mpi.org
What is in your hostfile?
Post by Eric Chamberland
Hi,
mpirun -n 8 echo "hello"
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8 slots
that were requested by the application:
echo
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
I have to oversubscribe, so what do I have to do to bypass this "limitation"?
Thanks,
Eric
http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
[zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
[zorg:30036] plm:base:set_hnp_name: final jobfam 49136
[zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[zorg:30036] [[49136,0],0] plm:base:receive start comm
[zorg:30036] [[49136,0],0] plm:base:setup_job
[zorg:30036] [[49136,0],0] plm:base:setup_vm
[zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
[zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
[zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
[zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
[zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
[zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
[zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8 slots
that were requested by the application:
echo
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
[zorg:30036] [[49136,0],0] plm:base:receive stop comm