Discussion:
[OMPI users] Startup limited to 128 remote hosts in some situations?
Mark Dixon
2017-01-17 17:37:43 UTC
Hi,

While commissioning a new cluster, I wanted to run HPL across the whole
thing using openmpi 2.0.1.

I couldn't get it to start on more than 129 hosts under Son of Gridengine
(128 remote plus the localhost running the mpirun command). openmpi would
sit there, waiting for all the orteds to check in; however, there were
"only" a maximum of 128 qrsh processes, therefore a maximum of 128 orteds,
and therefore a very long wait.

Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job
to launch.
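
For reference, the parameter can be raised on the mpirun command line or via
the matching environment variable; the value 256 below is only an example,
and xhpl stands in for whatever binary is being launched:

mpirun --mca plm_rsh_num_concurrent 256 -np <nprocs> ./xhpl
# or, equivalently, before calling mpirun:
export OMPI_MCA_plm_rsh_num_concurrent=256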

Is this intentional, please?

Doesn't openmpi use a tree-like startup sometimes - any particular reason
it's not using it here?

Cheers,

Mark
r***@open-mpi.org
2017-01-17 17:56:54 UTC
As I recall, the problem was that qrsh isn’t available on the backend compute nodes, and so we can’t use a tree for launch. If that isn’t true, then we can certainly adjust it.
William Hay
2017-01-18 11:29:38 UTC
qrsh should be available on all nodes of a SoGE cluster but, depending on how things are set up,
may not be findable (i.e. not in the PATH) when you qrsh -inherit into a node. A workaround would
be to start backend processes with qrsh -inherit -v PATH, which copies the PATH from the master
node to the slave node process, or otherwise pass the location of qrsh from one node to another.
That of course assumes that qrsh is in the same location on all nodes.
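
In Open MPI terms that would presumably mean overriding the launch agent, something like the
sketch below (untested here; I'm assuming plm_rsh_agent is the relevant MCA parameter, and
xhpl is just a placeholder binary):

mpirun --mca plm_rsh_agent "qrsh -inherit -nostdin -v PATH" -np <nprocs> ./xhpl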

I've tested that it is possible to qrsh from the head node of a job to a slave node and then on to
another slave node by this method.
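
Roughly, such a test looks like this (node names are placeholders, taken from the job's
$PE_HOSTFILE):

# from the job's master node, hop to one slave node and from there to a second one:
qrsh -inherit -v PATH node1 qrsh -inherit -v PATH node2 hostname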

William
r***@open-mpi.org
2017-01-20 01:29:23 UTC
I’ll create a patch that you can try - if it works okay, we can commit it
r***@open-mpi.org
2017-01-20 14:38:26 UTC
Well, it appears we are already forwarding all envars, which should include PATH. Here is the qrsh command line we use:

"qrsh --inherit --nostdin -V"
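
If you want to confirm what your build registers for the launcher, this should list the
rsh/qrsh parameters, including plm_rsh_num_concurrent:

ompi_info --param plm rsh --level 9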

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
         }
         mca_plm_rsh_component.agent = tmp;
         mca_plm_rsh_component.using_qrsh = true;
-        /* no tree spawn allowed under qrsh */
-        mca_plm_rsh_component.no_tree_spawn = true;
         goto success;
     } else if (!mca_plm_rsh_component.disable_llspawn &&
                NULL != getenv("LOADL_STEP_ID")) {
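
In case it helps, applying and rebuilding would look roughly like this (the patch file name,
install prefix and process count are only placeholders):

cd openmpi-2.0.1
patch -p1 < qrsh-tree-spawn.patch    # the diff above, saved to a file
./configure --prefix=$HOME/ompi-2.0.1-test && make -j && make install
# re-run without bumping plm_rsh_num_concurrent; launch verbosity should show
# whether the daemons are now being spawned through the tree:
mpirun --mca plm_base_verbose 5 -np <nprocs> ./xhpl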
Mark Dixon
2017-01-24 14:51:49 UTC
Hi,

It works for me :)

Thanks!

Mark
--
-------------------------------------------------------------------
Mark Dixon Email : ***@leeds.ac.uk
Advanced Research Computing (ARC) Tel (int): 35429
IT Services building Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-------------------------------------------------------------------