Discussion:
[OMPI users] Is gridengine integration broken in openmpi 2.0.2?
Mark Dixon
2017-02-03 16:10:44 UTC
Permalink
Hi,

Just tried upgrading from 2.0.1 to 2.0.2 and I'm getting error messages
that look like openmpi is using ssh to login to remote nodes instead of
qrsh (see below). Has anyone else noticed gridengine integration being
broken, or am I being dumb?

I built with "./configure
--prefix=/apps/developers/libraries/openmpi/2.0.2/1/intel-17.0.1
--with-sge --with-io-romio-flags=--with-file-system=lustre+ufs
--enable-mpi-cxx --with-cma"

Can see the gridengine component via:

$ ompi_info -a | grep gridengine
MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
MCA ras gridengine: ---------------------------------------------------
MCA ras gridengine: parameter "ras_gridengine_priority" (current value: "100", data source: default, level: 9 dev/all, type: int)
Priority of the gridengine ras component
MCA ras gridengine: parameter "ras_gridengine_verbose" (current value: "0", data source: default, level: 9 dev/all, type: int)
Enable verbose output for the gridengine ras component
MCA ras gridengine: parameter "ras_gridengine_show_jobid" (current value: "false", data source: default, level: 9 dev/all, type: bool)

Cheers,

Mark

ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
Reuti
2017-02-03 16:16:35 UTC
Permalink
Hi,
Is SGE itself perhaps configured to use SSH? (I mean the rsh_command and rsh_daemon entries in `qconf -sconf`.)

-- Reuti
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Mark Dixon
2017-02-03 16:49:34 UTC
Permalink
On Fri, 3 Feb 2017, Reuti wrote:
...
Post by Reuti
SGE on its own is not configured to use SSH? (I mean the entries in
`qconf -sconf` for rsh_command resp. daemon).
...

Nope, everything left as the default:

$ qconf -sconf | grep _command
qlogin_command builtin
rlogin_command builtin
rsh_command builtin

I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2
isn't.

I'll start digging, but I'd appreciate it if any other SGE user who has tried
2.0.2 could tell me whether it worked for them, please? :)

Cheers,

Mark
r***@open-mpi.org
2017-02-03 16:56:31 UTC
Permalink
I do see a diff between 2.0.1 and 2.0.2 that might have a related impact. The way we handled the MCA param that specifies the launch agent (ssh, rsh, or whatever) was modified, and I don’t think the change is correct. It basically says that we don’t look for qrsh unless the MCA param has been changed from the coded default, which means we are not detecting SGE by default.

Try setting "-mca plm_rsh_agent foo" on your command line. That will get past the test, and then we should auto-detect SGE again.
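The check described above can be modelled with a small sketch. This is not the Open MPI source: the enum, the function name and the `in_sge_job` flag are hypothetical stand-ins, and the only behaviour taken from this thread is that 2.0.2 looks for qrsh only when `plm_rsh_agent` has been changed from its default.

```c
#include <stdbool.h>

/* Hypothetical stand-in for where an MCA parameter value came from. */
enum param_source { SOURCE_DEFAULT, SOURCE_USER };

/* Models the 2.0.2 behaviour as described in this thread: the qrsh
 * lookup is attempted only if plm_rsh_agent was changed from its
 * default. in_sge_job is an assumed flag meaning "running inside a
 * Grid Engine job". */
static bool looks_for_qrsh_2_0_2(enum param_source agent_source,
                                 bool in_sge_job)
{
    if (SOURCE_DEFAULT != agent_source) {
        return in_sge_job;  /* SGE is only detected on this path */
    }
    return false;           /* default agent: SGE detection is skipped */
}
```

With the default agent, `looks_for_qrsh_2_0_2(SOURCE_DEFAULT, true)` is false, which matches the ssh errors in the original report. Passing any agent, such as `foo`, corresponds to `SOURCE_USER`, and the lookup happens again inside an SGE job.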
Glenn Johnson
2017-02-03 17:43:14 UTC
Permalink
Is this the same issue that was previously fixed in PR-1960?

https://github.com/open-mpi/ompi/pull/1960/files


Glenn
r***@open-mpi.org
2017-02-04 01:02:59 UTC
Permalink
I don’t think so - at least, that isn’t the code I was looking at.
Mark Dixon
2017-02-06 12:38:16 UTC
Permalink
Post by r***@open-mpi.org
I do see a diff between 2.0.1 and 2.0.2 that might have a related
impact. The way we handled the MCA param that specifies the launch agent
(ssh, rsh, or whatever) was modified, and I don’t think the change is
correct. It basically says that we don’t look for qrsh unless the MCA
param has been changed from the coded default, which means we are not
detecting SGE by default.
Try setting "-mca plm_rsh_agent foo" on your cmd line - that will get
past the test, and then we should auto-detect SGE again
...

Ah-ha! "-mca plm_rsh_agent foo" fixes it!

Thanks very much - presumably I can stick that in the system-wide
openmpi-mca-params.conf for now.

Cheers,

Mark
Mark Dixon
2017-02-06 16:45:59 UTC
Permalink
On Mon, 6 Feb 2017, Mark Dixon wrote:
...
Post by Mark Dixon
Ah-ha! "-mca plm_rsh_agent foo" fixes it!
Thanks very much - presumably I can stick that in the system-wide
openmpi-mca-params.conf for now.
...

Except if I do that, it means running ompi outside of the SGE environment
no longer works :(

Should I just revert the following commit?

Cheers,

Mark

commit d51c2af76b0c011177aca8e08a5a5fcf9f5e67db
Author: Jeff Squyres <***@cisco.com>
Date: Tue Aug 16 06:58:20 2016 -0500

rsh: robustify the check for plm_rsh_agent default value

Don't strcmp against the default value -- the default value may change
over time. Instead, check to see if the MCA var source is not
DEFAULT.

Signed-off-by: Jeff Squyres <***@cisco.com>

(cherry picked from commit open-mpi/***@71ec5cfb436977ea9ad409ba634d27e6addf6fae)
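
The commit message contrasts two ways of deciding whether a parameter is still at its default. A minimal sketch of the difference follows, using a hypothetical `struct param` rather than Open MPI's real MCA variable API; the default strings in the usage note are assumptions chosen only for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical record of a parameter's value and how it was set. */
enum param_source { SOURCE_DEFAULT, SOURCE_USER };

struct param {
    const char *value;
    enum param_source source;
};

/* Brittle: relies on a literal that must track the real default. */
static bool is_default_by_value(const struct param *p,
                                const char *assumed_default)
{
    return 0 == strcmp(p->value, assumed_default);
}

/* Robust to a changed default: asks how the value was set. */
static bool is_default_by_source(const struct param *p)
{
    return SOURCE_DEFAULT == p->source;
}
```

If the built-in default changed from, say, "ssh" to "ssh : rsh", a comparison against the old literal would report an untouched parameter as non-default, while the source check would still report it as default. Conversely, a user who explicitly sets the value to the default string is seen as "default" by the string comparison but not by the source check.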
Glenn Johnson
2017-02-09 23:01:12 UTC
Permalink
Will this be fixed in the 2.0.3 release?

Thanks.


Glenn
Jeff Squyres (jsquyres)
2017-02-09 23:21:44 UTC
Permalink
Yes, we can get it fixed.

Ralph is unavailable this week; I don't know offhand what he meant by his prior remarks. It's possible that the culprit is https://github.com/open-mpi/ompi/commit/71ec5cfb436977ea9ad409ba634d27e6addf6fae; can you try changing the "!=" on the line shown below to "=="? I.e., from

if (MCA_BASE_VAR_SOURCE_DEFAULT != source) {

to

if (MCA_BASE_VAR_SOURCE_DEFAULT == source) {

I filed https://github.com/open-mpi/ompi/issues/2947 to track the issue.
--
Jeff Squyres
***@cisco.com
r***@open-mpi.org
2017-02-12 21:52:40 UTC
Permalink
Yeah, I’ll fix it this week. The problem is that the check on the source being the default is the wrong way round, and the default is ssh. So the only way to get the current code to check for qrsh is to specify something other than the default ssh. It doesn’t matter what you specify: anything will get you past the erroneous check, so that we then look for qrsh.
r***@open-mpi.org
2017-02-14 04:56:41 UTC
Permalink
I dug into this further, and the simplest solution for now is to do one of the following:

* replace the "!=" with "==" in the test, as Jeff indicated; or

* revert the commit Mark identified

Both options will restore the original logic. Given that someone already got it wrong, I have clarified the logic in the OMPI master repo. However, I don’t know how long it will be before a 2.0.3 release is issued, so GridEngine users might want to fix things locally in the interim.
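
Both options amount to the decision sketched below. This is a simplified model, not the Open MPI code: the enum, the function name and the `in_sge_job` flag are hypothetical, and using the configured agent unchanged whenever SGE is not auto-detected is an assumption about the intended behaviour.

```c
#include <stdbool.h>

/* Hypothetical stand-in for where an MCA parameter value came from. */
enum param_source { SOURCE_DEFAULT, SOURCE_USER };

/* Returns the launcher this model would pick.
 * agent is the plm_rsh_agent value; in_sge_job is an assumed flag
 * meaning "running inside a Grid Engine job". */
static const char *choose_launcher(enum param_source agent_source,
                                   const char *agent, bool in_sge_job)
{
    /* Restored sense of the test: auto-detect SGE only when the
     * agent was left at its default. */
    if (SOURCE_DEFAULT == agent_source && in_sge_job) {
        return "qrsh";
    }
    /* Assumption: otherwise the configured agent is used as given. */
    return agent;
}
```

Under this model a default-configured run inside SGE gets qrsh, and the same installation outside SGE gets ssh, which is the combination the system-wide "plm_rsh_agent foo" workaround could not provide.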