Discussion:
[OMPI users] OpenMPI-2.1.0 problem with executing orted when using SGE
Heinz-Ado Arnolds
2017-03-22 09:44:24 UTC
Dear users and developers,

first of all, many thanks for all the great work you have done on OpenMPI!

Up to OpenMPI-1.10.6, orted was started on the remote nodes via SGE/qrsh. For example, running
mpirun -np 8 --map-by ppr:4:node ./myid
resulted in
/opt/sge-8.1.8/bin/lx-amd64/qrsh -inherit -nostdin -V <DNS-Name of Remote Machine> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess "env" -mca orte_ess_jobid "1621884928" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "1621884928.0;tcp://<IP-addr of Master>:41031" -mca plm "rsh" -mca rmaps_base_mapping_policy "ppr:4:node" --tree-spawn

With OpenMPI-2.1.0 (and its release candidates), ssh is used instead to start orted. For example, running
mpirun -np 8 --map-by ppr:4:node -mca mca_base_env_list OMP_NUM_THREADS=5 ./myid
now results in
/usr/bin/ssh -x <DNS-Name of Remote Machine> PATH=/afs/...../openmpi-2.1.0/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/afs/...../openmpi-2.1.0/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/afs/...../openmpi-2.1.0/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /afs/...../openmpi-2.1.0/bin/orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess "env" -mca ess_base_jobid "1626013696" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_hnp_uri "1626013696.0;usock;tcp://<IP-addr of Master>:43019" -mca plm_rsh_args "-x" -mca plm "rsh" -mca rmaps_base_mapping_policy "ppr:4:node" -mca pmix "^s1,s2,cray"

qrsh sets up the environment correctly on the remote side, so that environment variables from the job script are transferred. With the ssh variant the remote environment is not set up correctly, and there seem to be problems handling Kerberos tickets and/or AFS tokens.
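(A simple way to see which environment variables actually arrive on the remote side, just as a rough check: let mpirun start a plain "env" on each node and grep for the variables the job script exports, e.g.

mpirun -np 2 --map-by ppr:1:node env | grep -E 'OMP_NUM_THREADS|KRB5'

The variable names in the grep pattern are of course only examples.)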

Is there a way to revert the 2.1.0 behavior to the 1.10.6 one (i.e. use SGE/qrsh)? Are there MCA parameters to control this?

If you need more info, please let me know. (The job-submitting machine and the target cluster are the same in all tests. The software resides in AFS directories that are visible on all machines. The parameter "plm_rsh_disable_qrsh" currently has the value "false".)
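(Parameter values like the one above can be inspected with ompi_info; something along these lines should list the rsh/qrsh-related settings, assuming the 2.1.0 ompi_info is in the PATH:

ompi_info --param plm rsh --level 9 | grep -E 'agent|qrsh')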

Kind regards,

Heinz-Ado Arnolds
Reuti
2017-03-22 12:58:46 UTC
Hi,
Post by Heinz-Ado Arnolds
Is there a way to revert the 2.1.0 behavior to the 1.10.6 one (i.e. use SGE/qrsh)? Are there MCA parameters to control this?
It looks like `mpirun` still needs:

-mca plm_rsh_agent foo

to allow SGE to be detected.
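For example, the full invocation from the original report would then look roughly like this ("foo" is only the placeholder value used above):

mpirun -np 8 --map-by ppr:4:node -mca plm_rsh_agent foo ./myid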

-- Reuti
Heinz-Ado Arnolds
2017-03-22 14:31:29 UTC
Dear Reuti,

thanks a lot, you're right! But why did the default behavior change, while the value of this parameter did not:

2.1.0: MCA plm rsh: parameter "plm_rsh_agent" (current value: "ssh : rsh", data source: default, level: 2 user/detail, type: string, synonyms: pls_rsh_agent, orte_rsh_agent)
The command used to launch executables on remote nodes (typically either "ssh" or "rsh")

1.10.6: MCA plm: parameter "plm_rsh_agent" (current value: "ssh : rsh", data source: default, level: 2 user/detail, type: string, synonyms: pls_rsh_agent, orte_rsh_agent)
The command used to launch executables on remote nodes (typically either "ssh" or "rsh")

So there must have been changes in the code in that area, perhaps in the SGE detection? Do you know of a way to revert to the old behavior (e.g. a configure option)? Otherwise all my users will have to add this option.
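(As a stopgap we could of course set the parameter through the generic environment-variable mechanism, e.g. in the job script:

export OMPI_MCA_plm_rsh_agent=foo

but a site-wide or build-time solution would be nicer.)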

Thanks again, and have a nice day

Ado Arnolds
Post by Reuti
It looks like `mpirun` still needs:
-mca plm_rsh_agent foo
to allow SGE to be detected.
Reuti
2017-03-22 14:49:12 UTC
Post by Heinz-Ado Arnolds
So there must have been changes in the code in that area, perhaps in the SGE detection? Do you know of a way to revert to the old behavior (e.g. a configure option)? Otherwise all my users will have to add this option.
There was a discussion in https://github.com/open-mpi/ompi/issues/2947

For now you can make use of https://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Essentially, to have it set automatically for all users, put:

plm_rsh_agent=foo

in $prefix/etc/openmpi-mca-params.conf of your central Open MPI 2.1.0 installation.
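As a quick check (again just a sketch, using ompi_info from that installation), the data source reported for the parameter should then no longer be "default" but the config file:

ompi_info --param plm rsh --level 2 | grep plm_rsh_agent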

-- Reuti
r***@open-mpi.org
2017-03-22 14:55:19 UTC
Sorry folks - for some reason (probably timing for getting 2.1.0 out), the fix for this got pushed to v2.1.1 - see the PR here: https://github.com/open-mpi/ompi/pull/3163
Heinz-Ado Arnolds
2017-03-27 16:03:42 UTC
Dear rhc,
dear Reuti,

thanks for your valuable help!

Kind regards,

Ado Arnolds
Post by r***@open-mpi.org
Sorry folks - for some reason (probably timing for getting 2.1.0 out), the fix for this got pushed to v2.1.1 - see the PR here: https://github.com/open-mpi/ompi/pull/3163