Discussion:
[OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts
Max Mellette
2018-05-04 17:08:54 UTC
Permalink
Hello All,

I'm trying to set up OpenMPI 3.0.1 on a pair of Linux machines, but I'm
running into a problem where mpirun hangs when I try to execute a simple
command across the two machines:

$ mpirun --host b09-30,b09-32 hostname

I'd appreciate any assistance with this problem. I'm a new MPI user and
suspect I'm just missing something, but I've checked the documentation at
www.open-mpi.org and various forums and haven't been able to figure it out.

Thanks,
Max

Here are some configuration details:

- Both machines running Ubuntu 16.04
- b09-30 is the local host
- b09-32 is the remote host
- Installed OpenMPI 3.0.1 from the .tar source into /usr/local on both machines
(following the instructions from www.open-mpi.org)
- Configured PATH and LD_LIBRARY_PATH on both machines
- Passwordless ssh works between the two machines
- UFW firewall is disabled on both machines

Here's some terminal output, including running the command above with --mca
plm_base_verbose 100 set:

***@b09-30:~$ sudo ufw status
Status: inactive
***@b09-30:~$ cat .bashrc
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/usr/local/lib
***@b09-30:~$ ssh b09-32 hostname
b09-32
***@b09-30:~$ mpirun --host b09-30 hostname
b09-30
***@b09-30:~$ mpirun --host b09-30,b09-32 --mca plm_base_verbose 100 hostname
[b09-30:76987] mca: base: components_register: registering framework plm components
[b09-30:76987] mca: base: components_register: found loaded component slurm
[b09-30:76987] mca: base: components_register: component slurm register function successful
[b09-30:76987] mca: base: components_register: found loaded component rsh
[b09-30:76987] mca: base: components_register: component rsh register function successful
[b09-30:76987] mca: base: components_register: found loaded component isolated
[b09-30:76987] mca: base: components_register: component isolated has no register or open function
[b09-30:76987] mca: base: components_open: opening plm components
[b09-30:76987] mca: base: components_open: found loaded component slurm
[b09-30:76987] mca: base: components_open: component slurm open function successful
[b09-30:76987] mca: base: components_open: found loaded component rsh
[b09-30:76987] mca: base: components_open: component rsh open function successful
[b09-30:76987] mca: base: components_open: found loaded component isolated
[b09-30:76987] mca: base: components_open: component isolated open function successful
[b09-30:76987] mca:base:select: Auto-selecting plm components
[b09-30:76987] mca:base:select:( plm) Querying component [slurm]
[b09-30:76987] mca:base:select:( plm) Querying component [rsh]
[b09-30:76987] mca:base:select:( plm) Query of component [rsh] set priority to 10
[b09-30:76987] mca:base:select:( plm) Querying component [isolated]
[b09-30:76987] mca:base:select:( plm) Query of component [isolated] set priority to 0
[b09-30:76987] mca:base:select:( plm) Selected component [rsh]
[b09-30:76987] mca: base: close: component slurm closed
[b09-30:76987] mca: base: close: unloading component slurm
[b09-30:76987] mca: base: close: component isolated closed
[b09-30:76987] mca: base: close: unloading component isolated
[b09-30:76987] [[36418,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "2386690048" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "b[2:9]-30,b[2:9]-***@0(2)" -mca orte_hnp_uri "2386690048.0;tcp://169.228.66.102,10.1.100.30:55714" --mca plm_base_verbose "100" -mca plm "rsh" -mca pmix "^s1,s2,cray,isolated"
^C[b09-30:76987] mca: base: close: component rsh closed
[b09-30:76987] mca: base: close: unloading component rsh
***@b09-30:~$

(I have to kill the process or it hangs indefinitely; I've waited more than
10 minutes.)
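
One useful early check here: Open MPI's rsh launcher starts orted on the remote host through a non-interactive ssh session, so what matters is the PATH and LD_LIBRARY_PATH that a non-interactive shell sees, not what an interactive login shows. A minimal sanity check along these lines, using the hostname from the setup above (the output depends on your shell startup files):

$ ssh b09-32 'echo $PATH; echo $LD_LIBRARY_PATH; which orted'

If orted does not resolve to the /usr/local install in that output, the remote daemon may never start correctly even though interactive ssh works fine.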
Jeff Squyres (jsquyres)
2018-05-11 21:38:19 UTC
Permalink
Post by Max Mellette
$ mpirun --host b09-30,b09-32 hostname
Do you see the output from the 2 `hostname` commands when this runs? Or does it just hang with no output?
Post by Max Mellette
Status: inactive
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/usr/local/lib
b09-32
b09-30
I'm interested to see if you get the output from "hostname" when you don't use `--mca plm_base_verbose 100`.

Also, when this hangs, what is left running on b09-30 and b09-32? Is it just mpirun? Or are there any orted processes, too?
--
Jeff Squyres
***@cisco.com
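
One way to answer that question from b09-30, assuming passwordless ssh to both hosts as described above (just a sketch; adjust the patterns as needed):

$ for h in b09-30 b09-32; do echo "== $h =="; ssh $h 'ps -ef | egrep "mpirun|orted" | grep -v grep'; done

Whether an orted shows up on b09-32, or only a stuck ssh on b09-30 carrying the orted command line, narrows down where the launch is stalling.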
Gilles Gouaillardet
2018-05-12 08:56:15 UTC
Permalink
Max,

The 'T' state of the ssh process is very puzzling.

Can you try to run
/usr/bin/ssh -x b09-32 orted
on b09-30 and see what happens?
(It should fail with an error message instead of hanging.)

To check that no firewall is in the way, can you run
iptables -L
instead? Also, is SELinux enabled? There could be some rules that
prevent ssh from working as expected.


Cheers,

Gilles
Post by Max Mellette
Hi Jeff,
Thanks for the reply. FYI since I originally posted this, I uninstalled
OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
problem.
When I run the command without the `--mca plm_base_verbose 100` flag, it
hangs indefinitely with no output.
As far as I can tell, these are the additional processes running on each host:
user 361714 0.4 0.0 293016 8444 pts/0 Sl+ 15:10 0:00 mpirun
--host b09-30,b09-32 hostname
user 361719 0.0 0.0 37092 5112 pts/0 T 15:10 0:00
/usr/bin/ssh -x b09-32 orted -mca ess "env" -mca ess_base_jobid "638517248"
-mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
"638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
pmix "^s1,s2,cray,isolated"
[accepted]
[net]
I only see orted showing up in the ssh arguments on b09-30. Any ideas what I
should try next?
Thanks,
Max
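
Since the ssh carrying the orted command line is sitting in the 'T' (stopped) state, a few quick checks on b09-30 may show why it stopped. These use the PID from the ps listing above (361719 in that run; it will differ every time):

$ ps -o pid,stat,wchan:20,cmd -p 361719
$ grep -E 'State|TracerPid' /proc/361719/status
$ kill -CONT 361719    # resume it and see whether the launch then proceeds

A non-zero TracerPid would mean something is tracing the process; otherwise it most likely received a stop signal. A backgrounded ssh that tries to read from the terminal (for example, for a password or host-key prompt) gets SIGTTIN and stops in exactly this way.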
Max Mellette
2018-05-14 04:44:07 UTC
Permalink
Hi Gilles,

Thanks for the suggestions; the results are below. Any ideas where to go
from here?

----- Seems that selinux is not installed:

***@b09-30:~$ sestatus
The program 'sestatus' is currently not installed. You can install it by
typing:
sudo apt install policycoreutils

----- Output from orted:

***@b09-30:~$ /usr/bin/ssh -x b09-32 orted
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
ess_env_module.c at line 147
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 106
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 345
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 270
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_session_dir failed
--> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

----- iptables rules:

***@b09-30:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ufw-before-logging-input all -- anywhere anywhere
ufw-before-input all -- anywhere anywhere
ufw-after-input all -- anywhere anywhere
ufw-after-logging-input all -- anywhere anywhere
ufw-reject-input all -- anywhere anywhere
ufw-track-input all -- anywhere anywhere

Chain FORWARD (policy ACCEPT)
target prot opt source destination
ufw-before-logging-forward all -- anywhere anywhere
ufw-before-forward all -- anywhere anywhere
ufw-after-forward all -- anywhere anywhere
ufw-after-logging-forward all -- anywhere anywhere
ufw-reject-forward all -- anywhere anywhere
ufw-track-forward all -- anywhere anywhere

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
ufw-before-logging-output all -- anywhere anywhere
ufw-before-output all -- anywhere anywhere
ufw-after-output all -- anywhere anywhere
ufw-after-logging-output all -- anywhere anywhere
ufw-reject-output all -- anywhere anywhere
ufw-track-output all -- anywhere anywhere

Chain ufw-after-forward (1 references)
target prot opt source destination

Chain ufw-after-input (1 references)
target prot opt source destination

Chain ufw-after-logging-forward (1 references)
target prot opt source destination

Chain ufw-after-logging-input (1 references)
target prot opt source destination

Chain ufw-after-logging-output (1 references)
target prot opt source destination

Chain ufw-after-output (1 references)
target prot opt source destination

Chain ufw-before-forward (1 references)
target prot opt source destination

Chain ufw-before-input (1 references)
target prot opt source destination

Chain ufw-before-logging-forward (1 references)
target prot opt source destination

Chain ufw-before-logging-input (1 references)
target prot opt source destination

Chain ufw-before-logging-output (1 references)
target prot opt source destination

Chain ufw-before-output (1 references)
target prot opt source destination

Chain ufw-reject-forward (1 references)
target prot opt source destination

Chain ufw-reject-input (1 references)
target prot opt source destination

Chain ufw-reject-output (1 references)
target prot opt source destination

Chain ufw-track-forward (1 references)
target prot opt source destination

Chain ufw-track-input (1 references)
target prot opt source destination

Chain ufw-track-output (1 references)
target prot opt source destination


Thanks,
Max
John Hearns via users
2018-05-14 07:29:18 UTC
Permalink
One very, very stupid question here; this arose over on the Slurm list,
actually.
Those hostnames look like quite generic names, i.e. are they part of an HPC
cluster?
Do they happen to have independent home directories for your userid?
Could that possibly make a difference to the MPI launcher?
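
A quick way to rule that out from b09-30, assuming passwordless ssh to both hosts (the loop is just a sketch):

$ for h in b09-30 b09-32; do echo "== $h =="; ssh $h 'echo $HOME; df -h $HOME | tail -1; which mpirun orted'; done

If the two hosts report different mounts for $HOME or different mpirun/orted locations, the launcher could be picking up mismatched installs.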
Max Mellette
2018-05-14 16:40:39 UTC
Permalink
John,

Thanks for the suggestions. In this case there is no cluster manager / job
scheduler; these are just a couple of individual hosts in a rack. The
reason for the generic names is that I anonymized the full network address
in the previous posts, truncating to just the host name.

My home directory is network-mounted to both hosts. In fact, I uninstalled
OpenMPI 3.0.1 from /usr/local on both hosts, and installed OpenMPI 3.1.0
into my home directory at `/home/user/openmpi_install`, also updating
.bashrc appropriately:

***@b09-30:~$ cat .bashrc
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/openmpi_install/bin
export LD_LIBRARY_PATH=/home/user/openmpi_install/lib

So the environment should be the same on both hosts.

Thanks,
Max
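
If it helps to confirm that both hosts really launch the same build through the shared home directory, comparing the daemon binaries directly is one option (the path is taken from the .bashrc above):

$ for h in b09-30 b09-32; do ssh $h 'md5sum /home/user/openmpi_install/bin/orted'; done

Matching checksums would at least rule out a stale or mismatched install on one side.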

Gus Correa
2018-05-14 18:41:22 UTC
Permalink
Hi Max

Just in case, since environment mix-ups often happen:
could it be that you are inadvertently picking up another
installation of OpenMPI, perhaps installed from packages
in /usr or /usr/local?
That's easy to check with 'which mpiexec' or
'which mpicc', for instance.

Have you tried to prepend (as opposed to append) OpenMPI
to your PATH? Say:

export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'

I hope this helps,
Gus Correa
Max Mellette
2018-05-14 19:37:38 UTC
Permalink
Hi Gus,

Thanks for the suggestions. The correct version of OpenMPI seems to be
getting picked up; I also prepended the installation path in .bashrc like
you suggested, but it didn't seem to help:

***@b09-30:~$ cat .bashrc
export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
***@b09-30:~$ which mpicc
/home/user/openmpi_install/bin/mpicc
***@b09-30:~$ /usr/bin/ssh -x b09-32 orted
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
ess_env_module.c at line 147
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 106
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 345
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 270
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_session_dir failed
--> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Thanks,
Max
r***@open-mpi.org
2018-05-14 20:27:33 UTC
Permalink
You got that error because the orted is looking for its rank on the command line and not finding it.
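
In other words, a bare `ssh b09-32 orted` is expected to fail fast; it at least proves that ssh reaches b09-32 and finds an orted there. To reproduce the step that actually hangs, one could rerun the ssh command mpirun printed in the plm_base_verbose output, with ssh's own verbosity turned on to see where it stalls. A sketch (the jobid and URI values are per-run and were partly anonymized in the quoted log, so they have to come from a fresh verbose run):

$ /usr/bin/ssh -v -x b09-32 orted -mca ess "env" -mca ess_base_jobid "<from log>" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_hnp_uri "<from log>" -mca plm "rsh" -mca pmix "^s1,s2,cray,isolated"

With -v, ssh reports each stage (connection, authentication, remote command execution), which helps separate an ssh-level stall from an orted-level one.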
Gilles Gouaillardet
2018-05-14 23:39:58 UTC
Permalink
In the initial report, the /usr/bin/ssh process was in the 'T' state
(which generally hints that the process is being traced by a debugger).

/usr/bin/ssh -x b09-32 orted

did behave as expected (i.e. orted was executed, exited with an error
since the command line is invalid, and the error message was received).

Can you try to run

/home/user/openmpi_install/bin/mpirun --host b09-30,b09-32 hostname

and see how things go? (Since you simply ran 'ssh orted', another orted
might have been used.)

If you are still facing the same hang with ssh in the 'T' state, can you
check the logs on b09-32 and see whether the sshd server was even
contacted? I can hardly make sense of this error, FWIW.


Cheers,

Gilles
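
Concretely, those two checks might look like this, assuming the home-directory install mentioned earlier and that sshd on Ubuntu 16.04 logs to /var/log/auth.log. On b09-30:

$ /home/user/openmpi_install/bin/mpirun --host b09-30,b09-32 hostname

and, while that is hanging, in a separate login on b09-32:

$ sudo tail -n 20 /var/log/auth.log | grep sshd

If no new sshd entry shows up there, the ssh started by mpirun never reached the remote server at all.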
Max Mellette
2018-05-15 05:39:17 UTC
Permalink
Thanks everyone for all your assistance. The problem seems to be resolved
now, although I'm not entirely sure why these changes made a difference.
There were two things I changed:

(1) I had some additional `export ...` lines in .bashrc before the `export
PATH=...` and `export LD_LIBRARY_PATH=...` lines. When I removed those
lines (and then later added them back in below the PATH and LD_LIBRARY_PATH
lines) mpirun worked. But only b09-30 was able to execute code on b09-32
and not the other way around.

(2) I passed IP addresses to mpirun instead of the hostnames (this didn't
work previously), and now mpirun works in both directions (b09-30 -> b09-32
and b09-32 -> b09-30). I added a third host in the rack and mpirun still
works when passing IP addresses. For some reason using the hostnames
doesn't work, despite the fact that I can use them to ssh.

Also FWIW I wasn't using a debugger.

Thanks again,
Max
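
For what it's worth, the ordering described in (1) amounts to keeping the Open MPI variables at the top of .bashrc, with everything else after them. A sketch of the layout that ended up working, based on the description above (the placeholder comment stands in for the unspecified extra export lines):

# ~/.bashrc
export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
# ... other export lines moved below this point ...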
Gustavo Correa
2018-05-15 14:56:30 UTC
Permalink
Hi Max

Name resolution in /etc/hosts is a simple solution for (2).

I hope this helps,
Gus
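
A minimal sketch of that, to be added on every host involved. The first address is the 10.1.100.x one that appears in the orte_hnp_uri from the earlier logs; the second is a placeholder, since b09-32's address does not appear in this thread:

10.1.100.30    b09-30
10.1.100.32    b09-32    # placeholder -- use the real address of b09-32

Afterwards, `getent hosts b09-32` should return the same address on every host, and mpirun can be given the hostnames again.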
Jeff Squyres (jsquyres)
2018-05-15 22:36:06 UTC
Permalink
(1) I had some additional `export ...` lines in .bashrc before the `export PATH=...` and `export LD_LIBRARY_PATH=...` lines. When I removed those lines (and then later added them back in below the PATH and LD_LIBRARY_PATH lines) mpirun worked. But only b09-30 was able to execute code on b09-32 and not the other way around.
It depends on what those "export ..." lines were, and whether you moved them below the point where non-interactive shells exit your .bashrc.
(2) I passed IP addresses to mpirun instead of the hostnames (this didn't work previously), and now mpirun works in both directions (b09-30 -> b09-32 and b09-32 -> b09-30). I added a 3rd host in the rack and mpirun still works when passing IP addresses. For some reason using the host name doesn't work despite the fact that I can use it to ssh.
FWIW, that *shouldn't* matter. Gus pointed out that you can use /etc/hosts, but Open MPI should be fully able to use names instead of IP addresses.

If you're having problems with this, it makes me think that there may still be something weird in your environment, but hey, if you're ok using IP addresses and that's working -- might be good enough. :-)
--
Jeff Squyres
***@cisco.com
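
One way to check whether name resolution is what differs between ssh and mpirun here (ssh may be resolving the names through ~/.ssh/config or a search domain while Open MPI goes through the system resolver; this is a guess, not a diagnosis):

$ getent hosts b09-30 b09-32
$ ssh b09-32 'getent hosts b09-30 b09-32'
$ grep -i -A2 'host b09-3' ~/.ssh/config 2>/dev/null

If getent returns nothing for a name that ssh nevertheless reaches, an ~/.ssh/config entry or a DNS search-domain difference would explain why the names work for ssh but not for mpirun.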
Jeff Squyres (jsquyres)
2018-05-14 21:45:10 UTC
Permalink
Yes, that "T" state is quite puzzling. You didn't attach a debugger or hit the ssh with a signal, did you?

(We had a similar situation on the devel list recently, but it only happened with a very old version of Slurm. We concluded that it was a Slurm bug that has since been fixed. And just to be sure, I double-checked: the srun that hangs in that case is *not* in the "T" state -- it's in the "S" state, which appears to be a normal state.)
--
Jeff Squyres
***@cisco.com