Discussion:
[OMPI users] mpirun issue using more than 64 hosts
Adam Sylvester
2018-02-12 23:23:48 UTC
I'm running OpenMPI 2.1.0, built from source, on RHEL 7. I'm using the
default ssh-based launcher, where I have my private ssh key on rank 0 and
the associated public key on all ranks. I create a hosts file with a list
of unique IPs, with the host that I'm running mpirun from on the first
line, and run this command:

mpirun -N 1 --bind-to none --hostfile hosts.txt hostname

This works fine up to 64 machines. At 65 or greater, I get ssh errors.
Frequently

Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

though today another user got

Host key verification failed.

I have confirmed I can successfully manually ssh into these instances.
I've also written a loop in bash which will background an ssh sleep command
to > 64 instances and this succeeds.
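
A minimal sketch of such a loop, for reference. The function name `check_hosts`, the `SSH_BIN` override, and the 5-second sleep are illustrative assumptions, not the script that was actually run. Note that a loop like this only exercises connections that originate from the machine it runs on.

```shell
# Hypothetical helper: open one backgrounded ssh session per host listed in
# a hostfile, then print how many of them failed.
# SSH_BIN is an assumed override hook so the loop can be dry-run.
SSH_BIN="${SSH_BIN:-ssh}"

check_hosts() {
    hostfile="$1"
    pids=""
    failures=0
    # The "|| [ -n ... ]" part handles a last line without a trailing newline.
    while read -r host rest || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in ''|\#*) continue ;; esac
        # -n keeps ssh away from stdin; BatchMode makes it fail rather than prompt.
        "$SSH_BIN" -n -o BatchMode=yes "$host" sleep 5 &
        pids="$pids $!"
    done < "$hostfile"
    # Collect the exit status of every backgrounded session.
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    echo "$failures"
}
```

Calling `check_hosts hosts.txt` prints the number of hosts whose ssh session exited with a non-zero status.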

From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
connections have to do with inbound, not outbound limits, and I can prove
by running straight ssh commands that I'm not hitting a limit.

Is there something wrong with my mpirun syntax (I've run this way thousands
of times without issues with 64 or fewer hosts, and I know MPI is
frequently used on orders of magnitude more hosts than this)? Or is this
a known bug that's addressed in a later MPI release?

Thanks for the help.
-Adam
Gilles Gouaillardet
2018-02-13 00:12:14 UTC
Adam,

by default, when more than 64 hosts are involved, mpirun uses a tree
spawn in order to remote launch the orted daemons.

That means you have two options here:
- allow all compute nodes to ssh to each other (e.g. the ssh public key
of *all* the nodes should be in *all* the authorized_keys files)
- do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true ...)

I recommend the first option, otherwise mpirun would fork&exec a large
number of ssh processes and hence use quite a lot of
resources on the node running mpirun.
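
As a rough sketch of the second option, the helper below counts the hosts in a hostfile and prints the mpirun command line, adding the `plm_rsh_no_tree_spawn` parameter only when the count exceeds 64. The function names and the `TREE_SPAWN_THRESHOLD` variable are hypothetical, and the threshold is simply the number quoted above; it may differ in other Open MPI versions. The command is printed rather than executed.

```shell
# Threshold taken from the explanation above; an assumption for other versions.
TREE_SPAWN_THRESHOLD=64

# Count the non-blank, non-comment lines in a hostfile.
count_hosts() {
    grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Print (do not run) the mpirun command from the original post, disabling
# the tree spawn when the host count is above the threshold.
build_mpirun_cmd() {
    hostfile="$1"
    n=$(count_hosts "$hostfile")
    if [ "$n" -gt "$TREE_SPAWN_THRESHOLD" ]; then
        echo "mpirun --mca plm_rsh_no_tree_spawn true -N 1 --bind-to none --hostfile $hostfile hostname"
    else
        echo "mpirun -N 1 --bind-to none --hostfile $hostfile hostname"
    fi
}
```

With the tree spawn disabled, every ssh session starts from the node running mpirun, which is the resource cost described above.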

Cheers,

Gilles
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Adam Sylvester
2018-02-13 00:34:42 UTC
Ahhhh... thanks Gilles. That makes sense. I was stuck thinking there was
an ssh problem on rank 0; it never occurred to me mpirun was doing
something clever there and that those ssh errors were from a different
instance altogether.
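
One way to reproduce errors that come from a different instance is to log in to one node and attempt a second, non-interactive ssh from there. The sketch below does that for consecutive pairs of hosts in a hostfile. The names `check_hop`, `check_chain`, and `SSH_BIN` are hypothetical, and consecutive pairs are only a sample: which nodes actually connect to which under a tree spawn is not specified in this thread.

```shell
# SSH_BIN is an assumed override hook so the checks can be dry-run.
SSH_BIN="${SSH_BIN:-ssh}"

# Log in to host $1 from here, then from $1 try a non-interactive login to $2.
check_hop() {
    if "$SSH_BIN" -n -o BatchMode=yes "$1" ssh -o BatchMode=yes "$2" true; then
        echo "ok: $1 -> $2"
    else
        echo "FAILED: $1 -> $2"
        return 1
    fi
}

# Check each host against the next one in the hostfile (a sample, not all pairs).
check_chain() {
    prev=""
    status=0
    while read -r host rest || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in ''|\#*) continue ;; esac
        if [ -n "$prev" ]; then
            check_hop "$prev" "$host" || status=1
        fi
        prev="$host"
    done < "$1"
    return "$status"
}
```

Running `check_chain hosts.txt` prints one line per pair and returns non-zero if any hop fails, which would point at missing keys or known_hosts entries on the intermediate node.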

It's no problem to put my private key on all instances - I'll go that route.

-Adam
