Discussion:
[OMPI users] Cannot run a job with more than 3 nodes
Victor
2014-03-12 06:37:25 UTC
I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.

I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.

I can log into each node over ssh (key-based authentication) from the shell
that is running the MPI job, using the names defined in /etc/hosts.

I can run an mpi job if I include only 3 nodes in the hostfile, for example:

Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8

But if I add a fourth node to the hostfile, e.g.:

Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node4 slots=8 max-slots=8

I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:

ssh: Could not resolve hostname Node4: Name or service not known.

But, I can log into Node4 using ssh from the same shell by using ssh Node4.

Also, if I reorder the hostfile, for example moving Node1 to the last
spot:

Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node1 slots=8 max-slots=8

The error becomes

ssh: Could not resolve hostname Node1: Name or service not known.

If I then go back to the three node hostfile like this:

Node1 slots=8 max-slots=8
Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8

There is no error with three nodes, even though Node1 and Node4 each "cannot
be found" when they are listed last in a four-node hostfile. The last
position in the hostfile seems to be bugged.

What is going on? How do I fix this?
Victor
2014-03-12 07:26:50 UTC
I "fixed it" after finding a message about tree spawn in a thread from
November 2013. When I run the job with -mca plm_rsh_no_tree_spawn 1, it
works across all 4 nodes.

I cannot identify any errors in my ssh key setup, and since I am only using 4
nodes I am not concerned about a somewhat slower launch. Is faster job
launch the only benefit of tree spawn?
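
For anyone following along, the workaround above amounts to adding one MCA parameter to the original command line. A minimal sketch, assuming the same hostfile, process count, and a.out as in the first message; build_mpirun_cmd is a hypothetical helper that only prints the command so it can be inspected before it is run:

```shell
# Print the mpirun command line with tree spawn disabled.
# build_mpirun_cmd is a hypothetical helper, not part of Open MPI.
# Arguments: number of processes, hostfile path, program to launch.
build_mpirun_cmd() {
    printf 'mpirun -mca plm_rsh_no_tree_spawn 1 -np %s --hostfile %s %s\n' \
        "$1" "$2" "$3"
}

# Values taken from the first message in this thread.
build_mpirun_cmd 32 hostfile a.out
```

Printing the command rather than executing it keeps the sketch safe to try on a machine without Open MPI installed.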
Reuti
2014-03-12 08:01:18 UTC
Hi,
Post by Victor
Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Are you using an uppercase name here intentionally? Is this the name the host returns from `hostname`? Although uppercase is allowed, and case should be ignored (or the name folded to lowercase) for hostname resolution, I have found that not all programs do this. In my experience it is best to use only lowercase characters.

Is the same version of Ubuntu Linux installed on all machines?

-- Reuti
_______________________________________________
users mailing list
http://www.open-mpi.org/mailman/listinfo.cgi/users
Victor
2014-03-12 08:07:31 UTC
Hostname: no, I actually use lowercase names. For some reason, while I was
writing the email, I thought uppercase would be clearer.

The same version of Ubuntu (12.04 x64) is on all nodes, and Open MPI and the
executable are shared via NFS.
Jeff Squyres (jsquyres)
2014-03-12 10:15:19 UTC
Are all names resolvable from all servers?

I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
--
Jeff Squyres
***@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Victor
2014-03-12 11:34:11 UTC
Yes, they are. I can resolve and log into each node from every other node
using its "friendly" name, not its IP address.
Jeff Squyres (jsquyres)
2014-03-12 12:44:23 UTC
Can you verify that for all 4 nodes? I.e., something like this:

foreach node (Node1 Node2 Node3 Node4)
  foreach other (Node1 Node2 Node3 Node4)
    echo from $node to $other
    ssh $node ssh $other hostname
  end
end
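
The loop above is csh syntax. For sh or bash users, an equivalent might look like the sketch below. The function name check_all_pairs and the SSH variable are hypothetical additions for illustration: SSH defaults to ssh and can be set to echo for a dry run that only prints the commands.

```shell
# All-pairs check: from every node, ssh to every node (itself included)
# and print the remote hostname, mirroring the csh loop above.
# check_all_pairs and the SSH override are hypothetical, for illustration.
check_all_pairs() {
    for node in "$@"; do
        for other in "$@"; do
            echo "from $node to $other"
            # A failure here identifies the hop that cannot resolve the name.
            ${SSH:-ssh} "$node" ssh "$other" hostname \
                || echo "FAILED: $node -> $other"
        done
    done
}
```

It would be called as check_all_pairs Node1 Node2 Node3 Node4, with the node names replaced by the real (lowercase) hostnames. Any FAILED line points at a specific source node that cannot reach a specific target by name.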