Victor
2014-03-12 06:37:25 UTC
I am using Open MPI 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.
I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.
I can log into each node using ssh and the certificate method from the shell
that is running the MPI job, by using their names as defined in /etc/hosts.
I can run an mpi job if I include only 3 nodes in the hostfile, for example:
Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
But if I add a fourth node to the hostfile, e.g.:
Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node4 slots=8 max-slots=8
I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
ssh: Could not resolve hostname Node4: Name or service not known.
But I can log into Node4 using ssh from the same shell by running ssh Node4.
Also, if I reorder the hostfile, for example by moving Node1 to the
last spot:
Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node1 slots=8 max-slots=8
The error becomes
ssh: Could not resolve hostname Node1: Name or service not known.
If I then go back to the three node hostfile like this:
Node1 slots=8 max-slots=8
Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8
There is no error with three nodes, even though both Node1 and Node4 "cannot
be found" when they are in the last spot of a 4 node hostfile. The
last spot seems to be bugged.
What is going on? How do I fix this?
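In case it helps anyone reproduce this, here is a minimal sketch of the name-resolution check I describe above, scripted so it can be run on any of the nodes rather than only from the launching shell. The function names parse_hostfile and check_hosts are my own, made up for illustration, and the parsing assumes the simple "name slots=N max-slots=N" layout shown above, not the full hostfile syntax.

```python
import socket


def parse_hostfile(text):
    """Return the hostnames from simple hostfile text (first field of each line)."""
    hosts = []
    for line in text.splitlines():
        # Drop anything after a '#' and skip blank lines.
        line = line.split("#", 1)[0]
        if not line.strip():
            continue
        hosts.append(line.split()[0])
    return hosts


def check_hosts(hosts, resolve=socket.gethostbyname):
    """Map each hostname to its resolved address, or None if lookup fails."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:
            # socket.gaierror (name not known) is a subclass of OSError.
            results[host] = None
    return results
```

Calling check_hosts(parse_hostfile(open("hostfile").read())) on each node in turn would show whether any name fails to resolve on that particular machine; a None value marks a name that could not be looked up there.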