Discussion:
[OMPI users] Disable network interface selection
carlos aguni
2018-06-22 23:36:03 UTC
Hi all,

I'm trying to run a code on 2 machines that each have at least 2 network
interfaces.
They are set up as described below:


          compute01            compute02
ens3      192.168.100.104/24   10.0.0.227/24
ens8      10.0.0.228/24        172.21.1.128/24
ens9      172.21.1.155/24      ---

Issue is: when I execute `mpirun -n 2 -host compute01,compute02 hostname`
on them, I get the correct output, but only after a very long delay.

What I've read so far is that Open MPI greedily tries each interface and
times out when it can't find the desired IP.
Then I saw here (https://www.open-mpi.org/faq/?category=tcp#tcp-selection)
that I can run commands like:
`$ mpirun -n 2 --mca oob_tcp_if_include 10.0.0.0/24 -host
compute01,compute02 hostname`
But this configuration doesn't reach the other host(s).
In the end I sometimes get the same timeout.

So is there a way to let it use the system's default route?

Regards,
Carlos.
Jeff Squyres (jsquyres) via users
2018-06-23 00:43:37 UTC
Post by carlos aguni
Issue is: when I execute `mpirun -n 2 -host compute01,compute02 hostname` on them, I get the correct output, but only after a very long delay.
`$ mpirun -n 2 --mca oob_tcp_if_include 10.0.0.0/24 -host compute01,compute02 hostname`
But this configuration doesn't reach the other host(s).
There are actually 2 different uses of TCP in Open MPI: the MPI communications and the runtime communications.

In your scenario, the MPI communications should probably "just figure it out" (since you have 2 interfaces on the same subnets on each machine). It can do this because the runtime has already been established, and -- for lack of a longer explanation -- it can do very speedy discovery and interface matching.

But the runtime has nothing else to refer to, and it has to do its own discovery with no prior knowledge of anything. This is where the timeouts come in.

What you described above -- setting oob_tcp_if_include to the 10.0.0.0/24 network -- *should* work. It's a little surprising that it does not.

Can you run with:

mpirun -np 2 --mca oob_tcp_if_include 10.0.0.0/24 --mca oob_base_verbose 100 -host compute01,compute02 hostname

And see what it shows us?
Post by carlos aguni
In the end I sometimes get the same timeout.
So is there a way to let it use the system's default route?
Yes and no. The problem is that in HPC environments, the default IP route is not always in the same direction as the nodes on which you're trying to run (i.e., there's a zillion different ways to set up the IP networking, and Open MPI users tend to use a lot of different ones...).
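
If you do end up pinning the runtime to one network, you likely want to pin the MPI-level (BTL) traffic the same way. A sketch, using the 10.0.0.0/24 subnet from your table (adjust to your setup):

mpirun -np 2 --mca oob_tcp_if_include 10.0.0.0/24 --mca btl_tcp_if_include 10.0.0.0/24 -host compute01,compute02 hostname

(oob_tcp_if_include steers the runtime connections; btl_tcp_if_include steers the MPI communications.)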
--
Jeff Squyres
***@cisco.com
Gilles Gouaillardet
2018-06-23 02:31:57 UTC
Carlos,

By any chance, could

mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 ...

work for you ?

Which Open MPI version are you running ?


IIRC, subnets are internally translated to interfaces, so that might be an
issue if the translation is made on the first host, and the interface name
is then sent to the other hosts.
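
If you also exclude on the MPI (BTL) side, keep in mind that overriding btl_tcp_if_exclude replaces its default value, so the loopback network should stay in the list. An untested sketch:

mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 --mca btl_tcp_if_exclude 127.0.0.1/8,192.168.100.0/24 -n 2 -host compute01,compute02 hostname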

Cheers,

Gilles
r***@open-mpi.org
2018-06-23 03:25:46 UTC
Post by Gilles Gouaillardet
IIRC, subnets are internally translated to interfaces, so that might be an issue if
the translation is made on the first host, and the interface name is then sent to the other hosts.
FWIW: we never send interface names to other hosts - just dot addresses
r***@open-mpi.org
2018-06-23 03:27:31 UTC
Post by r***@open-mpi.org
FWIW: we never send interface names to other hosts - just dot addresses
Should have clarified - when you specify an interface name for the MCA param, then it is the interface name that is transferred as that is the value of the MCA param. However, once we determine our address, we only transfer dot addresses between ourselves
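
So both of these forms are accepted, but they resolve differently: an interface name is looked up locally on each host (and may sit on a different subnet on each one), while a CIDR subnet is matched against whatever local interface carries it. For example:

mpirun --mca oob_tcp_if_include ens8 ...
mpirun --mca oob_tcp_if_include 10.0.0.0/24 ...

When interface names are not uniform across hosts, the subnet form is generally the safer choice.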
carlos aguni
2018-06-23 20:34:48 UTC
Hi!

Thank you all for your replies, Jeff, Gilles and rhc.

Thank you Jeff and rhc for clarifying some of Open MPI's internals.
Post by r***@open-mpi.org
FWIW: we never send interface names to other hosts - just dot addresses
Should have clarified - when you specify an interface name for the MCA
param, then it is the interface name that is transferred as that is the
value of the MCA param. However, once we determine our address, we only
transfer dot addresses between ourselves

If only dot addresses are sent to the hosts, then why doesn't Open MPI use
the default route, like `ip route get <other host IP>` does, instead of
choosing a random one? Is this expected behaviour? Can it be changed?
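For instance, on compute01 the kernel already knows which interface reaches
compute02 (illustrative output from my setup, trimmed):

$ ip route get 10.0.0.227
10.0.0.227 dev ens8 src 10.0.0.228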

Sorry, as Gilles pointed out, I forgot to mention which Open MPI version I
am using: Open MPI 3.0.0, gcc 7.3.0, from OpenHPC, on CentOS 7.5.
Post by r***@open-mpi.org
mpirun --mca oob_tcp_if_exclude 192.168.100.0/24 ...
I can't just exclude that interface, because later I want to add another
computer that's on a different network. And this is where things get messy
:( I can't simply include and exclude networks, because I have different
machines on different networks.
This is what I want to achieve:


          compute01            compute02            compute03
ens3      192.168.100.104/24   10.0.0.227/24        192.168.100.105/24
ens8      10.0.0.228/24        172.21.1.128/24      ---
ens9      172.21.1.155/24      ---                  ---

So from compute01 I'm MPI_Spawning another process on compute02 and
compute03, with both MPI_Spawn and `mpirun -n 3 -host
compute01,compute02,compute03 hostname`.

Then when I include the MCA parameters I get this:
`mpirun --oversubscribe --allow-run-as-root -n 3 --mca oob_tcp_if_include
10.0.0.0/24,192.168.100.0/24 -host compute01,compute02,compute03 hostname`
WARNING: An invalid value was given for oob_tcp_if_include. This value
will be ignored.
...
Message: Did not find interface matching this subnet

This would all work if it used the system's routing table, the way `ip
route` does.

Best regards,
Carlos.
carlos aguni
2018-06-28 19:10:26 UTC
Just realized my email wasn't sent to the archive.
Gilles Gouaillardet
2018-07-02 00:01:08 UTC
Carlos,


Open MPI 3.0.2 has been released, and it contains several bug fixes, so I do
encourage you to upgrade and try again.

If it still does not work, can you please run

mpirun --mca oob_base_verbose 10 ...

and then compress and post the output ?

Out of curiosity, would

mpirun --mca routed_radix 1 ...

work in your environment ?

Once we can analyze the logs, we should be able to figure out what is
going wrong.
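
Both parameters can go on the same command line; to capture the output for
posting, something along these lines (adapt the hosts to yours):

mpirun --mca oob_base_verbose 10 --mca routed_radix 1 -n 3 -host compute01,compute02,compute03 hostname 2>&1 | gzip > oob_verbose.log.gz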


Cheers,

Gilles
carlos aguni
2018-07-04 20:36:30 UTC
Hi Gilles.

Thank you for your reply! :)
I'm now using a compiled version of Open MPI 3.0.2 and everything seems to
work fine now.
Running `mpirun -n 3 -host c01,c02,c03 hostname` I get:
c01
c02
c03

`mpirun -n 2 -host c01,c02 hostname`:
c02
c01

`mpirun -n 2 -host c01,c03 hostname`:
c01
c03

Which is expected.

Now when I run an MPI_Spawn it prints a warning message which suggests it
is picking the wrong IP.
Here's the command; I'll highlight some of the verbose output below.
`mpirun -n 1 --machinefile con_c03_hostfile --mca oob_base_verbose 10
con_c03`:
Hello world from processor c01, rank 0 out of 2 processors
Im the spawned rank 0
Hello world from processor c03, rank 1 out of 2 processors
[[35996,2],0][btl_tcp_endpoint.c:755:mca_btl_tcp_endpoint_start_connect]
from c03 to: c01 Unable to connect to the peer 10.0.0.1 on port 1024:
Network is unreachable

[c03:06355] pml_ob1_sendreq.c:235 FATAL

Verbose below:
[c01:05462] [[36010,0],0] oob:tcp:init adding 10.0.0.1 to our list of V4
connections
[c01:05462] [[36010,0],0] oob:tcp:init adding 172.16.0.1 to our list of V4
connections
[c01:05462] [[36010,0],0] oob:tcp:init adding 172.21.1.136 to our list of
V4 connections
[c03:06225] [[36010,0],1] oob:tcp:init adding 192.168.0.1 to our list of V4
connections
[c03:06225] [[36010,0],1] oob:tcp:init adding 172.16.0.2 to our list of V4
connections

Is there a way to suppress it?

My env is as described below:
*c01*
ens8 10.0.0.1/24
ens9 172.16.0.1/24
eth0 172.21.1.136/24

*c02*
eth0 10.0.0.2/24

*c03*
ens8 192.168.0.1/24
eth1 172.16.0.2/24

*c04*
eth0 192.168.0.2/24

Regards,
Carlos.
Jeff Squyres (jsquyres) via users
2018-07-10 00:23:47 UTC
Can you send the full verbose output with "--mca btl_base_verbose 100"?
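
In the meantime, an untested workaround sketch: since 172.16.0.0/24 looks like the one network that c01 and c03 share, restricting the TCP BTL to it should keep it from trying the unreachable 10.0.0.1 (assuming that subnet is actually routable between the two):

mpirun -n 1 --machinefile con_c03_hostfile --mca btl_tcp_if_include 172.16.0.0/24 con_c03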
Post by carlos aguni
[[35996,2],0][btl_tcp_endpoint.c:755:mca_btl_tcp_endpoint_start_connect] from c03 to: c01 Unable to connect to the peer 10.0.0.1 on port 1024: Network is unreachable
[c03:06355] pml_ob1_sendreq.c:235 FATAL
Is there a way to suppress it?
--
Jeff Squyres
***@cisco.com
