Maksym Planeta
2018-06-18 22:52:58 UTC
Hello,
I want to force Open MPI to use TCP, and specifically a particular subnet. Unfortunately, I can't manage to do that.
Here is what I tried:
$BIN/mpirun --mca pml ob1 --mca btl tcp,self --mca ptl_tcp_remote_connections 1 --mca btl_tcp_if_include '10.233.0.0/19' -np 4 --oversubscribe -H ib1n,ib2n bash -c 'echo $PMIX_SERVER_URI2'
The expected result would be a list of IP addresses in the 10.233.0.0/19 subnet, but instead I get this:
2659516416.2;tcp4://127.0.0.1:46777
2659516416.2;tcp4://127.0.0.1:46777
2659516416.1;tcp4://127.0.0.1:45055
2659516416.1;tcp4://127.0.0.1:45055
Could you help me debug this problem?
IP addresses in the desired subnet are available on the nodes:
$BIN/mpirun --mca pml ob1 --mca btl tcp,self --mca ptl_tcp_remote_connections 1 --mca btl_tcp_if_include '10.233.0.0/19' -np 4 --oversubscribe -H ib1n,ib2n ip addr show dev br0
This returns a set of bridge interfaces that look like this:
9: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 94:de:80:ba:37:e4 brd ff:ff:ff:ff:ff:ff
inet 141.76.49.17/26 brd 141.76.49.63 scope global br0
valid_lft forever preferred_lft forever
inet 10.233.0.82/19 scope global br0
valid_lft forever preferred_lft forever
inet6 2002:8d4c:3001:48:40de:80ff:feba:37e4/64 scope global deprecated mngtmpaddr dynamic
valid_lft 59528sec preferred_lft 0sec
inet6 fe80::96de:80ff:feba:37e4/64 scope link tentative dadfailed
valid_lft forever preferred_lft forever
<three others are similar>
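As a quick sanity check of the mismatch, here is a minimal sketch using Python's standard ipaddress module. It assumes the URI strings have the form shown in the output above (an identifier, a semicolon, then tcp4://host:port); the helper names uri_host and in_subnet are made up for illustration and are not part of Open MPI or PMIx. It tests the reported server addresses and the two IPv4 addresses on br0 against the subnet passed to btl_tcp_if_include.

```python
import ipaddress

# Subnet given to --mca btl_tcp_if_include in the mpirun command.
SUBNET = ipaddress.ip_network("10.233.0.0/19")


def uri_host(uri: str) -> str:
    """Extract the host from a string like '2659516416.2;tcp4://127.0.0.1:46777'.

    Assumes the '<id>;tcp4://<host>:<port>' layout seen in the output above.
    """
    endpoint = uri.split(";", 1)[1]         # 'tcp4://127.0.0.1:46777'
    hostport = endpoint.split("://", 1)[1]  # '127.0.0.1:46777'
    return hostport.rsplit(":", 1)[0]       # '127.0.0.1'


def in_subnet(addr: str) -> bool:
    """Return True if addr lies inside SUBNET."""
    return ipaddress.ip_address(addr) in SUBNET


# Values copied from the mpirun and 'ip addr' output in this message.
reported_uris = [
    "2659516416.2;tcp4://127.0.0.1:46777",
    "2659516416.1;tcp4://127.0.0.1:45055",
]
br0_addresses = ["141.76.49.17", "10.233.0.82"]

for uri in reported_uris:
    host = uri_host(uri)
    print(f"PMIx server URI host {host}: in subnet = {in_subnet(host)}")

for addr in br0_addresses:
    print(f"br0 address {addr}: in subnet = {in_subnet(addr)}")
```

So the interface does carry an address in 10.233.0.0/19, while the URIs point at loopback, which lies outside it.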
What is even more puzzling is that if I attach a debugger at opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_components.c around line 500, I see that mca_ptl_tcp_component.remote_connections is false. This suggests that the component parameters I set are being ignored.
--
Regards,
Maksym Planeta