Discussion:
[OMPI users] Open MPI issues in an InfiniBand dual-rail configuration
Ludovic Raess
2017-07-19 12:20:38 UTC
Hi,

We have an issue on our 32-node Linux cluster regarding the use of Open MPI in an InfiniBand dual-rail configuration.

Node config:
- Supermicro dual-socket nodes with 6-core Xeon E5 v3 CPUs
- 4 Titan X GPUs
- 2 IB ConnectX FDR single-port HCAs (mlx4_0 and mlx4_1)
- CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7

IB dual-rail configuration: two independent IB switches (36 ports each); each of the two single-port IB HCAs is connected to its own IB subnet.

The nodes are additionally connected via Ethernet for admin.

------------------------------------------------------------

Consider the node topology below as valid for each of the 32 nodes in the cluster:

At the PCIe root complex level, each CPU manages two GPUs and a single IB card:
CPU0 | CPU1
mlx4_0 | mlx4_1
GPU0 | GPU2
GPU1 | GPU3
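
As a sanity check, this PCIe locality can be confirmed per node with hwloc's lstopo (a sketch; the exact option names depend on the hwloc version installed):

"lstopo --whole-io"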

MPI ranks are bound to a socket via a rankfile and are distributed across the 2 sockets of each node:
rank 0=node01 slot=0:2
rank 1=node01 slot=1:2
rank 2=node02 slot=0:2
...
rank n=nodeNN slot=0,1:2
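
The effective bindings produced by this rankfile can be verified up front with something like the following (a sketch; the per-rank "bound to socket" lines in the outputs below are this kind of report):

"mpirun -rf rankfile --report-bindings hostname"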


case 1: with a single IB HCA in use (either of the two), all ranks can communicate with each
other via openib only, regardless of their relative socket binding. The tcp btl can be
explicitly disabled since there is no tcp traffic.

"mpirun -rf rankfile --mca btl_openib_if_include mlx4_0 --mca btl self,openib a.out"

case 2: in rare cases, the topology of our MPI job is such that processes on socket 0 communicate
only with other processes on socket 0, and likewise for socket 1. In this context, the two IB rails
are effectively used in parallel, and all ranks communicate as needed via openib only, with no tcp traffic.

"mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out"

case 3: most of the time we have "cross-socket" communication between ranks on different nodes.
In this context, Open MPI falls back to tcp whenever a communication involves a rank on socket 0
and a rank on socket 1, and this slows down our jobs.

mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 a.out
[node01.octopoda:16129] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[node02.octopoda:12061] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[node02.octopoda:12062] [rank=1] openib: skipping device mlx4_0; it is too far away
[node01.octopoda:16130] [rank=0] openib: skipping device mlx4_1; it is too far away
[node02.octopoda:12062] [rank=1] openib: using port mlx4_1:1
[node01.octopoda:16130] [rank=0] openib: using port mlx4_0:1
[node02.octopoda:12062] mca: bml: Using self btl to [[11337,1],1] on node node02
[node01.octopoda:16130] mca: bml: Using self btl to [[11337,1],0] on node node01
[node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
[node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
[node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
[node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
[node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
[node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
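
For reference, the openib/bml selection messages above can be reproduced by raising the BTL verbosity (a sketch; the exact level may vary):

"mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl_base_verbose 100 a.out"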


Trying to force the use of both IB HCAs while disabling the tcp btl results in the following error:

mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out
[node02.octopoda:11818] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[node01.octopoda:15886] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[node01.octopoda:15887] [rank=0] openib: skipping device mlx4_1; it is too far away
[node02.octopoda:11819] [rank=1] openib: skipping device mlx4_0; it is too far away
[node01.octopoda:15887] [rank=0] openib: using port mlx4_0:1
[node02.octopoda:11819] [rank=1] openib: using port mlx4_1:1
[node02.octopoda:11819] mca: bml: Using self btl to [[25017,1],1] on node node02
[node01.octopoda:15887] mca: bml: Using self btl to [[25017,1],0] on node node01
-------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[25017,1],1]) is on host: node02
Process 2 ([[25017,1],0]) is on host: node01
BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
-------------------------------------
Gilles Gouaillardet
2017-07-19 12:51:48 UTC
Ludovic,

What happens here is that, by default, an MPI task will only use the
closest IB device.
Since tasks are bound to a socket, that means tasks on socket 0
will only use mlx4_0, and tasks on socket 1 will only use mlx4_1.
Because these HCAs are on independent subnets, that also means tasks
on socket 0 cannot communicate with tasks on socket 1 via the openib
btl.

So you have to explicitly direct Open MPI to use all the IB interfaces:

mpirun --mca btl_openib_ignore_locality 1 ...

I do not think that will perform optimally though :-(
For this type of setup, I would rather suggest putting all IB ports on the
same subnet.
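
For example, combined with your case 3 command line (an untested sketch):

mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 \
       --mca btl_openib_ignore_locality 1 --mca btl self,openib a.out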


Cheers,

Gilles