Discussion:
[OMPI users] Bottleneck of OpenMPI over 100Gbps ROCE
Lizhaogeng
2017-08-21 07:49:42 UTC
Hi all,

Sorry for resubmitting this question; I forgot to add a subject to my last
email.

I ran into a problem while testing the performance of Open MPI over 100Gbps
RoCE.
I have two servers, each with a Mellanox 100Gbps ConnectX-4 RoCE NIC,
connected to each other.
I used the Intel MPI Benchmarks to test the performance of Open MPI (1.10.3)
over RDMA.
The PingPong benchmark (2 ranks, one rank per server) reached only 6 GB/s
(with the openib BTL).
With the OSU MPI benchmarks, the bandwidth reached only 6.5 GB/s.
However, when I start two benchmarks at the same time (two ranks per server),
the total bandwidth reaches about 11 GB/s.
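(Roughly, the runs were along these lines; hostnames and benchmark paths are
placeholders:
mpirun -np 2 -host <host1>,<host2> --mca btl openib,self,sm IMB-MPI1 PingPong
mpirun -np 2 -host <host1>,<host2> --mca btl openib,self,sm osu_bw )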

It seems that the CPU is the bottleneck, but the bottleneck is clearly not
memcpy.
RDMA itself should not consume much CPU, since perftest's ib_write_bw easily
reaches 11 GB/s.
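(The raw RDMA test was along the lines of the following; the device name is
illustrative:
on the server: ib_write_bw -d mlx5_1 -F
on the client: ib_write_bw -d mlx5_1 -F <server-hostname> )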

Is this bandwidth limit normal?
Does anyone know what the real bottleneck is?

Thanks in advance for your kind help.

Regards,
Zhaogeng
Joshua Ladd
2017-08-25 17:10:28 UTC
Hi,

There is a known issue in ConnectX-4 which impacts RDMA_READ bandwidth with
a single QP. The overhead in the HCA of processing a single RDMA_READ
response packet is too high due to the need to lock the QP. With a small
MTU (as is the case with Ethernet packets), the impact is magnified because
the overhead is larger relative to the packet size. The openib BTL uses
RDMA_READ for large messages, while HPC-X (MXM and UCX) uses RDMA_WRITE on
ConnectX-4 to work around the issue. So, there are two paths forward:

1. Install MXM or UCX, rebuild OMPI, and then use the yalla PML (see the
sketch after the benchmark output below).

or

2. Enlarge the default MTU for Ethernet packets to 9K and continue using the
openib BTL:
ifconfig p3p1 mtu 9000
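(The RoCE path MTU is derived from the interface MTU, so after this change
ibv_devinfo -d mlx5_1 | grep active_mtu
should report an active_mtu of 4096 rather than 1024.)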
mpirun -np 2 --mca btl openib,self,sm --mca btl_openib_if_include mlx5_1 \
  --mca btl_openib_cpc_include rdmacm --mca btl_openib_use_eager_rdma 1 \
  -mca pml ob1 -mca mtl ^mxm --map-by dist:span -mca rmaps_dist_device mlx5_1 \
  -mca coll_hcoll_enable 0 -host thor033,thor034 \
  /home/osu-micro-benchmarks-5.3.2-ompi-1.10/mpi/pt2pt/osu_bw


# OSU MPI Bandwidth Test v5.3.2
# Size        Bandwidth (MB/s)
1             4.46
2             8.91
4             17.96
8             34.27
16            69.47
32            128.00
64            259.73
128           469.99
256           903.70
512           1643.33
1024          3068.16
2048          5050.94
4096          7537.21
8192          9754.20
16384         9808.94
32768         11360.52
65536         11688.18
131072        11858.37
262144        11931.31
524288        11966.42
1048576       11984.15
2097152       11981.27
4194304       11980.92
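For path 1, a rough sketch of the MXM/yalla route (install paths and hostnames
are illustrative, adjust for your setup):

./configure --with-mxm=/opt/mellanox/mxm --prefix=/opt/openmpi-1.10
make -j install
mpirun -np 2 -host thor033,thor034 --mca pml yalla \
  /home/osu-micro-benchmarks-5.3.2-ompi-1.10/mpi/pt2pt/osu_bw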
Lizhaogeng
2017-08-28 10:17:08 UTC
Hi Joshua,

Thank you very much for your help.
I'm replying so late because I wanted to confirm the proposed MXM solution
first.
Unfortunately, I haven't managed to get MXM working yet (the two servers
cannot communicate with MXM).
I'll let you know whether MXM solves the problem once I sort that out.
For now, I think your explanation is right, because ib_read_bw also hits a
bandwidth limit.
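(The check was with the standard perftest read test, along the lines of:
on the server: ib_read_bw -d mlx5_1 -F
on the client: ib_read_bw -d mlx5_1 -F <server-hostname>
which, with a single QP, also stays well below the ib_write_bw result.)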

Thanks,
Zhaogeng