Kenny, Joseph P via users
2018-11-29 17:19:57 UTC
Hi,
I'm trying to do some RoCE benchmarking on a cluster with Mellanox HCAs:
02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
MLNX_OFED_LINUX-4.4-2.0.7.0
I'm finding it quite challenging to understand which btl is actually being used based on Open MPI's debug output. I'm using Open MPI 4.0.0 (along with a handful of older releases). For example, here's a command line that I use to run a 16-node HPL test, trying to ensure that internode communication goes over a RoCE-capable btl rather than tcp:
/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 64 -N 4 -hostfile hosts.txt ./xhpl
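(As an aside, one guess I want to rule out is that a PML such as ucx, rather than any BTL, is carrying the point-to-point traffic on this build; my understanding is that Open MPI 4.0 prefers the ucx PML when it is built against UCX, in which case BTL selection may not matter. Bumping the PML verbosity alongside the BTL verbosity should show which component is actually selected, roughly:

/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun --mca btl_base_verbose 100 --mca pml_base_verbose 100 --mca btl ^tcp -n 64 -N 4 -hostfile hosts.txt ./xhpl

and "ompi_info | grep -i ucx" should show whether UCX support was compiled in at all. I haven't confirmed this is the right way to check.)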
Among the interesting debug messages I see are messages of the form:
[en257.eth:118902] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; skipped
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: en254
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[en262.eth:103810] select: init of component openib returned failure
[en264.eth:171198] select: init of component openib returned failure
[en264.eth:171198] mca: base: close: component openib closed
[en264.eth:171198] mca: base: close: unloading component openib
[en264.eth:171198] select: initializing btl component uct
[en264.eth:171198] select: init of component uct returned failure
[en264.eth:171198] mca: base: close: component uct closed
[en264.eth:171198] mca: base: close: unloading component uct
So, it looks to me like the openib and uct transports are both failing, yet when I read out RDMA counters with ethtool I see that the bulk of the traffic is going over RDMA somehow (eth2 is the MT27800; the commands I use to gather these counters are sketched after the listing below):
ib counters before:
rx_vport_rdma_unicast_packets: 115943830
rx_vport_rdma_unicast_bytes: 195602189248
tx_vport_rdma_unicast_packets: 273170117
tx_vport_rdma_unicast_bytes: 374057100818
eth0 counters before:
RX packets 87474728 bytes 43335706060 (40.3 GiB)
TX packets 61137838 bytes 71187999781 (66.2 GiB)
eth2 counters before:
RX packets 49490077 bytes 81084834515 (75.5 GiB)
TX packets 532970764 bytes 1742134134428 (1.5 TiB)
ib counters after:
rx_vport_rdma_unicast_packets: 117188033
rx_vport_rdma_unicast_bytes: 200088022302
tx_vport_rdma_unicast_packets: 274456328
tx_vport_rdma_unicast_bytes: 378587627052
eth0 counters after:
RX packets 87481208 bytes 43336915153 (40.3 GiB)
TX packets 61143485 bytes 71189606766 (66.3 GiB)
eth2 counters after:
RX packets 49490077 bytes 81084834515 (75.5 GiB)
TX packets 532970764 bytes 1742134134428 (1.5 TiB)
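(For reference, here is roughly how I'm collecting those numbers; eth2 is the ConnectX-5 port, and the rx/tx_vport_rdma counters come from the mlx5 driver's ethtool statistics:

# per-port RDMA traffic counters exposed by the mlx5 driver
ethtool -S eth2 | grep vport_rdma_unicast
# ordinary interface packet/byte counters
ifconfig eth0
ifconfig eth2
)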
Yet, looking at the debug output after xhpl runs, I only see vader and self getting unloaded. The evidence suggests that there is no working internode btl, yet the job runs properly and it looks like RDMA transfers are occurring. Equally perplexing behavior is observed when I exclude openib/uct and expect to run over tcp. What's actually going on here?
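(One more guess on my part: to really pin the run to the tcp BTL I suspect the ob1 PML has to be selected explicitly as well, so that no other PML can pick up the traffic, along the lines of:

/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun --mca pml ob1 --mca btl tcp,vader,self --mca btl_base_verbose 100 -n 64 -N 4 -hostfile hosts.txt ./xhpl

but I haven't verified that this is the right incantation.)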
I'll attach output from ompi_info along with the debug output that I'm referring to. I tried to include a compressed config.log, but the message was too big.
Thanks,
Joe