Discussion:
[OMPI users] rdmacm and udcm failure in 2.0.1 on RoCE
Dave Turner
2016-12-14 21:47:17 UTC
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

Local host: elf22
Local device: mlx4_2
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------

We have had no problems using 1.10.4 on RoCE but 2.0.1 fails to
find either connection manager. I've read that rdmacm may have
issues under 2.0.1 so udcm may be the only one working. Are there
any known issues with that on RoCE? Or does this just mean we
don't have RoCE configured correctly?

Dave Turner
--
Work: ***@ksu.edu (785) 532-7791
2219 Engineering Hall, Manhattan KS 66506
Home: ***@gmail.com
cell: (785) 770-5929
Nathan Hjelm
2016-12-15 04:12:16 UTC
Can you configure with --enable-debug and run with --mca btl_base_verbose 100 and provide the output? It may indicate why neither udcm nor rdmacm is available.

-Nathan
Dave Turner
2016-12-15 21:40:38 UTC
Nathan: Thanks for providing the debug flags. I've attached the
output (NetPIPE.debug1), which shows that on RoCE
udcm_component_query() will always fail. Can someone confirm
that udcm is not supported on RoCE? When I change the test to
force its use, it still does not work (NetPIPE.debug2).

[hero35][[38845,1],0][connect/btl_openib_connect_udcm.c:452:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx4_0:1
[hero35][[38845,1],0][connect/btl_openib_connect_udcm.c:501:udcm_component_query] unavailable for use on mlx4_0:1; skipped

from btl_openib_connect_udcm.c

438 static int udcm_component_query(mca_btl_openib_module_t *btl,
439                                 opal_btl_openib_connect_base_module_t **cpc)
440 {
441     udcm_module_t *m = NULL;
442     int rc = OPAL_ERR_NOT_SUPPORTED;
443
444     do {
445         /* If we do not have struct ibv_device.transport_device, then
446            we're in an old version of OFED that is IB only (i.e., no
447            iWarp), so we can safely assume that we can use this CPC. */
448 #if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && HAVE_DECL_IBV_LINK_LAYER_ETHERNET
449         if (BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)) {
450             BTL_VERBOSE(("UD CPC only supported on InfiniBand; skipped on %s:%d",
451                          ibv_get_device_name(btl->device->ib_dev),
452                          btl->port_num));
453             break;
454         }
455 #endif

from base.h

#ifdef OPAL_HAVE_RDMAOE
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl) \
(((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) || \
(IBV_LINK_LAYER_ETHERNET == ((btl)->ib_port_attr.link_layer))) ? \
true : false)
#else
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl) \
((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) ? \
true : false)
#endif

So for RoCE the transport is InfiniBand and the link layer is Ethernet,
which makes NOT_IB() evaluate to true. udcm is therefore evidently not
supported on RoCE, and it definitely fails under 1.10.4 on RoCE in our
tests. That means we need rdmacm to work, which it evidently does not
at the moment in 2.0.1. Could someone please confirm that rdmacm is
not currently working in 2.0.1? If so, I'm assuming that 2.0.1 has not
been successfully tested on RoCE?
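
The reasoning above can be checked in isolation with a minimal sketch.
It assumes nothing beyond the OPAL_HAVE_RDMAOE branch of the macro
quoted from base.h. The enum and function names below are hypothetical
stand-ins for the libibverbs constants (IBV_TRANSPORT_IB,
IBV_LINK_LAYER_ETHERNET) so that it builds without OFED headers; it is
not Open MPI code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the libibverbs enums; the values are
   illustrative only and are not the real IBV_* constants. */
enum transport_type { TRANSPORT_IB, TRANSPORT_IWARP };
enum link_layer { LINK_LAYER_INFINIBAND, LINK_LAYER_ETHERNET };

/* Same boolean logic as the OPAL_HAVE_RDMAOE branch of
   BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB, applied to plain values
   instead of a btl structure: a port counts as "not IB" if the
   transport is not IB or if the link layer is Ethernet. */
bool check_if_not_ib(enum transport_type transport, enum link_layer link)
{
    return (TRANSPORT_IB != transport) || (LINK_LAYER_ETHERNET == link);
}
```

Calling check_if_not_ib(TRANSPORT_IB, LINK_LAYER_ETHERNET), the RoCE
case, returns true, which matches the "UD CPC only supported on
InfiniBand; skipped" message in the debug output. Only the native
InfiniBand combination (TRANSPORT_IB, LINK_LAYER_INFINIBAND) returns
false.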

Dave
Brendan Myers
2016-12-16 20:35:21 UTC
Hello,

I can confirm that using these flags:

--mca btl_openib_receive_queues P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm

I am able to run Open MPI version 2.0.1 over a RoCE fabric. Hope this helps.

Thank you,
Brendan Myers
Software Forge
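
For readers unfamiliar with the receive_queues value above, here is a
minimal sketch of how its fields break down. It assumes the per-peer
("P") layout described in the Open MPI FAQ: buffer size in bytes,
number of buffers, low watermark, credit window size. The struct and
function are hypothetical names for illustration, not Open MPI's own
parser, and the sketch only accepts the four-number form used in this
thread.

```c
#include <stdio.h>

/* Hypothetical struct for illustration; not an Open MPI type. */
struct per_peer_queue {
    long buffer_size;   /* size of each receive buffer, in bytes */
    long num_buffers;   /* number of receive buffers */
    long low_watermark; /* buffer count at which more are posted */
    long credit_window; /* credit window size for flow control */
};

/* Parse a spec of the exact form "P,<size>,<num>,<low>,<window>".
   Returns 0 on success and -1 for anything else (a different queue
   type, missing fields, or trailing characters). On failure the
   contents of *q are unspecified. */
int parse_per_peer_queue(const char *spec, struct per_peer_queue *q)
{
    char type = '\0';
    char trailing = '\0';
    int n = sscanf(spec, "%c,%ld,%ld,%ld,%ld%c", &type,
                   &q->buffer_size, &q->num_buffers,
                   &q->low_watermark, &q->credit_window, &trailing);

    /* Exactly five conversions means all fields were present and
       nothing followed the last number. */
    if (n != 5 || type != 'P') {
        return -1;
    }
    return 0;
}
```

Under that reading, "P,65536,120,64,32" describes a single per-peer
queue with 120 buffers of 65536 bytes, a low watermark of 64, and a
credit window of 32.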


