Discussion:
[OMPI users] Open MPI over RoCE using breakout cable and switch
Brendan Myers
2017-01-20 22:02:46 UTC
Hello,

I am attempting to get Open MPI to run over 2 nodes using a switch and a
single breakout cable with this design:

(100GbE)QSFP <----> 2x (50GbE)QSFP



Hardware Layout:

Breakout cable module A connects to switch (100GbE)

Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)

Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)

Switch is a Mellanox SN2700 100GbE RoCE switch



* I am able to pass RDMA traffic between the nodes with perftest
(ib_write_bw) when using the breakout cable as the interconnect (IC) from
both nodes to the switch (sketched below).

* When attempting to run a job using the breakout cable as the IC,
Open MPI aborts with "failure to initialize OpenFabrics device" errors.

* If I replace the breakout cable with 2 standard QSFP cables, the
Open MPI job completes correctly.
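For reference, a sketch of the kind of ib_write_bw run used for this check;
the flags shown are illustrative rather than the exact invocation, and
<node1-addr> is a placeholder:

# node 1 (server side): listen on the RoCE device (port 1 = mlx5_0)
ib_write_bw -d mlx5_0 -i 1 -R

# node 2 (client side): connect to node 1 over the same device
ib_write_bw -d mlx5_0 -i 1 -R <node1-addr>

The -R flag sets the queue pairs up through rdma_cm, which matches the
rdmacm connection manager selected for Open MPI below.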





This is the command I use; it works unless the breakout cable is used as the
IC:

mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile
mpi-hosts-ce /usr/local/bin/IMB-MPI1



If anyone has any idea why using a breakout cable is causing my jobs
to fail, please let me know.



Thank you,



Brendan T. W. Myers

***@soft-forge.com

Software Forge Inc
Howard Pritchard
2017-01-20 23:34:42 UTC
Hi Brendan

I doubt this kind of config has gotten any testing with OMPI. Could you
rerun with

--mca btl_base_verbose 100

added to the command line and post the output to the list?
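That is, something along the lines of:

mpirun --mca btl openib,self,sm \
       --mca btl_openib_receive_queues P,65536,120,64,32 \
       --mca btl_openib_cpc_include rdmacm \
       --mca btl_base_verbose 100 \
       -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1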

Howard
Brendan Myers
2017-01-23 15:23:50 UTC
Hello Howard,

Thank you for looking into this. Attached is the output you requested. Also, I am using Open MPI 2.0.1.



Thank you,

Brendan



Howard Pritchard
2017-01-24 13:20:40 UTC
Hello Brendan,

This helps some, but it looks like we need more debug output.

Could you build a debug version of Open MPI by adding --enable-debug
to the configure options, then rerun the test with the breakout cable setup,
keeping the --mca btl_base_verbose 100 command-line option?
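A minimal sketch of the rebuild, assuming a plain tarball build; substitute
your actual configure options and install prefix:

# reconfigure with debug support, then rebuild and reinstall
./configure --enable-debug --prefix=/usr/local
make -j 8
make install

The debug build compiles in extra diagnostics that btl_base_verbose can
then print.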

Thanks

Howard
Brendan Myers
2017-01-24 16:10:46 UTC
Hello Howard,

Here is the error output after building with debug enabled. These Mellanox ConnectX-4 cards present each port as a separate device; I am using port 1 on the card, which is device mlx5_0.
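In case it helps, this is roughly how I confirm the device and port state
before a run:

# confirm mlx5_0 (port 1 on the card) is ACTIVE with an Ethernet link layer
ibv_devinfo -d mlx5_0 | grep -E 'state|link_layer'

I could also try pinning the openib BTL to that device explicitly with
--mca btl_openib_if_include mlx5_0:1, if you think that would narrow things
down.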



Thank you,

Brendan



Brendan Myers
2017-02-01 22:08:47 UTC
Hello Howard,

I was wondering if you have been able to look at this issue at all, or if anyone has any ideas on what to try next.



Thank you,

Brendan



Howard Pritchard
2017-02-03 17:52:40 UTC
Hello Brendan,

Sorry for the delay in responding. I've been on travel the past two weeks.

I traced through the debug output you sent. It provided enough information
to show that, for some reason, when using the breakout cable, Open MPI
is unable to complete the initialization it needs to use the openib BTL. It
correctly detects that the first port is not available, but for port 1 it
still fails to initialize.

To debug this further, I'd need to provide you with a custom Open MPI build
to try, one with more debug output in the suspect area.

If you'd like to go this route, let me know and I'll build a one-off library
to try to debug this problem.

One thing to do just as a sanity check is to try tcp:

mpirun --mca btl tcp,self,sm ....

with the breakout cable. If that doesn't work, then I think there may
be some network setup problem that needs to be resolved first before
trying custom Open MPI tarballs.
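A sketch of that tcp run; the interface name here is a guess for your setup,
and btl_tcp_if_include just restricts the tcp BTL to the RoCE-facing
interface so the test exercises the same link:

# hypothetical interface name; use the one carrying the 50GbE link
mpirun --mca btl tcp,self,sm \
       --mca btl_tcp_if_include eth2 \
       -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1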

Thanks,

Howard
Brendan Myers
2017-02-07 16:55:56 UTC
Hello Howard,

I am able to run my Open MPI job to completion over TCP, as you suggested, as a sanity/configuration double check. I am also able to complete the job over the RoCE fabric if I replace the breakout cable with 2 regular RoCE cables. I am willing to test some custom builds to help iron out this problem. Thank you again for your time and effort.
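For completeness, one comparison I can also run in both cable configurations
is the negotiated port attributes on each node; if the breakout link comes up
differently, it may show up in the width/speed fields:

# run once with the breakout cable and once with the standard cables
ibv_devinfo -v -d mlx5_0 | grep -E 'state|active_width|active_speed|link_layer'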



Brendan



