Discussion:
[OMPI users] RDMA over Ethernet in Open MPI - RoCE on AWS?
Benjamin Brock
2018-09-07 02:10:03 UTC
Permalink
I'm setting up a cluster on AWS, which will have a 10Gb/s or 25Gb/s
Ethernet network. Should I expect to be able to get RoCE to work in Open
MPI on AWS?

More generally, what optimizations and performance tuning can I do to an
Open MPI installation to get good performance on an Ethernet network?

My codes use a lot of random access AMOs and asynchronous block transfers,
so it seems to me like setting up RDMA over Ethernet would be essential to
getting good performance, but I can't seem to find much information about
it online.
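
To give a concrete sense of the access pattern, here is a rough sketch in MPI RMA terms (illustrative only, not my actual code; the window size and transfer counts are made up):

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const MPI_Aint n = 1 << 20;             /* slots per rank */
    int64_t *base;
    MPI_Win win;
    MPI_Win_allocate(n * sizeof(int64_t), sizeof(int64_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    MPI_Win_lock_all(0, win);               /* passive-target epoch */

    /* Random-access AMO: atomically bump a counter on a random peer. */
    int target = rand() % nranks;
    MPI_Aint slot = rand() % n;
    int64_t one = 1, old;
    MPI_Fetch_and_op(&one, &old, MPI_INT64_T, target, slot, MPI_SUM, win);

    /* Asynchronous block transfer: start a bulk get and overlap it. */
    int64_t buf[4096];
    MPI_Request req;
    MPI_Rget(buf, 4096, MPI_INT64_T, target, 0, 4096, MPI_INT64_T, win, &req);
    /* ... unrelated local work would go here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}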

Any pointers you have would be appreciated.

Ben
John Hearns via users
2018-09-07 06:58:01 UTC
Permalink
Ben, ping me off list. I know the guy who heads the HPC Solutions
Architect team for AWS and an AWS Solutions Architect here in the UK.
Barrett, Brian via users
2018-09-10 21:51:49 UTC
Permalink
It sounds like what you’re asking is “how do I get the best performance from Open MPI in AWS?”.

The TCP BTL is your best option for performance in AWS. RoCE is going to be a bunch of work to get set up, and you’ll still end up with host processing of every packet. There are a couple of simple instance tweaks that can make a big difference. AWS has published a very nice guide for setting up an EDA workload environment [1], which has a number of useful tweaks, particularly if you’re using C4 or earlier compute instances. The biggest improvement, however, is to make sure you’re using Open MPI 2.1.2 or newer. We fixed some fairly serious performance issues in the Open MPI TCP stack (which, humorously enough, were also in the MPICH TCP stack and have been fixed there as well) in 2.1.2.

Given that your application is fairly asynchronous, you might want to experiment with the btl_tcp_progress_thread MCA parameter. If your application benefits from asynchronous progress, using a progress thread might be the best option.
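
To make that concrete, the pattern that benefits is posting communication early and completing it late, along the lines of the sketch below (a hypothetical ring exchange, not a recipe; adjust process counts and message sizes for your setup):

/* Sketch of a nonblocking exchange that can benefit from a TCP progress
 * thread.  Launch with something like (illustrative command line):
 *   mpirun -np 4 --mca btl_tcp_progress_thread 1 ./ring
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1 << 20 };                   /* 8 MB of doubles per message */
    double *sendbuf = calloc(N, sizeof(double));
    double *recvbuf = calloc(N, sizeof(double));
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    /* Post communication up front ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do unrelated local work here; with btl_tcp_progress_thread=1
     * the transfers can keep moving while the application is outside
     * the MPI library ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}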

Brian
Benjamin Brock
2018-09-11 17:46:02 UTC
Permalink
Thanks for your response.

One question: why would RoCE still require host processing of every
packet? I thought the point was that some nice server Ethernet NICs can
handle RDMA requests directly? Or am I misunderstanding RoCE, or how Open
MPI's RoCE transport works?

Ben
Jeff Hammond
2018-09-11 19:06:29 UTC
Permalink
Are you trying to run UPC++ over MPI in the cloud?

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Barrett, Brian via users
2018-09-27 15:50:52 UTC
Permalink
On Sep 11, 2018, at 10:46 AM, Benjamin Brock <***@cs.berkeley.edu> wrote:

Thanks for your response.

One question: why would RoCE still require host processing of every packet? I thought the point was that some nice server Ethernet NICs can handle RDMA requests directly? Or am I misunderstanding RoCE, or how Open MPI's RoCE transport works?

Sorry, I missed your follow-up question.

There’s nothing that says that RoCE *must* be implemented in the NIC. It is entirely possible to write a host-side kernel driver to implement the RoCE protocol. My point was that if you were to do this, you wouldn’t have any of the benefits that people expect with RoCE, but the protocol would work just fine. This is similar to how you can write a VERBS implementation over DPDK and run the entire protocol in user space (https://github.com/zrlio/urdma). While I haven’t tested the urdma package, both the Intel 82599 and ENA support DPDK, so if what you’re looking for is a VERBS stack, that might be one option. Personally, if Open MPI is your goal, I’d just use Open MPI over TCP, because the alternatives sound like a lot of configuration headaches.
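
If you do end up experimenting with a verbs stack (hardware or software), a quick sanity check is whether any verbs device is visible at all; a few lines of standard libibverbs (nothing AWS-specific; compile with -libverbs) will tell you:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    /* Enumerate whatever verbs providers are registered on this host. */
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        printf("no verbs devices found\n");
    } else {
        for (int i = 0; i < num; ++i)
            printf("verbs device: %s\n", ibv_get_device_name(list[i]));
    }
    if (list)
        ibv_free_device_list(list);
    return 0;
}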

Brian
