Discussion:
[OMPI users] Network performance over TCP
Adam Sylvester
2017-07-09 13:10:14 UTC
I am using Open MPI 2.1.0 on RHEL 7. My application has one unavoidable
pinch point where a large amount of data needs to be transferred (about 8
GB of data needs to be both sent to and received from all other ranks), and
I'm seeing worse performance than I would expect; this step has a major
impact on my overall runtime. In the real application, I am using
MPI_Alltoall() for this step, but for the purpose of a simple benchmark, I
reduced it to a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two
ranks.
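
For reference, the benchmark boils down to something like the following
minimal sketch (timing and error checking omitted; 256M doubles works out
to the 2 GB buffer):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 2 GB buffer: 256M doubles * 8 bytes each; the contents don't matter
     * for a bandwidth test, so the buffer is left uninitialized. */
    const int count = 256 * 1024 * 1024;
    double* buffer = (double*)malloc((size_t)count * sizeof(double));

    if (rank == 0)
    {
        MPI_Send(buffer, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(buffer, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    free(buffer);
    MPI_Finalize();
    return 0;
}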

I'm running this in AWS with instances that have 10 Gbps connectivity in
the same availability zone (according to tracepath, there are no hops
between them) and MTU set to 8801 bytes. Doing a non-MPI benchmark of
sending data directly over TCP between these two instances, I reliably get
around 4 Gbps. Between these same two instances with MPI_Send() /
MPI_Recv(), I reliably get around 2.4 Gbps. This seems like a major
performance degradation for a single MPI operation.

I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings. I'm
connecting between instances via ssh and, I assume, using TCP for the
actual network transfer (I'm not setting any special command-line or
programmatic settings). The actual command I'm running is:
mpirun -N 1 --bind-to none --hostfile hosts.txt my_app

Any advice on other things to test or compilation and/or runtime flags to
set would be much appreciated!
-Adam
Gilles Gouaillardet
2017-07-09 13:26:45 UTC
Adam,

First, you need to change the default send and receive socket buffers:
mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
/* note this will be the default from Open MPI 2.1.2 */

Hopefully, that will be enough to greatly improve the bandwidth for
large messages.


Generally speaking, I recommend you use the latest available version
(e.g. Open MPI 2.1.1).

How many interfaces can be used to communicate between the hosts?
If there is more than one (for example a slow one and a fast one), you
should only use the fast one.
For example, if eth0 is the fast interface, that can be achieved with
mpirun --mca btl_tcp_if_include eth0 ...

You might also be able to achieve better results by using more than
one socket on the fast interface.
For example, to use 4 sockets per interface:
mpirun --mca btl_tcp_links 4 ...
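
Putting those together with your original command line would look something
like this (drop btl_tcp_if_include if there is only one interface):
mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 \
    --mca btl_tcp_if_include eth0 --mca btl_tcp_links 4 \
    -N 1 --bind-to none --hostfile hosts.txt my_app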



Cheers,

Gilles
Adam Sylvester
2017-07-09 14:04:19 UTC
Gilles,

Thanks for the fast response!

The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended
made a huge difference - this got me up to 5.7 Gbps! I wasn't aware of
these flags... from a little Googling, is
https://www.open-mpi.org/faq/?category=tcp the best place to look for this
kind of information and any other tweaks I may want to try? (Or if there's
a better FAQ out there, please let me know.)

There is only eth0 on my machines, so there's nothing to tweak there (though
good to know for the future). I also didn't see any improvement by specifying
more sockets per instance, but your initial suggestion had a major impact.

In general I try to stay relatively up to date with my Open MPI version;
I'll be extra motivated to upgrade to 2.1.2 so that I don't have to
remember to set these --mca flags on the command line. :o)

-Adam

George Bosilca
2017-07-09 16:52:12 UTC
Adam,

You can also set btl_tcp_links to 2 or 3 to allow multiple connections
between peers, for a potentially higher aggregate bandwidth.

George.
Gilles Gouaillardet
2017-07-10 04:04:56 UTC
Adam,


Thanks for letting us know your performance issue has been resolved.


Yes, https://www.open-mpi.org/faq/?category=tcp is the best place to
look for this kind of information.

I will add a reference to these parameters. I will also ask folks at AWS
if they have additional/other recommendations.


Note you have a few options before 2.1.2 (or 3.0.0) is released:

- update your system-wide config file (/.../etc/openmpi-mca-params.conf)
or user config file ($HOME/.openmpi/mca-params.conf) and add the
following lines:
btl_tcp_sndbuf = 0
btl_tcp_rcvbuf = 0

- add the following environment variables to your environment:
export OMPI_MCA_btl_tcp_sndbuf=0
export OMPI_MCA_btl_tcp_rcvbuf=0

- use Open MPI 2.0.3

- last but not least, manually download and apply the patch available at
https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc7c4693f9c1ef01dfb69f.patch
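
Whichever way you set them, you should be able to confirm the values in
effect with something like this (assuming ompi_info from the same
installation is in your PATH):
ompi_info --param btl tcp --level 9 | grep -E 'btl_tcp_(snd|rcv)buf'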


Cheers,

Gilles
Adam Sylvester
2017-07-11 01:31:34 UTC
Thanks again Gilles. Ahh, better yet - I wasn't familiar with setting these
parameters via the config file... it'll be easy to bake this into my AMI so
that I don't have to set them each time while waiting for the next Open MPI
release.

Out of mostly laziness I try to stick to the formal releases rather than
applying patches myself, but thanks for the link to the patch (the commit
comments were useful for understanding why this improves performance).

-Adam
Adam Sylvester
2017-07-12 16:44:23 UTC
I switched over to X1 instances in AWS, which have 20 Gbps connectivity.
Using iperf3, I'm seeing 11.1 Gbps between them with just one port. iperf3
supports a -P option which opens multiple parallel connections (ports)...
Setting this to use in the range of 5-20 ports (there's some variability
from run to run), I can get in the range of 18 Gbps aggregate, which seems
pretty good for a real-world speed.

Using mpirun with the previously-suggested btl_tcp_sndbuf and
btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps. So, pretty close to
iperf with just one port (it makes sense there'd be some overhead with MPI).
My understanding of the btl_tcp_links flag that Gilles mentioned is that it
should be analogous to iperf's -P flag - it should connect with multiple
ports in the hope of improving the aggregate bandwidth.

If that's what this flag is supposed to do, it does not appear to be
working properly for me. With lsof, I can see the expected number of ports
show up when I run iperf. However, with MPI I only ever see three
connections between the two machines - sshd, orted, and my actual
application. No matter what I set btl_tcp_links to, I don't see any
additional ports show up (or any change in performance).
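
For reference, I was counting the connections on each host with something
along these lines (the peer address below is just a placeholder):
lsof -n -i TCP | grep ESTABLISHED | grep <peer-ip>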

Am I misunderstanding what this flag does or is there a bug here? If I am
misunderstanding the flag's intent, is there a different flag that would
allow Open MPI to use multiple ports similar to what iperf is doing?

Thanks.
-Adam
Barrett, Brian via users
2017-07-12 18:18:54 UTC
Adam -

The btl_tcp_links flag does not currently work (for various reasons) in the 2.x and 3.x series. It's on my to-do list to fix, but I'm not sure it will get done before the 3.0.0 release. Part of the reason it hasn't been a priority is that most applications (outside of benchmarks) don't benefit from 20 Gbps between rank pairs, as they are generally talking to multiple peers at once (and can therefore drive the full 20 Gbps). It's definitely on our roadmap, but I can't promise a release just yet.

Brian

Adam Sylvester
2017-07-13 16:05:54 UTC
Bummer - thanks for the info Brian.

As an FYI, I do have a real-world use case for this faster connectivity
(i.e. beyond just a benchmark). While my application will happily gobble
up and run on however many machines it's given, there's a resource manager
that lives on top of everything and doles out machines to applications.
So there will be cases where my application will only get two machines to
run on, and I'd still like the big data transfers to happen as quickly as
possible. I agree that when there are many ranks all talking to each
other, I should hopefully get closer to the full 20 Gbps.

I appreciate that you have a number of other higher priorities, but I wanted
to make you aware that I do have a use case for it... I look forward to
using it when it's in place. :o)
