Discussion:
[OMPI users] Bandwidth efficiency advice
marcin.krotkiewski
2017-05-26 10:43:39 UTC
Dear All,

I would appreciate some general advice on how to efficiently implement
the following scenario.

I am looking into how to send a large amount of data over IB _once_, to
multiple receivers. The trick is, of course, that while the ping-pong
benchmark delivers great bandwidth, it does so by re-using the already
registered memory buffers. Since I need to send the data once, the
memory registration penalty is not easily avoided. I've been looking
into the following approaches:

1. have multiple ranks send different parts of the data to different
receivers, in the hope that the memory registration cost will be hidden
2. pre-register two smaller buffers, into which the data is copied before
sending

The first approach is the best I've managed so far, but the bandwidth
reached is still lower than what I observe using the pingpong benchmark.
Also, the performance depends on the number of sending ranks and drops
if there are too many.

In the second approach one pays for a data copy. My thinking was that
since the effective memory bandwidth available on a single modern CPU is
larger than the IB bandwidth, I could squeeze out some performance by
combining double buffering and multithreading, e.g.,

Step 1. thread A sends the data in the current buffer. Behind the
scenes, thread B copies data from memory to the next buffer
Step 2. buffers are switched
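
The two steps above can be sketched in C with POSIX threads. This is only an illustration under stated assumptions: `send_double_buffered`, `send_fn`, `capture` and `capture_send` are hypothetical names, the send is a callback stub standing in for the real transfer (for example an MPI send from a pre-registered buffer), and a helper thread is created per chunk for clarity rather than speed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical send callback: stands in for the real transfer,
 * e.g. an MPI send out of a pre-registered staging buffer. */
typedef void (*send_fn)(const char *buf, size_t len, void *ctx);

/* Work description handed to the copying thread ("thread B"). */
struct copy_job {
    char *dst;
    const char *src;
    size_t len;
};

static void *copy_worker(void *arg)
{
    struct copy_job *job = arg;
    memcpy(job->dst, job->src, job->len);
    return NULL;
}

/* Illustrative stand-in for the network: appends each chunk to `out`. */
struct capture {
    char *out;
    size_t used;
};

static void capture_send(const char *buf, size_t len, void *ctx)
{
    struct capture *cap = ctx;
    memcpy(cap->out + cap->used, buf, len);
    cap->used += len;
}

/* Sends `total` bytes of `data` in pieces of at most `chunk` bytes.
 * While the caller's thread ("thread A") sends the current staging
 * buffer, a helper thread fills the other one with the next piece.
 * Returns 0 on success, -1 if a helper thread could not be started. */
int send_double_buffered(const char *data, size_t total,
                         char *staging[2], size_t chunk,
                         send_fn send, void *ctx)
{
    if (total == 0)
        return 0;
    assert(chunk > 0);

    size_t offset = 0;
    size_t len = total < chunk ? total : chunk;
    int cur = 0;

    /* Prime the first buffer; nothing can overlap with this copy. */
    memcpy(staging[cur], data, len);

    while (offset < total) {
        size_t next_off = offset + len;
        size_t next_len = 0;
        pthread_t tid;
        struct copy_job job;

        if (next_off < total) {
            size_t rest = total - next_off;
            next_len = rest < chunk ? rest : chunk;
            job.dst = staging[1 - cur];
            job.src = data + next_off;
            job.len = next_len;
            if (pthread_create(&tid, NULL, copy_worker, &job) != 0)
                return -1;
        }

        /* Step 1: send the current buffer while the copy proceeds. */
        send(staging[cur], len, ctx);

        if (next_len > 0)
            pthread_join(tid, NULL);

        /* Step 2: switch buffers. */
        offset = next_off;
        len = next_len;
        cur = 1 - cur;
    }
    return 0;
}
```

In a real program the callback would wrap the actual send and the two staging buffers would be the pre-registered ones. Whether the copy is really hidden behind the transfer depends on memory bandwidth contention, which a sketch like this cannot show.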

A similar idea would be to use MPI_Get on the remote rank. The sender
would copy the data from the memory to the second buffer while the RMA
window with the first buffer is exposed. In theory, I would expect those
two operations to be executed simultaneously, with the memory copy
hopefully hidden behind the IB transfer.

Of course, the experiments didn't really work. While the first
(multi-rank) approach is OK and shows some improvement, the bandwidth
could still be improved. None of my double-buffering approaches worked
at all, possibly because of memory bandwidth contention.

So I was wondering, has any of you had any experience with similar
approaches? In your experience, what would be the best approach?

Thanks a lot!

Marcin
George Bosilca
2017-05-26 13:13:33 UTC
If you have multiple receivers, then use MPI_Bcast. It performs all the
necessary optimizations, so that MPI users do not have to struggle to
adapt/optimize their application for a specific architecture/network.

George.


