Discussion:
[OMPI users] MPI_Bcast vs. per worker MPI_Send?
David Mathog
2010-12-13 23:50:52 UTC
Is there a rule of thumb for when it is best to contact N workers with
MPI_Bcast vs. when it is best to use a loop which cycles N times and
moves the same information with MPI_Send to one worker at a time?

For that matter, other than the coding semantics, is there any real
difference between the two approaches? That is, does MPI_Bcast really
broadcast, daisy-chain, or use similar methods to reduce bandwidth
use when distributing its message, or does it just go ahead and run
MPI_Send in a loop anyway, but hide the details from the programmer?

Thanks,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Eugene Loh
2010-12-14 00:26:15 UTC
Post by David Mathog
Is there a rule of thumb for when it is best to contact N workers with
MPI_Bcast vs. when it is best to use a loop which cycles N times and
moves the same information with MPI_Send to one worker at a time?
The rule of thumb is to use a collective whenever you can. The
rationale is that the programming should be easier/cleaner and the
underlying MPI implementation has the opportunity to do something clever.
Post by David Mathog
For that matter, other than the coding semantics, is there any real
difference between the two approaches? That is, does MPI_Bcast really
broadcast, daisy-chain, or use similar methods to reduce bandwidth
use when distributing its message, or does it just go ahead and run
MPI_Send in a loop anyway, but hide the details from the programmer?
I believe most MPI implementations, including OMPI, make an attempt to
"do the right thing". Multiple algorithms are available and the best
one is chosen based on run-time conditions.

With any luck, you're better off with collective calls. Of course,
there are no guarantees.
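
For concreteness, here is a minimal sketch of the two approaches in C
(buf, count, and the function names are placeholders, and real code
would use one approach or the other, not both):

#include <mpi.h>

/* Approach 1: rank 0 loops over the workers with point-to-point
   sends; every other rank posts a matching receive. */
void distribute_loop(char *buf, int count, int rank, int nprocs)
{
    const int tag = 0;
    int i;

    if (rank == 0) {
        for (i = 1; i < nprocs; i++)
            MPI_Send(buf, count, MPI_BYTE, i, tag, MPI_COMM_WORLD);
    } else {
        MPI_Recv(buf, count, MPI_BYTE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}

/* Approach 2: one collective call on every rank; the library chooses
   the algorithm (binomial tree, pipeline, etc.) at run time. */
void distribute_bcast(char *buf, int count)
{
    MPI_Bcast(buf, count, MPI_BYTE, 0, MPI_COMM_WORLD);
}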
David Zhang
2010-12-14 01:21:13 UTC
Unless your cluster has some weird connection topology that you're trying
to take advantage of, a collective is the best bet.
--
David Zhang
University of California, San Diego
David Mathog
2010-12-14 16:54:10 UTC
So the 2/2 consensus is to use the collective. That is straightforward
for the send part of this, since all workers are sent the same data.

For the receive I do not see how to use a collective. Each worker sends
back a data structure, and the structures are of varying size. This
is almost always the case in Bioinformatics, where what is usually
coming back from each worker is a count M of the number of significant
results, M x (fixed size data per result: scores and the like), and M x
sequences or sequence alignments. M runs from 0 to Z, where in
pathological cases, Z is a very large number, and the size of the
sequences or alignments returned also varies.

The current code on the master does the following within a loop over the
N workers:

MPI_Probe
MPI_Get_count
MPI_Recv
unpack received data into a result structure
set a pointer in an array of length N to this result
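
Spelled out in C, that loop might look like the following (result_t,
unpack_result, and the tag are stand-ins for the application's own
types, routines, and protocol):

#include <mpi.h>
#include <stdlib.h>

/* results[] holds one pointer per worker rank; result_t and
   unpack_result() are placeholders for the real structure and
   unpacking code. */
void collect(result_t **results, int nprocs, int tag)
{
    int i;
    for (i = 1; i < nprocs; i++) {
        MPI_Status status;
        int nbytes;
        char *buf;

        MPI_Probe(i, tag, MPI_COMM_WORLD, &status);  /* wait for worker i */
        MPI_Get_count(&status, MPI_BYTE, &nbytes);   /* size of its message */

        buf = malloc(nbytes);
        MPI_Recv(buf, nbytes, MPI_BYTE, i, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        results[i] = unpack_result(buf, nbytes);
        free(buf);
    }
}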

So MPI_Gather isn't going to do. Possibly MPI_Gatherv would, but we
cannot know ahead of time how big the largest result is going to be,
which makes preallocating memory difficult.

Is there by any chance an "MPI_Get_counts" (a collective form of
MPI_Get_count)? That would let the preceding loop be replaced by

MPI_Get_counts
(allocate memory as needed)
MPI_Gatherv

although I guess even that wouldn't be very efficient with memory,
because there would usually be huge holes in the recv buffer.

Thanks,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Eugene Loh
2010-12-14 17:03:21 UTC
Post by David Mathog
For the receive I do not see how to use a collective. Each worker sends
back a data structure, and the structures are of varying size. This
is almost always the case in Bioinformatics, where what is usually
coming back from each worker is a count M of the number of significant
results, M x (fixed size data per result: scores and the like), and M x
sequences or sequence alignments. M runs from 0 to Z, where in
pathological cases, Z is a very large number, and the size of the
sequences or alignments returned also varies.
A collective call might not make sense in this case.

Arguably, each process could first send a size message (how much stuff
is coming) and then the actual data. In that case, you could do an
MPI_Gather of the sizes, the master could allocate space, and then an
MPI_Gatherv of the data.
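
A sketch of that two-phase pattern, assuming each worker has already
packed its results into a contiguous byte buffer (mybuf/mybytes and
the function name are illustrative, not from any existing code):

#include <mpi.h>
#include <stdlib.h>

/* Each rank contributes mybytes bytes from mybuf; rank 0 learns the
   sizes first, then allocates exactly enough for the packed payloads. */
void gather_results(char *mybuf, int mybytes, int rank, int nprocs)
{
    int *counts = NULL, *displs = NULL;
    char *recvbuf = NULL;

    /* Phase 1: fixed-size gather of the per-rank byte counts. */
    if (rank == 0)
        counts = malloc(nprocs * sizeof(int));
    MPI_Gather(&mybytes, 1, MPI_INT, counts, 1, MPI_INT,
               0, MPI_COMM_WORLD);

    /* Rank 0 computes displacements; since the payloads are packed
       end to end, the receive buffer has no holes. */
    if (rank == 0) {
        int i, total = 0;
        displs = malloc(nprocs * sizeof(int));
        for (i = 0; i < nprocs; i++) {
            displs[i] = total;
            total += counts[i];
        }
        recvbuf = malloc(total);
    }

    /* Phase 2: variable-size gather of the actual payloads.  The
       counts/displs arguments are only significant at the root, so
       workers can pass NULL. */
    MPI_Gatherv(mybuf, mybytes, MPI_BYTE,
                recvbuf, counts, displs, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Rank 0 would unpack recvbuf using counts/displs, then free. */
}

Because the receive buffer is sized from the gathered counts, this also
avoids the worst-case preallocation (and the holes) mentioned above.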

But it may make more sense for you to stick to your point-to-point
implementation. It may allow the master to operate with a smaller
footprint and it may allow first finishers to send their results back
earlier without everyone waiting for laggards.
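
For instance, a receive loop that probes MPI_ANY_SOURCE takes results
in whatever order the workers finish (result_t and unpack_result are
stand-ins for your own types, as in the earlier loop):

#include <mpi.h>
#include <stdlib.h>

/* Accept N-1 worker results in completion order, not rank order. */
void collect_any_order(result_t **results, int nprocs, int tag)
{
    int i;
    for (i = 1; i < nprocs; i++) {
        MPI_Status status;
        int nbytes;
        char *buf;

        /* Block until *some* worker has a message ready. */
        MPI_Probe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_BYTE, &nbytes);

        buf = malloc(nbytes);
        MPI_Recv(buf, nbytes, MPI_BYTE, status.MPI_SOURCE, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* status.MPI_SOURCE says which worker this result came from. */
        results[status.MPI_SOURCE] = unpack_result(buf, nbytes);
        free(buf);
    }
}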
