Discussion:
[OMPI users] Custom datatype with variable length array
Florian Lindner
2018-01-14 11:03:54 UTC
Hello,

I have a custom datatype MPI_EVENTDATA (created with MPI_Type_create_struct), which is a struct with some fixed-size fields and a variable-sized array of ints (data). I want to collect a variable number of these types (Events) from all ranks at rank 0. My current version works for a fixed-size custom datatype:

void collect()
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Get total number of Events that are to be received.
  // Summed as int so that the buffer type matches MPI_INT.
  int globalSize = static_cast<int>(events.size());
  MPI_Allreduce(MPI_IN_PLACE, &globalSize, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  std::vector<MPI_Request>   requests(globalSize);
  std::vector<MPI_EventData> recvEvents(globalSize);

  // Rank 0 posts one receive per expected Event.
  if (rank == 0) {
    for (int i = 0; i < globalSize; i++) {
      MPI_Irecv(&recvEvents[i], 1, MPI_EVENTDATA, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, &requests[i]);
    }
  }
  // Every rank (including rank 0) sends its local Events to rank 0.
  for (const auto & ev : events) {
    MPI_EventData eventdata;
    assert(ev.first.size() < 255);
    strcpy(eventdata.name, ev.first.c_str());
    eventdata.rank     = rank;
    eventdata.dataSize = ev.second.data.size();
    MPI_Send(&eventdata, 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD);
  }
  if (rank == 0) {
    MPI_Waitall(globalSize, requests.data(), MPI_STATUSES_IGNORE);
    for (const auto & evdata : recvEvents) {
      // Save in a std::multimap with evdata.name as key
      globalEvents.emplace(std::piecewise_construct,
                           std::forward_as_tuple(evdata.name),
                           std::forward_as_tuple(evdata.name, evdata.rank));
    }
  }
}

Obviously, the next step would be to allocate a buffer of size evdata.dataSize, receive the data into it, add it to the globalEvents multimap<Event> and be happy. Questions I have:

* How do I correlate the Events received in the first step with the data vectors received in the second step?
* Is there a way to use a variable sized component inside a custom MPI datatype?
* Or dump the custom datatype and use MPI_Pack instead?
* Or somehow group two succeeding messages together? (A rough, untested sketch of what I mean follows below.)
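
For the last option, this is the kind of tag-based pairing I have in mind (rough and untested; the tag scheme is only a guess):

// Rough, untested sketch: tag the header and its data message with a
// per-rank event index so rank 0 could match them up later.
int tag = 0;
for (const auto & ev : events) {
  MPI_EventData eventdata;
  assert(ev.first.size() < 255);
  strcpy(eventdata.name, ev.first.c_str());
  eventdata.rank     = rank;
  eventdata.dataSize = ev.second.data.size();
  MPI_Send(&eventdata, 1, MPI_EVENTDATA, 0, tag, MPI_COMM_WORLD);
  MPI_Send(ev.second.data.data(), eventdata.dataSize, MPI_INT, 0, tag, MPI_COMM_WORLD);
  ++tag;
}
// On rank 0 I would then keep the MPI_Status (MPI_SOURCE, MPI_TAG) of each
// header receive and post a matching MPI_Recv for the data message.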

I'm open to any good and elegant suggestions!

Thanks,
Florian
Jeff Hammond
2018-01-17 00:05:34 UTC
Post by Florian Lindner
I have a custom datatype MPI_EVENTDATA (created with MPI_Type_create_struct), which is a struct with some fixed-size fields and a variable-sized array of ints (data). I want to collect a variable number of these types (Events) from all ranks at rank 0.

MPI_EVENTDATA can only support one definition, so if you want to support a varying-width .data member, each width has to be a different datatype.

An alternative solution is to define the struct such that all of the bytes
are contiguous and just send them as MPI_BYTE with varying count. If the
variation in width is relatively narrow, you could define a single datatype
that every instance fits into. There are a bunch of advantages to the
latter, not the least of which is that you can use an O(log N) MPI_Gather
instead of O(N) Send-Recvs or MPI_Gatherv, which may be O(N) internally.
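
A minimal sketch of the second option, assuming you can cap the payload; the 64-int cap, the field names, and the one-event-per-rank simplification are all made up for illustration, and it assumes a homogeneous system so sending the struct as raw bytes is safe:

// Sketch: one fixed maximum-width, trivially-copyable record per event,
// gathered as raw bytes. With several events per rank you would gather
// the counts first and switch to MPI_Gatherv.
struct EventRecord {
  char name[255];
  int  rank;
  int  dataSize;   // number of valid entries in data[]
  int  data[64];   // invented upper bound on the variable part
};

EventRecord local{};   // ... fill from one of your Events ...
int rank, nranks;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

std::vector<EventRecord> all;
if (rank == 0) all.resize(nranks);
MPI_Gather(&local, sizeof(EventRecord), MPI_BYTE,
           all.data(), sizeof(EventRecord), MPI_BYTE, 0, MPI_COMM_WORLD);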
Post by Florian Lindner
void collect()
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // Get total number of Events that are to be received
  int globalSize = static_cast<int>(events.size());
  MPI_Allreduce(MPI_IN_PLACE, &globalSize, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  std::vector<MPI_Request>   requests(globalSize);
  std::vector<MPI_EventData> recvEvents(globalSize);
  if (rank == 0) {
    for (int i = 0; i < globalSize; i++) {
      MPI_Irecv(&recvEvents[i], 1, MPI_EVENTDATA, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, &requests[i]);
    }
MPI wildcards are the least efficient option. If you know how many
messages are going to be sent and from which ranks, you can eliminate the
wildcards. If not every rank will send data, then augment MPI_Allreduce
with an MPI_Gather of 0 or 1 and post a recv for every 1 in the output
vector. (You can elide the MPI_Allreduce if you only need globalSize at
the root.) If you know the counts and are sending either arbitrary bytes
or a single user-defined datatype, you can replace the Send + N*Recv with
MPI_Gatherv. Even if it's still O(N), a decent implementation will handle
flow control for you. Posting N receives doesn't scale and you will want
to batch up into some reasonable number if you are going to run large
jobs. Flajslik, Dinan, and Underwood have an ISC 2016 paper on this.
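
A rough sketch of the Gatherv variant for the fixed-size records (the variable-length payloads are left out, and the names are illustrative):

// Sketch: gather per-rank event counts at the root, then replace the
// wildcard receives with a single MPI_Gatherv of the fixed-size records.
int rank, nranks;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

int myCount = static_cast<int>(events.size());
std::vector<int> counts(rank == 0 ? nranks : 0);
MPI_Gather(&myCount, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

std::vector<int> displs(rank == 0 ? nranks : 0);
int total = 0;
for (int i = 0; i < (rank == 0 ? nranks : 0); ++i) {
  displs[i] = total;
  total += counts[i];
}

std::vector<MPI_EventData> sendBuf(myCount);
// ... fill sendBuf from events ...
std::vector<MPI_EventData> recvBuf(rank == 0 ? total : 0);
MPI_Gatherv(sendBuf.data(), myCount, MPI_EVENTDATA,
            recvBuf.data(), counts.data(), displs.data(), MPI_EVENTDATA,
            0, MPI_COMM_WORLD);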
Post by Florian Lindner
  }
  for (const auto & ev : events) {
    MPI_EventData eventdata;
    assert(ev.first.size() < 255);
    strcpy(eventdata.name, ev.first.c_str());
    eventdata.rank     = rank;
    eventdata.dataSize = ev.second.data.size();
    MPI_Send(&eventdata, 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD);
  }
  if (rank == 0) {
    MPI_Waitall(globalSize, requests.data(), MPI_STATUSES_IGNORE);
    for (const auto & evdata : recvEvents) {
      // Save in a std::multimap with evdata.name as key
      globalEvents.emplace(std::piecewise_construct,
                           std::forward_as_tuple(evdata.name),
                           std::forward_as_tuple(evdata.name, evdata.rank));
    }
  }
}
Obviously, the next step would be to allocate a buffer of size evdata.dataSize, receive the data into it, add it to the globalEvents multimap<Event> and be happy.
* How do I correlate the Events received in the first step with the data vectors received in the second step?
You can eliminate the first phase in favor of DSDE (
http://htor.inf.ethz.ch/publications/index.php?pub=99), which replaces the
MPI_Allreduce with Probe+Ibarrier (not literally - see paper for details).
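
The skeleton of that pattern looks roughly like this; only a sketch, with the header/payload handling from your code elided and everything funneled to rank 0:

// Sketch of the NBX/DSDE idea: synchronous nonblocking sends, a probe
// loop on the receiving side, and MPI_Ibarrier to detect termination.
std::vector<MPI_Request> sendReqs(events.size(), MPI_REQUEST_NULL);
int i = 0;
for (const auto & ev : events) {
  // ev.second.data is the variable-length payload from your post
  MPI_Issend(ev.second.data.data(), static_cast<int>(ev.second.data.size()),
             MPI_INT, 0, 0, MPI_COMM_WORLD, &sendReqs[i++]);
}

MPI_Request barrier = MPI_REQUEST_NULL;
int barrierActive = 0, done = 0;
while (!done) {
  int flag;
  MPI_Status status;
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
  if (flag) {
    int count;
    MPI_Get_count(&status, MPI_INT, &count);
    std::vector<int> payload(count);
    MPI_Recv(payload.data(), count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // ... store the payload ...
  }
  if (!barrierActive) {
    int allSent;
    MPI_Testall(static_cast<int>(sendReqs.size()), sendReqs.data(),
                &allSent, MPI_STATUSES_IGNORE);
    if (allSent) {
      MPI_Ibarrier(MPI_COMM_WORLD, &barrier);
      barrierActive = 1;
    }
  } else {
    MPI_Test(&barrier, &done, MPI_STATUS_IGNORE);
  }
}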
Post by Florian Lindner
* Is there a way to use a variable sized component inside a custom MPI datatype?
No. You can use MPI_Type_create_resized to avoid creating a new datatype,
but in your case, you'll be resizing the contiguous type inside of your
struct type, which probably doesn't save you anything.
Post by Florian Lindner
* Or dump the custom datatype and use MPI_Pack instead?
Writing your own serialization is likely to be faster. MPI doesn't have
any magic here and the generic implementation inside of an MPI library
can't leverage the specifics in your code that may be amenable to compiler
optimization.
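
For instance, flattening one Event by hand into a single byte buffer could look like this (a sketch that assumes a homogeneous system and reuses the field names from your code, placed inside your per-event send loop):

// Sketch: serialize one Event into a contiguous byte buffer and send it
// in a single message; the receiver reads the fields back in the same order.
std::vector<char> buffer;
auto append = [&buffer](const void * p, size_t n) {
  const char * c = static_cast<const char *>(p);
  buffer.insert(buffer.end(), c, c + n);
};

int nameLen  = static_cast<int>(ev.first.size());
int dataSize = static_cast<int>(ev.second.data.size());
append(&nameLen, sizeof(int));
append(ev.first.data(), nameLen);
append(&rank, sizeof(int));
append(&dataSize, sizeof(int));
append(ev.second.data.data(), dataSize * sizeof(int));

MPI_Send(buffer.data(), static_cast<int>(buffer.size()), MPI_BYTE, 0, 0, MPI_COMM_WORLD);
// Receiver: MPI_Probe + MPI_Get_count(&status, MPI_BYTE, &len) to size the
// buffer, MPI_Recv it, then unpack the fields in the same order.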

You might also look at Boost.MPI, which plays nicely with
Boost.Serialization (
http://www.boost.org/doc/libs/1_54_0/doc/html/mpi/tutorial.html). While
Boost.MPI does not support recent features of MPI, it supports the most
widely used ones, perhaps all of the ones you need.
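
A minimal sketch of what that buys you; the Event type below is illustrative, not your actual class, and it sends one Event per rank for brevity:

// Sketch: with Boost.MPI + Boost.Serialization, a variable-length member
// is handled for you; collectives like gather work on serializable types.
#include <boost/mpi.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
#include <string>
#include <vector>

struct Event {
  std::string name;
  int rank;
  std::vector<int> data;

  template <class Archive>
  void serialize(Archive & ar, const unsigned int /*version*/) {
    ar & name & rank & data;
  }
};

int main(int argc, char ** argv) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;

  Event ev{"my_event", world.rank(), {1, 2, 3}};
  std::vector<Event> all;                 // filled only on the root
  boost::mpi::gather(world, ev, all, 0);  // one Event per rank here
  return 0;
}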

If you refuse to use Boost for whatever reason, I am sympathetic, since that
was my position back when I had to use less-than-awesome compilers.
Post by Florian Lindner
* Or somehow group two succeeding messages together?
DSDE is probably the best generic implementation of what you are doing.
It's possible that a two-phase implementation wins when the specific usage
allows you to use a more efficient collective algorithm.
Post by Florian Lindner
I'm open to any good and elegant suggestions!
I won't guarantee that any of my suggestions satisfies either property :-)

Best,

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/