Discussion:
[OMPI users] MPI_Accumulate() Blocking?
Benjamin Brock
2017-05-03 23:25:43 UTC
MPI_Accumulate() is meant to be non-blocking, and MPI will block until
completion when an MPI_Win_flush() is called, correct?

In this (https://hastebin.com/raw/iwakacadey) microbenchmark,
MPI_Accumulate() seems to be blocking for me in OpenMPI 1.10.6.

I'm seeing timings like

[***@nid00622 junk]$ mpirun -n 4 ./junk
Write: 0.499229 rq, 0.000018 fl; Read: 0.463764 rq, 0.000035 fl
Write: 0.464914 rq, 0.000012 fl; Read: 0.419703 rq, 0.000024 fl
Write: 0.499686 rq, 0.000014 fl; Read: 0.422557 rq, 0.000023 fl
Write: 0.437960 rq, 0.000015 fl; Read: 0.396530 rq, 0.000023 fl

Meaning up to half a second is being spent issuing requests, but almost no
time is spent in flushes. The time spent in requests scales with the size
of the messages, but the time spent in flushes stays the same.

I'm compiling this with mpicxx acc.cpp -o acc -std=gnu++11 -O3.
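In outline, the pattern being timed looks roughly like this (a simplified sketch of what the benchmark does, not the exact code from the link; the buffer size and target rank are placeholders):

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int n = 1 << 20;                  // elements per process (placeholder)
  std::vector<int> local(n, 0), src(n, 1);

  MPI_Win win;
  MPI_Win_create(local.data(), n * sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_lock_all(0, win);               // passive-target epoch

  int target = (rank + 1) % nprocs;

  double t0 = MPI_Wtime();
  MPI_Accumulate(src.data(), n, MPI_INT, target, 0, n, MPI_INT, MPI_SUM, win);
  double t_rq = MPI_Wtime() - t0;         // time to issue the request

  t0 = MPI_Wtime();
  MPI_Win_flush(target, win);             // blocks until remote completion
  double t_fl = MPI_Wtime() - t0;

  printf("Write: %f rq, %f fl\n", t_rq, t_fl);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}

If the accumulate really were non-blocking, I'd expect essentially all of the time to show up in the flush instead of in the request.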

Any suggestions? Am I using this incorrectly?

Ben
Marc-André Hermanns
2017-05-04 12:53:03 UTC
Dear Benjamin,

As far as I understand the MPI standard, RMA operations are non-blocking
in the sense that you need to complete them with a separate call
(flush/unlock/...).

I cannot find the exact place in the standard right now, but I believe
an implementation is allowed either to buffer RMA requests or to block
until the RMA operation can be initiated, and the user should not assume
one behavior or the other. I have seen both across implementations in
the past.

For your second question, yes, flush is supposed to block until remote
completion of the operation.

That said, I seem to recall that Open-MPI 1.x did not support
asynchronous target-side progress for passive-target synchronization
(which is used in your benchmark example), so the behavior you
observed is to some extent expected.

Cheers,
Marc-Andre
Post by Benjamin Brock
MPI_Accumulate() is meant to be non-blocking, and MPI will block until
completion when an MPI_Win_flush() is called, correct?
In this (https://hastebin.com/raw/iwakacadey) microbenchmark,
MPI_Accumulate() seems to be blocking for me in OpenMPI 1.10.6.
I'm seeing timings like
Write: 0.499229 rq, 0.000018 fl; Read: 0.463764 rq, 0.000035 fl
Write: 0.464914 rq, 0.000012 fl; Read: 0.419703 rq, 0.000024 fl
Write: 0.499686 rq, 0.000014 fl; Read: 0.422557 rq, 0.000023 fl
Write: 0.437960 rq, 0.000015 fl; Read: 0.396530 rq, 0.000023 fl
Meaning up to half a second is being spent issuing requests, but
almost no time is spent in flushes. The time spent in requests scales
with the size of the messages, but the time spent in flushes stays the
same.
I'm compiling this with mpicxx acc.cpp -o acc -std=gnu++11 -O3.
Any suggestions? Am I using this incorrectly?
Ben
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
--
Marc-Andre Hermanns
Jülich Aachen Research Alliance,
High Performance Computing (JARA-HPC)
Jülich Supercomputing Centre (JSC)

Wilhelm-Johnen-Str.
52425 Jülich
Germany

Phone: +49 2461 61 2509 | +49 241 80 24381
Fax: +49 2461 80 6 99753
www.jara.org/jara-hpc
email: ***@fz-juelich.de
Benjamin Brock
2017-05-04 19:27:45 UTC
Is there any way to issue simultaneous MPI_Accumulate() requests to
different targets, then? I need to update a distributed array, and this
serializes all of the communication.
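
For concreteness, the kind of thing I want to write is roughly the following (just a sketch; the window setup, chunk size, and offsets are placeholders), with the per-target accumulates hopefully overlapping:

#include <mpi.h>
#include <vector>

// Scatter-add a local buffer into a distributed array: one MPI_Accumulate
// per target inside a single passive-target epoch, completed with one
// flush. Whether the requests actually overlap is up to the implementation.
// src must hold nprocs * chunk elements.
void accumulate_to_all(MPI_Win win, const std::vector<int> &src,
                       int chunk, int nprocs) {
  MPI_Win_lock_all(0, win);
  for (int target = 0; target < nprocs; ++target) {
    MPI_Accumulate(src.data() + target * chunk, chunk, MPI_INT,
                   target, /*target_disp=*/0, chunk, MPI_INT, MPI_SUM, win);
  }
  MPI_Win_flush_all(win);   // wait for remote completion of all requests
  MPI_Win_unlock_all(win);
}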

Ben

On Thu, May 4, 2017 at 5:53 AM, Marc-André Hermanns wrote:
Post by Marc-André Hermanns
Dear Benjamin,
As far as I understand the MPI standard, RMA operations are non-blocking
in the sense that you need to complete them with a separate call
(flush/unlock/...).
I cannot find the exact place in the standard right now, but I believe
an implementation is allowed either to buffer RMA requests or to block
until the RMA operation can be initiated, and the user should not assume
one behavior or the other. I have seen both across implementations in
the past.
For your second question, yes, flush is supposed to block until remote
completion of the operation.
That said, I seem to recall that Open-MPI 1.x did not support
asynchronous target-side progress for passive-target synchronization
(which is used in your benchmark example), so the behavior you
observed is to some extent expected.
Cheers,
Marc-Andre
Marc-André Hermanns
2017-05-05 07:39:42 UTC
Ben,

I would regard the serialization as an implementation issue, not a
standards issue, so it would still be a valid approach to perform
the operations the way the benchmark does.

As far as I know, Nathan Hjelm did a major overhaul of the RMA
handling in Open-MPI 2.x, so my first suggestion would be to update
your installation to the latest Open-MPI and check the outcome.

That said, I think I saw similar issues with a local installation of
Open-MPI 2.0.1 that I wanted to talk to Nathan about. I still have to
investigate this further, as I currently cannot rule out a
user/configuration error on my part.

The general problem here is that in passive-target synchronization the
target cannot easily 'help' in getting things done. If your operation
needs anything that the NIC cannot do on its own via DMA, you will
need to get the target involved somehow.

Off the cuff, I can think of three ways of handling such situations
(there might be more, but I am not an implementor):

(1) Have a separate progress-thread running on the target to handle
RMA operations transparently.

(2) Have the target react to interrupts issued by the NIC to handle
incoming communication.

(3) Have the RMA engine check pending requests every time MPI is called.

I think Open-MPI 1.x was using approach (3), but Nathan should correct
me if I am wrong. Version 2.x should offload as much as possible to
the NIC, but may still need target intervention on some operations.
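
If progress is indeed only made when the target calls into MPI, one
common (if ugly) workaround is to have the target processes call into
the library from time to time so that pending one-sided operations get
serviced. A minimal sketch (not Open-MPI-specific):

#include <mpi.h>

// Calling this occasionally from the target's compute loop gives the MPI
// library a chance to run its progress engine and service incoming RMA
// requests; the probe result itself is ignored.
inline void poke_mpi_progress() {
  int flag = 0;
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
             &flag, MPI_STATUS_IGNORE);
}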

@Nathan: Do you have any suggestions on tuning for the Open-MPI
implementation?

Cheers,
Marc-Andre
Post by Benjamin Brock
Is there any way to issue simultaneous MPI_Accumulate() requests to
different targets, then? I need to update a distributed array, and
this serializes all of the communication.
Ben
--
Marc-Andre Hermanns
Jülich Aachen Research Alliance,
High Performance Computing (JARA-HPC)
Jülich Supercomputing Centre (JSC)

Wilhelm-Johnen-Str.
52425 Jülich
Germany

Phone: +49 2461 61 2509 | +49 241 80 24381
Fax: +49 2461 80 6 99753
www.jara.org/jara-hpc
email: ***@fz-juelich.de