Discussion:
[OMPI users] Message reception not getting pipelined with TCP
Samuel Thibault
2017-07-20 22:42:17 UTC
Hello,

We are hitting a severe performance issue, which is due to missing
pipelining behavior in Open MPI when running over TCP. I have attached
a test case. Basically, what it does is

if (myrank == 0) {
    for (i = 0; i < N; i++)
        MPI_Isend(...);
} else {
    for (i = 0; i < N; i++)
        MPI_Irecv(...);
}
for (i = 0; i < N; i++)
    MPI_Wait(...);

with corresponding printfs. And the result is:

0.182620: Isend 0 begin
0.182761: Isend 0 end
0.182766: Isend 1 begin
0.182782: Isend 1 end
...
0.183911: Isend 49 begin
0.183915: Isend 49 end
0.199028: Irecv 0 begin
0.199068: Irecv 0 end
0.199070: Irecv 1 begin
0.199072: Irecv 1 end
...
0.199187: Irecv 49 begin
0.199188: Irecv 49 end
0.233948: Isend 0 done!
0.269895: Isend 1 done!
...
1.982475: Isend 49 done!
1.984065: Irecv 0 done!
1.984078: Irecv 1 done!
...
1.984131: Irecv 49 done!

i.e. almost two seconds elapse between the start of the application and
the completion of the first Irecv, and then all the other Irecvs complete
immediately as well, i.e. it seems the communications were all grouped
together.

This is really bad, because in our real use case we trigger
computations after each MPI_Wait call, and we use several messages
precisely to pipeline things: the first computation can start as soon as
one message has been received, and is thus overlapped with further
receptions.
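
For reference, here is roughly what the attached test case boils down to.
This is only a minimal sketch, not the exact attached file; it assumes
2 MPI ranks and N = 50 messages of 8 MB each, with the timing printfs
inlined:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 50
#define MSG_SIZE (8 * 1024 * 1024)   /* 8 MB per message */

int main(int argc, char *argv[])
{
    int rank, i;
    double t0;
    char *buf[N];
    MPI_Request req[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++) {
        buf[i] = malloc(MSG_SIZE);
        memset(buf[i], 0, MSG_SIZE);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    if (rank == 0) {
        /* post all the sends up front */
        for (i = 0; i < N; i++) {
            printf("%f: Isend %d begin\n", MPI_Wtime() - t0, i);
            MPI_Isend(buf[i], MSG_SIZE, MPI_CHAR, 1, i, MPI_COMM_WORLD, &req[i]);
            printf("%f: Isend %d end\n", MPI_Wtime() - t0, i);
        }
    } else if (rank == 1) {
        /* post all the receives up front */
        for (i = 0; i < N; i++) {
            printf("%f: Irecv %d begin\n", MPI_Wtime() - t0, i);
            MPI_Irecv(buf[i], MSG_SIZE, MPI_CHAR, 0, i, MPI_COMM_WORLD, &req[i]);
            printf("%f: Irecv %d end\n", MPI_Wtime() - t0, i);
        }
    }

    /* wait for the requests in order, printing a timestamp for each one */
    if (rank <= 1)
        for (i = 0; i < N; i++) {
            MPI_Wait(&req[i], MPI_STATUS_IGNORE);
            printf("%f: %s %d done!\n", MPI_Wtime() - t0,
                   rank == 0 ? "Isend" : "Irecv", i);
        }

    for (i = 0; i < N; i++)
        free(buf[i]);
    MPI_Finalize();
    return 0;
}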

This problem only shows up with Open MPI on TCP: I'm not getting this
behavior with Open MPI on IB, and I'm not getting it either with MPICH
or MadMPI:

0.182168: Isend 0 begin
0.182235: Isend 0 end
0.182237: Isend 1 begin
0.182242: Isend 1 end
...
0.182842: Isend 49 begin
0.182844: Isend 49 end
0.200505: Irecv 0 begin
0.200564: Irecv 0 end
0.200567: Irecv 1 begin
0.200569: Irecv 1 end
...
0.201233: Irecv 49 begin
0.201234: Irecv 49 end
0.269511: Isend 0 done!
0.273154: Irecv 0 done!
0.341054: Isend 1 done!
0.344507: Irecv 1 done!
...
3.767726: Isend 49 done!
3.770637: Irecv 49 done!

There we do have pipelined reception.

Is there a way to get the second, pipelined behavior with Open MPI on
TCP?

Samuel
George Bosilca
2017-07-21 00:05:34 UTC
Sam,

Open MPI aggregates messages only when network constraints prevent the
messages from being delivered in a timely manner. In this particular case
I think that our delayed business card exchange and connection setup is
delaying the delivery of the first batch of messages (and the BTL will
aggregate them while waiting for the connection to be correctly set up).

Can you reproduce the same behavior after the first batch of messages?

Assuming the times shown on the left of your messages are correct, the
first MPI seems to deliver the entire set of messages significantly
faster than the second.

George.
Gilles Gouaillardet
2017-07-21 01:57:36 UTC
Sam,


This example is using 8 MB messages.

If you are fine with using more memory, and your application does not
generate too many unexpected messages, then you can bump the eager_limit.
For example,

mpirun --mca btl_tcp_eager_limit $((8*1024*1024+128)) ...

worked for me
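
The same parameter can also be set via the environment instead of the
mpirun command line, which can be handier in job scripts:

export OMPI_MCA_btl_tcp_eager_limit=$((8*1024*1024+128))

and you can check the value your build defaults to with something like

ompi_info --all | grep btl_tcp_eager_limit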


George,

In master, I thought

mpirun --mca btl_tcp_progress_thread 1 ...

would help, but it did not.
Did I misunderstand the purpose of the TCP progress thread?

Cheers,

Gilles
George Bosilca
2017-07-21 20:00:52 UTC
Post by Gilles Gouaillardet
Sam,
This example is using 8 MB messages.
If you are fine with using more memory, and your application does not
generate too many unexpected messages, then you can bump the eager_limit.
For example,
mpirun --mca btl_tcp_eager_limit $((8*1024*1024+128)) ...
worked for me
Ah, interesting. If forcing a very large eager limit helps, then the
problem might be coming from the pipelining algorithm. Not a good solution
in general, but handy to see what's going on. As many sends are available
at once, the pipelining might be overwhelmed and interleave fragments from
different requests. Let me dig a little bit here; I think I know exactly
what is going on.
Post by Gilles Gouaillardet
George,
In master, I thought
mpirun --mca btl_tcp_progress_thread 1 ...
would help, but it did not.
Did I misunderstand the purpose of the TCP progress thread?
Gilles,

In this example most of the time is spent in an MPI_* function (mainly
the MPI_Wait), so the progress thread has little opportunity to help. The
role of the progress thread is to make sure communications are progressed
when the application is not inside an MPI call.

George.
Gilles Gouaillardet
2017-07-22 04:42:50 UTC
Thanks, George, for the explanation.

With the default eager size, the first message is received *after* the
last message is sent, regardless of whether the progress thread is used
or not. Another way to put it is that MPI_Isend() (and probably
MPI_Irecv() too) does not involve any progression, so I naively thought
the progress thread would have helped here.
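
Just to illustrate what I mean by progression, and only as a diagnostic
sketch I have not actually tried (reusing the buf/req/MSG_SIZE names from
the test case above): one could drive the progress engine by hand from
inside the posting loop, for example by calling MPI_Test on an
already-posted request:

for (i = 0; i < N; i++) {
    MPI_Isend(buf[i], MSG_SIZE, MPI_CHAR, 1, i, MPI_COMM_WORLD, &req[i]);
    if (i > 0) {
        int flag;
        /* MPI_Test is a local call, but it enters the progress engine,
           so earlier sends get a chance to move before MPI_Wait */
        MPI_Test(&req[i - 1], &flag, MPI_STATUS_IGNORE);
    }
}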

Just to be 100% sure, could you please confirm this is the intended
behavior and not a bug?

Cheers,

Gilles
Samuel Thibault
2017-07-21 02:57:39 UTC
Hello,
Post by George Bosilca
Can you reproduce the same behavior after the first batch of messages?
Yes, putting a loop around the whole series of communications, even with
a 1-second pause in between, results in the same behavior being repeated.
Post by George Bosilca
Assuming the times shown on the left of your messages are correct, the
first MPI seems to deliver the entire set of messages significantly
faster than the second.
The second log was with MPICH2.

Samuel
Samuel Thibault
2017-07-21 02:57:58 UTC
Post by Gilles Gouaillardet
If you are fine with using more memory, and your application does not
generate too many unexpected messages, then you can bump the eager_limit.
For example,
mpirun --mca btl_tcp_eager_limit $((8*1024*1024+128)) ...
Thanks for the workaround! Normally we shouldn't have many unexpected
messages; the memory consumption would be concerning, though.

Samuel