Discussion: [OMPI users] Strange benchmarks at large message sizes
Cooper Burns
2017-09-19 14:56:34 UTC
Hello,

I have been running some simple benchmarks and saw some strange behaviour.
All tests are done on 4 nodes with 24 cores each (a total of 96 MPI processes).

When I run MPI_Allreduce() I see the run time spike up (about 10x) when I
go from reducing a total of 4096 KB to 8192 KB. For example, when count is
2^21 (8192 KB of 4-byte ints):

MPI_Allreduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD)

is slower than:

MPI_Allreduce(send_buf, recv_buf, count/2, MPI_INT, MPI_SUM,
MPI_COMM_WORLD)
MPI_Allreduce(send_buf + count/2, recv_buf + count/2, count/2, MPI_INT,
MPI_SUM, MPI_COMM_WORLD)
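
For reference, a minimal self-contained sketch of the comparison being timed
could look like the following (the buffer setup, sizes and MPI_Wtime timing
here are illustrative assumptions, not the original benchmark code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 21;                  /* 2^21 ints = 8192 KB */
    int *send_buf = malloc(count * sizeof(int));
    int *recv_buf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++) send_buf[i] = 1;

    /* Variant 1: one reduction of the full buffer */
    double t0 = MPI_Wtime();
    MPI_Allreduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    /* Variant 2: two reductions of half the buffer each */
    double t2 = MPI_Wtime();
    MPI_Allreduce(send_buf, recv_buf, count / 2, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(send_buf + count / 2, recv_buf + count / 2, count / 2,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double t3 = MPI_Wtime();

    if (rank == 0)
        printf("full: %f s, split: %f s\n", t1 - t0, t3 - t2);

    free(send_buf);
    free(recv_buf);
    MPI_Finalize();
    return 0;
}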

Just wondering if anyone knows what the cause of this behaviour is.

Thanks!
Cooper


Cooper Burns
Senior Research Engineer
(608) 230-1551
convergecfd.com
Howard Pritchard
2017-09-19 20:44:02 UTC
Hello Cooper

Could you rerun your test with the following env. variable set

export OMPI_MCA_coll=self,basic,libnbc

and see if that helps?
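
Equivalently, the same selection can be passed on the mpirun command line
(./allreduce_bench below is just a placeholder for your benchmark binary):

# via the environment variable
export OMPI_MCA_coll=self,basic,libnbc
mpirun -np 96 ./allreduce_bench

# or directly as an MCA option
mpirun -np 96 --mca coll self,basic,libnbc ./allreduce_bench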

Also, what type of interconnect are you using - ethernet, IB, ...?

Howard
Post by Cooper Burns
When I run MPI_Allreduce() I see the run time spike up (about 10x) when I
go from reducing a total of 4096 KB to 8192 KB.
Just wondering if anyone knows what the cause of this behaviour is.
Cooper Burns
2017-09-21 14:10:04 UTC
OK, I tried that (sorry for the delay... network issues killed our cluster).

Setting the env variable you suggested changed the results, but all it did was
move the run time spike from between 4 MB and 8 MB to between 32 KB and 64 KB.

The nodes I'm running on *have* InfiniBand, but I think I am running over
Ethernet for these tests.

Any other ideas?

Thanks!
Cooper

Post by Howard Pritchard
Could you rerun your test with the following env. variable set
export OMPI_MCA_coll=self,basic,libnbc
and see if that helps?
Also, what type of interconnect are you using - ethernet, IB, ...?
Gilles Gouaillardet
2017-09-21 14:52:33 UTC
Unless you are using mxm, you can disable the tcp BTL with

mpirun --mca pml ob1 --mca btl ^tcp ...

coll/tuned selects an algorithm based on the communicator size and the message size. The spike could occur because an algorithm that is suboptimal on your cluster and with your job topology is selected.

Note that you can force a given algorithm, or redefine the algorithm selection rules.
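
For example, a way to experiment with this (the algorithm number and the
benchmark binary name below are placeholders, and the available algorithms
and their numbering depend on the Open MPI version):

# list the tuned allreduce algorithms known to your installation
ompi_info --param coll tuned --level 9 | grep allreduce_algorithm

# force one specific allreduce algorithm for a test run
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 4 \
       -np 96 ./allreduce_bench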

Cheers,

Gilles
Post by Cooper Burns
Setting the env variable you suggested changed the results, but all it did was
move the run time spike from between 4 MB and 8 MB to between 32 KB and 64 KB.
The nodes I'm running on *have* InfiniBand, but I think I am running over
Ethernet for these tests.
Any other ideas?